In the previous post I attempted to make the case that it is feasible to perform some graph processing, specifically finding the components of a graph given its edges, in query languages using user-defined aggregation functions (UDAFs).
The approach relied on going over the edges one by one and continuously building up components. Each component was stored as a <Key, Value> pair, where the Key was a node and the Value was the array of nodes constituting that node's component.
For example, for a dataset like this:
Below would be the result:
The above approach gives results in a map structure that is easy to join with other tables.
As we noted in the earlier post, this approach can need considerable compute power as the number of edges increases: larger edge counts produce larger components, and merging those ever-growing node arrays pushes the merge method toward O(n^3) complexity, which is considerable.
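To make the cost concrete, here is a minimal sketch of the map-based merge just described (the class and method names are hypothetical, not the code from the previous post). Every node maps to the full list of nodes in its component, and each edge merges two lists and repoints every member of the absorbed list:

```java
import java.util.*;

// Hypothetical sketch of the map-based component building from the previous
// post: each node maps to the array (list) of nodes in its component.
public class MapComponents {
    public static Map<Integer, List<Integer>> components(int[][] edges) {
        Map<Integer, List<Integer>> comp = new HashMap<>();
        for (int[] e : edges) {
            List<Integer> a = comp.computeIfAbsent(e[0], k -> new ArrayList<>(List.of(k)));
            List<Integer> b = comp.computeIfAbsent(e[1], k -> new ArrayList<>(List.of(k)));
            if (a == b) continue;            // both endpoints already share a component
            a.addAll(b);                     // copy the whole second component...
            for (int n : b) comp.put(n, a);  // ...and repoint each of its members
        }
        return comp;
    }
}
```

The `addAll` copy plus the repointing loop both scale with component size, which is why the work grows so quickly as components get large.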
We also noted in the earlier post that a better approach would be to implement the Weighted Quick Union with Path Compression (WQUPC) algorithm, which has several optimizations on top of the map approach. Some of them include:
- Build trees to represent clusters.
- Reduce depth of individual trees as the algorithm progresses through edges.
- Use integer arrays instead of heavier data structures.
- Store each cluster once, with no repetition per node.
NOTE: This approach needs at least one array as big as the number of nodes in the graph.
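The optimizations above can be sketched in plain Java as follows (a standalone illustration of WQUPC, not the Hive code; node ids are assumed to be 0..n-1):

```java
// Weighted Quick Union with Path Compression (WQUPC) over plain int arrays.
// Both arrays are sized to the node count, matching the note above.
public class WQUPC {
    private final int[] parent; // parent[i] = parent of node i in its cluster tree
    private final int[] size;   // size[r] = number of nodes in the tree rooted at r

    public WQUPC(int n) {
        parent = new int[n];
        size = new int[n];
        for (int i = 0; i < n; i++) { parent[i] = i; size[i] = 1; }
    }

    public int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]; // path compression: point at grandparent
            x = parent[x];
        }
        return x;
    }

    public void union(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra == rb) return;              // already in the same cluster
        if (size[ra] < size[rb]) { int t = ra; ra = rb; rb = t; }
        parent[rb] = ra;                   // weighted: hang smaller tree under larger
        size[ra] += size[rb];
    }

    public boolean connected(int a, int b) { return find(a) == find(b); }
}
```

Path compression flattens trees as `find` walks them, and the weighting rule keeps trees shallow as edges are processed.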
There is one challenge in implementing the above approach in a horizontally scaled cluster setup, e.g. the Hadoop-Hive ecosystem: sets of edges are examined in parallel, so partial results need to be merged.
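One way to handle that merge (a hypothetical helper, not the actual UDAF code below) is to treat each partial result as a node-to-root map and replay each (node, root) pair as a union into an accumulator. This is a sketch under the assumption that each partition emits such a map:

```java
import java.util.*;

// Hypothetical merge step for partial results: each partial result maps
// node -> root as found on one partition of the edges.
public class PartialMerge {
    // find with path compression over a map-based parent structure
    static int find(Map<Integer, Integer> parent, int x) {
        parent.putIfAbsent(x, x);
        while (parent.get(x) != x) {
            parent.put(x, parent.get(parent.get(x))); // compress toward the root
            x = parent.get(x);
        }
        return x;
    }

    static void union(Map<Integer, Integer> parent, int a, int b) {
        int ra = find(parent, a), rb = find(parent, b);
        if (ra != rb) parent.put(rb, ra);
    }

    // Fold one partial node->root map into the accumulator; clusters split
    // across partitions get joined because they share at least one node.
    public static void merge(Map<Integer, Integer> acc, Map<Integer, Integer> partial) {
        for (Map.Entry<Integer, Integer> e : partial.entrySet())
            union(acc, e.getKey(), e.getValue());
    }
}
```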
Below is an attempt at implementing WQUPC as a Hive UDAF.
Below is the result, in the form of <Node, Root of the cluster tree>:
This approach is very close to O(m log n) complexity, where n is the number of nodes and m is the number of edges.
It easily scales to tens of millions of edges.