Graph processing in query languages – Part III using Hive and HBase

As they say, it’s never too early/late to catch-up with a friend or a problem. Here’s round 3 on graph problems (part 2, part 1), using Connected Components as illustrative example, with vanilla hive and HBase and one UDF function .

Approach

HBase is great at large number of (on-demand) columns, random writes and reads and maintaining history
Against each node(row key), treat every neighbor as a column with a fixed (can be extended for weighted graphs) value of 1. Include self loop.
Pick node and least neighbor and insert into another hbase table with just node and component-id(least neighbor)
Use joins (based on GIMV) to refine component-ids. Iterate till convergence is achieved. (In several ways join is also similar to a BSP approach in Pregel/Giraph)
Use hive as the language for above steps

Solution is here

Testing

Illustrative log on a small graph is here. Its summary is here
Benchmark has not been done

Features

Incremental processing of graphs is feasible
Should theoretically scale for web scale graphs – sparse or dense.
Though based on GIMV – Matrix Vector multiplication isnt needed. GIMV is a very useful base to model solutions like above for other graph problems that can be solved by it.

References

Previous posts – part 2, part 1
GIMV paper and pegasus project
BSP based approach in Pregel paper
Another GIMV/HBase approach paper

Illustration