Processing graphs is a fundamental problem area with applications in several domains, e.g., networks. One of the building-block problems in graph processing is finding connected components: sets of nodes that are mutually reachable through some path.
It is not straightforward to express graph problems as structured queries unless the database itself exposes primitives for graph processing; graph-oriented databases such as Neo4j expose such primitives.
In this article I will show that graph processing problems can be expressed as queries as long as the database (or data warehouse) allows extending its core functionality through a UDAF (User-Defined Aggregation Function).
I will use Hadoop/Hive as the data warehouse system and "finding connected components" as the illustrative graph problem. I will build the code, examine the results, make observations for future work, and list a few optimizations.
Given a set of edges (pairs of nodes) in a graph, find the connected components in SQL.
- Union Find and Percolation
- Transitive closures in SQL
- WITH RECURSIVE in SQL
- The Hadoop/Hive ecosystem allows for distributed processing of data using an SQL-like language, HQL.
- Transitive closures of unlimited depth are not possible in SQL directly, but
- UDAFs allow for processing edges row by row, accumulating partial results, and merging them later to generate the final result.
- QuickUnion, WeightedQuickUnion, and WeightedQuickUnion with path compression are union-find algorithms that can help identify connected components faster.
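As a refresher, here is a minimal Java sketch of weighted quick-union with path compression (class and method names are illustrative, not the article's actual code; see the algs4 reference at the end for the canonical version):

```java
// Weighted quick-union with path compression (cf. Sedgewick, algs4).
// Names here are illustrative, not the article's actual code.
class WeightedQuickUnionUF {
    private final int[] parent; // parent[i] = parent of node i
    private final int[] size;   // size[i]   = size of tree rooted at i

    WeightedQuickUnionUF(int n) {
        parent = new int[n];
        size = new int[n];
        for (int i = 0; i < n; i++) {
            parent[i] = i;
            size[i] = 1;
        }
    }

    // Root of p's tree, halving the path as we walk it (path compression).
    int find(int p) {
        while (p != parent[p]) {
            parent[p] = parent[parent[p]];
            p = parent[p];
        }
        return p;
    }

    boolean connected(int p, int q) {
        return find(p) == find(q);
    }

    // Merge the components of p and q, attaching the smaller tree to the larger.
    void union(int p, int q) {
        int rootP = find(p);
        int rootQ = find(q);
        if (rootP == rootQ) return;
        if (size[rootP] < size[rootQ]) {
            parent[rootP] = rootQ;
            size[rootQ] += size[rootP];
        } else {
            parent[rootQ] = rootP;
            size[rootP] += size[rootQ];
        }
    }
}
```

Weighting keeps the trees shallow and path compression flattens them further, which is what makes the "merge partial results" step cheap.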
Solution and Result
UDAF : List<List<Text>> components(column1, column2)
Query : “SELECT components(node1, node2) FROM edges”
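Hive evaluates a UDAF in phases: consume a row, merge partial aggregations produced by different tasks, and produce the final result. The following dependency-free Java sketch mirrors that iterate/merge/terminate shape for this problem, using String in place of Hive's Text (all names are illustrative, not the article's actual code):

```java
import java.util.*;

// Illustrative stand-in for a Hive UDAF buffer: same iterate/merge/terminate
// shape as a GenericUDAFEvaluator, minus the Hive ObjectInspector plumbing.
class ComponentsAggregator {
    // Partial state: a list of components, each a list of node names.
    private final List<List<String>> components = new ArrayList<>();

    // iterate(): called once per input row (one edge).
    void iterate(String node1, String node2) {
        List<String> c1 = findComponent(node1);
        List<String> c2 = findComponent(node2);
        if (c1 == null && c2 == null) {
            components.add(new ArrayList<>(Arrays.asList(node1, node2)));
        } else if (c1 == null) {
            c2.add(node1);
        } else if (c2 == null) {
            c1.add(node2);
        } else if (c1 != c2) { // edge bridges two components: merge them
            c1.addAll(c2);
            components.remove(c2);
        }
    }

    // merge(): fold in a partial result produced by another task.
    void merge(ComponentsAggregator other) {
        for (List<String> component : other.components) {
            for (int i = 1; i < component.size(); i++) {
                iterate(component.get(0), component.get(i));
            }
        }
    }

    // terminate(): emit the final List<List<String>>.
    List<List<String>> terminate() {
        return components;
    }

    // O(n) membership scan; optimization 1 replaces this with a Map lookup.
    private List<String> findComponent(String node) {
        for (List<String> c : components) {
            if (c.contains(node)) return c;
        }
        return null;
    }
}
```

The linear `findComponent` scan is exactly the cost that optimization 1 below attacks.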
Code and log
Code and log with optimization 1 (use Maps instead of Lists to speed up “contains” checks and avoid nested loops)
Observations for future work
Related problems that can be solved using the above UDAF
Using UDAFs for graph processing as above makes it possible to leverage SQL for generic graph problems, e.g.:
- finding whether any two nodes, say N3 and N6 in the above case, are connected
- finding whether the graph is fully connected
- finding paths from one node to another, say N3 to N6
- finding the component of the largest size
- incrementally solving any of the above problems, i.e., as new edges get added, solving them by looking only at the new edges; say the edges table is partitioned by day and the above problems need to be solved on a daily basis
- Build a library of graph processing UDAFs
- Transitive closure problems in SQL can be solved using this approach, e.g., finding the top manager for a given employee, or finding the oldest ancestor.
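For instance, once the UDAF's `List<List<Text>>` output is materialized, the first few problems above reduce to simple scans over the component lists. A hypothetical Java helper (names and data are illustrative):

```java
import java.util.*;

// Hypothetical helpers over the UDAF's component-list output.
class GraphQueries {
    // Two nodes are connected iff some component contains both.
    static boolean connected(List<List<String>> components, String a, String b) {
        for (List<String> c : components) {
            if (c.contains(a)) return c.contains(b);
        }
        return false;
    }

    // The graph is fully connected iff everything collapsed into one component.
    static boolean fullyConnected(List<List<String>> components) {
        return components.size() == 1;
    }

    // Component of the largest size.
    static List<String> largest(List<List<String>> components) {
        return Collections.max(components, Comparator.comparingInt(c -> c.size()));
    }
}
```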
- Use Maps instead of Lists to speed up “contains” checks and avoid nested loops. The final result can still be presented as a List<List<Text>> as above, or in some other form, by modifying the final terminate() method.
- Use WeightedQuickUnion with path compression until it is time to terminate partial processing, and only then convert the parent pointers to a Map<Text, List<Text>>.
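A sketch of that second optimization: keep only parent pointers while iterating, and flatten them into a root-to-members Map only when a partial result must be emitted (illustrative Java, with String standing in for Hive's Text):

```java
import java.util.*;

// Illustrative: flatten union-find parent pointers into root -> members,
// the shape a terminatePartial()/terminate() step would emit.
class UnionFindFlattener {
    static Map<String, List<String>> toComponents(Map<String, String> parent) {
        Map<String, List<String>> components = new HashMap<>();
        for (String node : parent.keySet()) {
            // Walk up to the root (a node that is its own parent).
            String root = node;
            while (!parent.get(root).equals(root)) {
                root = parent.get(root);
            }
            components.computeIfAbsent(root, k -> new ArrayList<>()).add(node);
        }
        return components;
    }
}
```

Deferring the flattening keeps each iterate() call near constant time; the O(n) conversion happens once per partial result instead of once per row.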
UDAFs can be leveraged to build graph processing primitives in any database or data warehouse that supports UDAFs.
Some databases that support UDAFs are:
- MS SQL Server 2014
- Oracle 11g
References
- Union Find algorithms – http://algs4.cs.princeton.edu/15uf/
- Transitive Closure – http://en.wikipedia.org/wiki/Transitive_closure
- Aggregation functions in Oracle – http://docs.oracle.com/cd/B28359_01/appdev.111/b28425/aggr_functions.htm
- Aggregation functions in MSSQL – http://msdn.microsoft.com/en-us/library/ms131051.aspx
- Aggregation functions in PostgreSQL – http://www.postgresql.org/docs/9.1/static/xaggr.html
- Neo4j DFS – StackOverflow question – http://stackoverflow.com/questions/18112240/searching-for-strongly-connected-components-using-neo4j