Graph processing in query languages using UDAFs

Introduction

Processing graphs is a fundamental problem area that has applications in several domains ex: networks. One of the building-block problems in graph processing is “finding connected components (sets of fully connected nodes)”.

It is not straight forward to express graph problems as structured queries unless the database itself exposes primitives for graph processing ex: graph oriented databases like neo4j might do this .

In this article I will attempt to present that graph processing problems can be expressed as queries as long as the database (or the datawarehouse) allows extending its core functionality through UDAF(User Defined Aggregation Function).

I will use Hadoop/Hive as the data warehouse system and use “finding connected components” as the illustrative graph problem. I will try to build code, examine results, make observations for future work and list few optimizations.

Problem Definition

Given the following edges in the graph

N1,N3
N1,N5
N2,N1
N2,N6
N7,N8
N10,N11
N20,N4
N9,N4

Find connected components as below in SQL

[[“N1″,”N3″,”N5″,”N2″,”N6”],[“N7″,”N8”],[“N10″,”N11”],[“N20″,”N4″,”N9”]]

Solution Strategy

Hadoop-Hive ecosystem allows for distributed processing of data using SQL-like language – HQL.
Transitive closure of unlimited depth arent possible in SQL directly but
UDAFs allow for processing edges row-by-row and accumulate partial results and merge them later to generate final results.
QuickUnion, WeightedQuickUnion, WeightedQuickUnion With path compression are algorithms that can help identify connected components faster.

Solution and Result

UDAF : List<List<Text>> components(column1, column2)

Query : “Select components(node1, node2) from edges”

Code and log

Code and log with optimization 1 (Use Maps instead of List to speedup “contains” checks and avoid nested loops)

Observations for future work

Other possibilities

Transitive closure problems in SQL can be resolved using this approach ex: finding top manager for a given employee, finding oldest ancestor.

Optimizations

Use Maps instead of List to speedup “contains” checks and avoid nested loops. This would give a result like this –

{“N9”:[“N20″,”N4″,”N9″],”N5”:[“N1″,”N3″,”N5″,”N2″,”N6″],”N6”:[“N1″,”N3″,”N5″,”N2″,”N6″],”N7”:[“N7″,”N8″],”N8”:[“N7″,”N8″],”N1”:[“N1″,”N3″,”N5″,”N2″,”N6″],”N2”:[“N1″,”N3″,”N5″,”N2″,”N6″],”N3”:[“N1″,”N3″,”N5″,”N2″,”N6″],”N20”:[“N20″,”N4″,”N9″],”N4”:[“N20″,”N4″,”N9″],”N10”:[“N10″,”N11″],”N11”:[“N10″,”N11”]}

The final result can still be presented as a List<List<Text>>> as above, or in some other form, by modifying the final terminate() method.

Use WeightedQuickUnion with path compression till its time to terminate partial processing and then covert them to Map<Text, List<Text>>.

Conclusion

UDAFs can be leveraged to build graph processing primitives in database or data warehouse that provide features to support UDAFs.

Some databases that support UDAFs are

References

Union Find algorithms – http://algs4.cs.princeton.edu/15uf/
Transitive Closure – http://en.wikipedia.org/wiki/Transitive_closure
Aggregation functions in Oracle –http://docs.oracle.com/cd/B28359_01/appdev.111/b28425/aggr_functions.htm
Aggregation functions in MSSQL – http://msdn.microsoft.com/en-us/library/ms131051.aspx
Aggregation functions in PostgreSQL –http://www.postgresql.org/docs/9.1/static/xaggr.html
Neo4J DFS – StackOverflow Question –http://stackoverflow.com/questions/18112240/searching-for-strongly-connected-components-using-neo4j