[A presentation on this topic is here]
Why HAWQ?
- In-memory, distributed SQL datastore that allows transactions (for catalog data: via PostgreSQL; for user data: tables are append-only), delivered as a first-class Hadoop service with out-of-the-box integration with HDFS, Hive, and HBase.
- Around 40x faster than Hive, which has itself already seen significant performance gains (35x-45x) through the Stinger initiative.
- Open source Apache project, with commercial support available (Hortonworks).
- Hive/Tez remains suited to batch processing (alternatives being MapReduce and Pig) and interactive queries, while HAWQ is a strong candidate for hosting reporting schemas (fact tables) and serving as a backend for BI tools.
Below are a few use cases that illustrate HAWQ and its integration features. (Sample code and results are linked.)
Prerequisites
- Install HAWQ and PXF on a Hadoop cluster (the installation section below covers this).
- Reasonable specs on each segment node (e.g., 4 cores, 20 GB RAM).
Use case – 0 – Transactions
- Sample SQL with results is here.
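As a minimal sketch of what transactional behavior looks like in HAWQ (table and column names here are hypothetical, not from the linked sample): catalog operations and append-only inserts can be wrapped in a transaction block, while UPDATE/DELETE on user tables are not supported.

    -- Hypothetical table; HAWQ user tables are append-only.
    CREATE TABLE orders (
        order_id   BIGINT,
        amount     NUMERIC(10,2),
        order_date DATE
    ) DISTRIBUTED RANDOMLY;

    BEGIN;
    INSERT INTO orders VALUES (1, 99.50, '2016-10-01');
    INSERT INTO orders VALUES (2, 12.75, '2016-10-02');
    COMMIT;  -- a ROLLBACK here would discard both inserts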
Use case – 1 – HDFS – HAWQ
- HAWQ external table over the HDFS location of an existing compressed file, via PXF.
- Sample code to load data into HDFS, create HAWQ external tables, and verify them is here; results are here.
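A minimal sketch of such an external table, assuming hypothetical host, path, and column names (51200 is the default PXF port in HDB 2.0; the HdfsTextSimple profile reads delimited text, including common compressed formats such as gzip):

    CREATE EXTERNAL TABLE order_lineitems_hdfs (
        order_id    TEXT,
        item_id     TEXT,
        item_price  NUMERIC(15,2)
    )
    LOCATION ('pxf://namenode:51200/retail_demo/order_lineitems/?PROFILE=HdfsTextSimple')
    FORMAT 'TEXT' (DELIMITER E'\t');

    SELECT COUNT(*) FROM order_lineitems_hdfs;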
Use case – 2 – Hive – HAWQ
- HAWQ external table over an existing Hive table.
- Sample code to create and load Hive tables, create the corresponding HAWQ external tables, and verify them is here; results are here.
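A minimal sketch, again with hypothetical names; the Hive profile picks up the table definition from the Hive metastore, so the HAWQ column list just needs compatible types:

    CREATE EXTERNAL TABLE order_lineitems_hive (
        order_id    TEXT,
        item_id     TEXT,
        item_price  NUMERIC(15,2)
    )
    LOCATION ('pxf://namenode:51200/retail_demo.order_lineitems?PROFILE=Hive')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');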
Use case – 3 – HBase – HAWQ
- HAWQ external table over an existing HBase table.
- Sample code to create and load an HBase table, create the corresponding HAWQ external table, and verify it is here; results are here (look for ‘order_lineitems_hawq=512071’).
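A minimal sketch with hypothetical names: the reserved recordkey column maps to the HBase row key, and other columns are referenced as quoted "column-family:qualifier" pairs:

    CREATE EXTERNAL TABLE order_lineitems_hbase (
        recordkey           TEXT,
        "cf1:item_id"       TEXT,
        "cf1:item_price"    NUMERIC(15,2)
    )
    LOCATION ('pxf://namenode:51200/order_lineitems?PROFILE=HBase')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');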
Installation
- In HDP 2.4: try this for the sandbox; the same should work in clustered mode with a few modifications.
- In HDP 2.5: HAWQ runs on Slider.
- If you attempted to install HAWQ on 2.4, remove the HAWQ service, upgrade to HDP 2.5, and then add the service back; this time it will also auto-suggest/select Slider. Ensure you install the master on a node that doesn’t have leftover master binaries from previous attempts.
- Install the PXF (HAWQ Extension Framework) service through Ambari (the section below covers it).
PXF
- PXF installation in HDP 2.5 currently fails due to a missing pxf-json RPM. Go ahead and install PXF; it will set up the other RPMs, and the steps below fix the pxf-json issue.
- Build the pxf-json RPM manually from the source here:
- git checkout 2.0.0.0-incubating
- cd pxf
- ./gradlew clean build buildRpm -x test
- Install the pxf-json RPM on the NameNode and all slave nodes.
- Restart the PXF services.
Further Extensions
For use cases 1, 2, and 3:
- CTAS, with filters, on the above external tables to create new tables (see the sketch after this list).
- Perform selects and joins with the new tables.
- Generate reports on the new tables and run interactive queries.
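As a sketch building on the hypothetical external tables above: a filtered CTAS materializes external data into a native HAWQ table, which can then be joined and queried interactively.

    -- Materialize a filtered subset of the external HDFS table.
    CREATE TABLE big_ticket_items AS
        SELECT order_id, item_id, item_price
        FROM   order_lineitems_hdfs
        WHERE  item_price > 100.00
    DISTRIBUTED BY (order_id);

    -- Join the new native table against the Hive-backed external table.
    SELECT b.order_id, COUNT(*) AS line_items
    FROM   big_ticket_items b
    JOIN   order_lineitems_hive h ON b.order_id = h.order_id
    GROUP  BY b.order_id;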
References
- HAWQ: A Massively Parallel Processing SQL Engine in Hadoop
- Apache HAWQ: Next Step in Massively Parallel Processing
- http://hdb.docs.pivotal.io/archives/hdb-docs-2.0.pdf
- https://community.hortonworks.com/articles/39997/installing-hawq-on-2400-hortonworks-sandbox.html
- https://github.com/apache/incubator-hawq/tree/master/pxf
- https://github.com/pivotalsoftware/pivotal-samples
- http://pivotalhd-210.docs.pivotal.io/tutorial/getting-started/hawq/pxf-external-tables.html
- http://pivotalhd-210.docs.pivotal.io/doc/2010/HAWQDataTypes.html
- http://pivotalhd-210.docs.pivotal.io/tutorial/getting-started/hawq/pxf-hbase-external-tables.html
- https://blog.pivotal.io/pivotal-cloud-foundry/products/java-reporting-engine-is-now-available-on-cloud-foundry-via-jasperreports
This post was later published on LinkedIn here.