Greenplum, Pivotal HD + Spark, or HAWQ for TBs of Structured Data?

Submitted by 狂风中的少年 on 2019-12-04 13:35:35

Before posting my answer, I want to rephrase the question based on my understanding (to make sure I understand the question correctly) as follows:

You have TBs of data that fits the relational DB model well, and you want to query the data using SQL most of the time (I think that's why you put it into Greenplum DB), but sometimes you want to use Spark and MapReduce to access the data because of their flexibility.

If my understanding is correct, I strongly recommend giving HAWQ a try. Several of its features fit your requirements well (note: I may be biased, since I am a HAWQ developer).

First of all, HAWQ is a SQL-on-Hadoop database, which means it uses HDFS as its datastore; it does not keep Greenplum DB's native storage layer.

Secondly, it is hard to argue with the claim that HDFS is going to be less efficient than Greenplum's native storage, but the performance gap is not as large as you might think. We have done some optimizations for accessing HDFS data; for example, if a data block is stored locally, we read it directly from disk rather than through the normal RPC calls.

Thirdly, HAWQ has a feature named HAWQ InputFormat for MapReduce (which Greenplum DB doesn't have). With it, you can write Spark and MapReduce code that accesses HAWQ data easily and efficiently. Unlike the DBInputFormat provided by Hadoop (which makes the master a performance bottleneck, since all the data flows through it first), HAWQ InputFormat for MapReduce lets your Spark and MapReduce code read the HAWQ data stored in HDFS directly. It is fully distributed, and thus very efficient; a rough sketch of how this looks from Spark follows at the end of this answer.

Lastly, of course, you can still use SQL to query your data with HAWQ, just as you do with Greenplum DB.
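
To make the third point more concrete, here is a rough sketch of reading a HAWQ table from Spark through the Hadoop InputFormat API. The class and method names used below (com.pivotal.hawq.mapreduce.HAWQInputFormat, HAWQRecord, and the setInput signature), as well as the connection placeholders, are assumptions based on the HAWQ MapReduce InputFormat documentation, so verify them against the docs for your HAWQ version before relying on them.

// Sketch only: reads a HAWQ table as an RDD via the HAWQ InputFormat.
// HAWQInputFormat, HAWQRecord and the setInput signature are assumptions taken from
// the HAWQ MapReduce InputFormat docs; check them for your HAWQ version.
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}
import com.pivotal.hawq.mapreduce.{HAWQInputFormat, HAWQRecord}

object HawqInputFormatSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("HawqInputFormatSketch"))

  val hadoopConf = new Configuration()
  // Point the InputFormat at the HAWQ master and the table to read
  // (assumed signature: conf, db_url, username, password, table_name; placeholders throughout).
  HAWQInputFormat.setInput(hadoopConf, "servername:5432/databasename", "username", "password", "tablename")

  // Each split is read directly from the table's files in HDFS by the workers,
  // not funneled through the master (unlike Hadoop's DBInputFormat).
  val rows = sc.newAPIHadoopRDD(
    hadoopConf,
    classOf[HAWQInputFormat],
    classOf[Void],        // assumed key class
    classOf[HAWQRecord])  // assumed value class

  println(s"rows read: ${rows.count()}")
  sc.stop()
}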

Have you tried using the Spark JDBC connector to read the Greenplum data?

Use the partitionColumn, lowerBound, upperBound, and numPartitions options to split the Greenplum table read across multiple Spark workers.

For example:

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object SparkGreenplumApp extends App {
  val conf = new SparkConf().setAppName("SparkGreenplumTest")
  val sparkContext = new SparkContext(conf)
  val sqlContext = new SQLContext(sparkContext)
  import sqlContext.implicits._

  // Spark splits the read into numPartitions range queries over partitionColumn,
  // so each worker pulls its own slice of the Greenplum table in parallel.
  // (sqlContext.load is the older API; on Spark 1.4+ you can use
  // sqlContext.read.format("jdbc").options(...).load() instead.)
  val df = sqlContext.load("jdbc", Map(
    "url" -> "jdbc:postgresql://servername:5432/databasename?user=username&password=*******",
    // "your_table" is a placeholder -- the original subquery was missing its FROM clause.
    "dbtable" -> "(select col, col2, col3 from your_table where datecol > '2017-01-01' and datecol < '2017-02-02') as events",
    "partitionColumn" -> "ID",
    "lowerBound" -> "100",
    "upperBound" -> "500",
    "numPartitions" -> "2",
    "driver" -> "org.postgresql.Driver"))
}
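
With these options, Spark opens numPartitions parallel JDBC connections, each running the query with an added predicate covering its slice of the ID range between lowerBound and upperBound, so the read is spread across the workers instead of going through a single connection. Note that lowerBound and upperBound only control how the range is split, not which rows are read: rows with IDs outside the bounds still end up in the first or last partition.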