Greenplum, Pivotal HD + Spark, or HAWQ for TBs of Structured Data?

Submitted by 狂风中的少年 on 2019-12-04 13:35:35

Before posting my answer, I want to rephrase the question based on my understanding (to make sure I understand the question correctly) as follows:

You have TBs of data that fits the relational DB model well, and you want to query the data using SQL most of the time (I think that's why you put it into Greenplum DB), but sometimes you want to use Spark and MapReduce to access the data because of their flexibility.

If my understanding is correct, I strongly recommend giving HAWQ a try. Several of its features fit your requirements well (note: I may be biased, since I am a HAWQ developer).

First of all, HAWQ is a SQL-on-Hadoop database, which means it uses HDFS as its datastore; it does not keep Greenplum DB's native storage layer.

Secondly, it is hard to argue with the claim that HDFS is going to be less efficient than Greenplum's native storage, but the performance gap is not as large as you might think. We have done some optimizations for accessing HDFS data; for example, if a data block is stored locally, we read it directly from disk rather than through the normal RPC calls.

Thirdly, HAWQ has a feature named HAWQ InputFormat for MapReduce (which Greenplum DB doesn't have). With it, you can write Spark and MapReduce code that accesses HAWQ data easily and efficiently. Unlike the DBInputFormat provided by Hadoop (which makes the master a performance bottleneck, since all the data flows through it first), HAWQ InputFormat for MapReduce lets your Spark and MapReduce code read the HAWQ data stored in HDFS directly. It is fully distributed, and thus very efficient; a rough sketch of how this looks from Spark follows at the end of this answer.

Lastly, of course, you can still use SQL to query your data with HAWQ, just as you do with Greenplum DB.
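
To make the third point more concrete, here is a rough sketch of reading a HAWQ table from Spark through the Hadoop InputFormat API. The class and method names used below (com.pivotal.hawq.mapreduce.HAWQInputFormat, HAWQRecord, and the setInput signature), as well as the connection placeholders, are assumptions based on the HAWQ MapReduce InputFormat documentation, so verify them against the docs for your HAWQ version before relying on them.

// Sketch only: reads a HAWQ table as an RDD via the HAWQ InputFormat.
// HAWQInputFormat, HAWQRecord and the setInput signature are assumptions taken from
// the HAWQ MapReduce InputFormat docs; check them for your HAWQ version.
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}
import com.pivotal.hawq.mapreduce.{HAWQInputFormat, HAWQRecord}

object HawqInputFormatSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("HawqInputFormatSketch"))

  val hadoopConf = new Configuration()
  // Point the InputFormat at the HAWQ master and the table to read
  // (assumed signature: conf, db_url, username, password, table_name; placeholders throughout).
  HAWQInputFormat.setInput(hadoopConf, "servername:5432/databasename", "username", "password", "tablename")

  // Each split is read directly from the table's files in HDFS by the workers,
  // not funneled through the master (unlike Hadoop's DBInputFormat).
  val rows = sc.newAPIHadoopRDD(
    hadoopConf,
    classOf[HAWQInputFormat],
    classOf[Void],        // assumed key class
    classOf[HAWQRecord])  // assumed value class

  println(s"rows read: ${rows.count()}")
  sc.stop()
}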

Have you tried using the Spark JDBC connector to read the Greenplum data?

Use the partitionColumn, lowerBound, upperBound, and numPartitions options to split the Greenplum table read across multiple Spark workers.

For example:

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object SparkGreenplumApp extends App {
  val conf = new SparkConf().setAppName("SparkGreenplumTest")
  val sparkContext = new SparkContext(conf)
  val sqlContext = new SQLContext(sparkContext)
  import sqlContext.implicits._

  // Spark splits the read into numPartitions range queries over partitionColumn,
  // so each worker pulls its own slice of the Greenplum table in parallel.
  // (sqlContext.load is the older API; on Spark 1.4+ you can use
  // sqlContext.read.format("jdbc").options(...).load() instead.)
  val df = sqlContext.load("jdbc", Map(
    "url" -> "jdbc:postgresql://servername:5432/databasename?user=username&password=*******",
    // "your_table" is a placeholder -- the original subquery was missing its FROM clause.
    "dbtable" -> "(select col, col2, col3 from your_table where datecol > '2017-01-01' and datecol < '2017-02-02') as events",
    "partitionColumn" -> "ID",
    "lowerBound" -> "100",
    "upperBound" -> "500",
    "numPartitions" -> "2",
    "driver" -> "org.postgresql.Driver"))
}
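
With these options, Spark opens numPartitions parallel JDBC connections, each running the query with an added predicate covering its slice of the ID range between lowerBound and upperBound, so the read is spread across the workers instead of going through a single connection. Note that lowerBound and upperBound only control how the range is split, not which rows are read: rows with IDs outside the bounds still end up in the first or last partition.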