How to specify subquery in the option “dbtable” in Spark-jdbc application while reading data from a table on Greenplum? [duplicate]

若如初见. Posted on 2019-12-02 13:35:18

The option to replace dbtable with a subquery is a feature of the built-in JDBC data source. However, the Greenplum-Spark Connector doesn't seem to provide such a capability.
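For comparison, this is roughly what the subquery form looks like with the built-in JDBC source: the query is wrapped in parentheses and given an alias, and the resulting string is passed as dbtable. This is a minimal sketch; the object name, the helper names and the alias subq are hypothetical and exist only for illustration.

```scala
// Hypothetical helpers (not part of Spark) that prepare options for the built-in JDBC source.
object JdbcSubquery {

  // Wraps a SELECT statement as "(query) AS alias" so it can stand in for a table name.
  // A trailing semicolon is dropped because it would be invalid inside the parentheses.
  def asDerivedTable(query: String, alias: String): String =
    s"(${query.trim.stripSuffix(";")}) AS $alias"

  // Builds the option map for format("jdbc"); "subq" is an arbitrary alias.
  def readerOptions(url: String, user: String, password: String, query: String): Map[String, String] =
    Map(
      "url"      -> url,
      "user"     -> user,
      "password" -> password,
      "dbtable"  -> asDerivedTable(query, "subq")
    )
}

// With a SparkSession named spark in scope, the map would be used like this:
//   val df = spark.read.format("jdbc")
//     .options(JdbcSubquery.readerOptions(conUrl, devUsrName, devPwd, "SELECT ... FROM ... WHERE ..."))
//     .load()
```

This only works with format("jdbc"). The Greenplum-Spark Connector, by contrast, expects a plain table name.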

Specifically, the source is identified by dbschema and dbtable, and the documentation describes the latter as:

The name of the Greenplum Database table. When reading from Greenplum Database, this table must reside in the Greenplum Database schema identified in the dbschema option value.

This explains the exception you get: a subquery is not the name of a table in the given schema.

At the same time, nothing in the code you've shared indicates that you actually need such a feature. Since you don't apply any database-specific logic, the process can simply be rewritten as:

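import org.apache.spark.sql.functions.{col, lit}

// Names of the columns to keep; left unspecified here.
val allColumns: Seq[String] = ???

// conUrl, devUsrName, devPwd and flagCol are assumed to be defined as in the question's code.
val dataDF = spark.read.format("greenplum")
  .option("url", conUrl)
  .option("dbtable", "xx_lines")        // plain table name, not a subquery
  .option("dbschema", "dbanscience")
  .option("partitionColumn", "id")
  .option("user", devUsrName)
  .option("password", devPwd)
  .load()
  .where("year = 2017 AND month = 12")  // filter expressed in Spark instead of in a subquery
  .select(allColumns.map(col): _*)
  .withColumn(flagCol, lit(0))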

Please note that the other options you use (upperBound, lowerBound, numPartitions) are neither supported nor required.

According to the official documentation:

Greenplum Database stores table data across segments. A Spark application using the Greenplum-Spark Connector to load a Greenplum Database table identifies a specific table column as a partition column. The Connector uses the data values in this column to assign specific table data rows on each Greenplum Database segment to one or more Spark partitions.

So, as you can see, the distribution mechanism is completely different from that of the built-in JDBC source.
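To make the contrast concrete, here is a simplified illustration of what the built-in JDBC source does with partitionColumn, lowerBound, upperBound and numPartitions: it splits the range into strides and issues one query per stride. This is a sketch of the idea, not Spark's actual implementation (which differs in details such as rounding), and the object and method names are hypothetical.

```scala
// Simplified model of stride-based partitioning in the built-in JDBC source.
object JdbcRangeSketch {

  // Returns one WHERE predicate per partition for the given numeric column.
  def predicates(column: String, lower: Long, upper: Long, numPartitions: Int): List[String] = {
    require(numPartitions >= 1, "numPartitions must be positive")
    require(lower < upper, "lowerBound must be smaller than upperBound")

    val stride = (upper - lower) / numPartitions
    if (numPartitions == 1 || stride == 0) {
      // Nothing to split: a single partition reads every row.
      List("1 = 1")
    } else {
      // Interior boundaries between consecutive partitions.
      val bounds = (1 until numPartitions).map(i => lower + i * stride)
      // The first and last partitions are open-ended, so the bounds never filter rows out.
      val first  = s"$column < ${bounds.head} OR $column IS NULL"
      val middle = bounds.sliding(2).collect {
        case Seq(lo, hi) => s"$column >= $lo AND $column < $hi"
      }.toList
      val last   = s"$column >= ${bounds.last}"
      (first :: middle) :+ last
    }
  }
}
```

For example, JdbcRangeSketch.predicates("id", 0, 100, 4) yields four predicates with boundaries at 25, 50 and 75. The Greenplum-Spark Connector needs none of these bounds because it assigns rows to Spark partitions per Greenplum segment.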

The Connector also provides an additional partitionsPerSegment option, which sets:

The number of Spark partitions per Greenplum Database segment. Optional, the default value is 1 partition.
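A minimal sketch of how this option could be set alongside the others. The option names are the ones used in this answer; the object name, the helper names and the segment count in the example are hypothetical, and the partition estimate simply follows from the definition quoted above.

```scala
// Hypothetical helper that collects the Connector options used in this answer.
object GreenplumReadOptions {

  def build(
      url: String,
      schema: String,
      table: String,
      partitionColumn: String,
      user: String,
      password: String,
      partitionsPerSegment: Int = 1 // the documented default
  ): Map[String, String] = {
    require(partitionsPerSegment >= 1, "partitionsPerSegment must be positive")
    Map(
      "url"                  -> url,
      "dbschema"             -> schema,
      "dbtable"              -> table,
      "partitionColumn"      -> partitionColumn,
      "user"                 -> user,
      "password"             -> password,
      "partitionsPerSegment" -> partitionsPerSegment.toString
    )
  }

  // Expected number of Spark partitions, going by the option's definition.
  def expectedPartitions(segments: Int, partitionsPerSegment: Int): Int =
    segments * partitionsPerSegment
}

// With a SparkSession named spark in scope, the map would be used like this:
//   val opts = GreenplumReadOptions.build(conUrl, "dbanscience", "xx_lines", "id",
//                                         devUsrName, devPwd, partitionsPerSegment = 4)
//   val df = spark.read.format("greenplum").options(opts).load()
```

On a cluster with, say, 8 segments, a value of 4 would be expected to give 32 Spark partitions.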
