Partitioning in spark while reading from RDBMS via JDBC

Submitted by 我的梦境 on 2019-11-26 02:15:04

Question


I am running spark in cluster mode and reading data from RDBMS via JDBC.

As per Spark docs, these partitioning parameters describe how to partition the table when reading in parallel from multiple workers:

  • partitionColumn
  • lowerBound
  • upperBound
  • numPartitions

These are optional parameters.
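As a sketch of how these four parameters fit together, the snippet below builds the option map for a partitioned JDBC read. The URL, table, and column names are hypothetical; Spark splits the range [lowerBound, upperBound] on partitionColumn into numPartitions roughly equal strides, issuing one query per partition (the bounds only define the stride, they do not filter rows).

```python
# Hypothetical connection details -- replace with your own database.
jdbc_url = "jdbc:postgresql://db-host:5432/mydb"

# Partitioning options: Spark issues numPartitions parallel queries,
# each covering one stride of [lowerBound, upperBound] on partitionColumn.
partition_options = {
    "url": jdbc_url,
    "dbtable": "employees",          # hypothetical table name
    "partitionColumn": "id",         # must be numeric, date, or timestamp
    "lowerBound": "1",
    "upperBound": "100000",
    "numPartitions": "8",
}

# With an active SparkSession this would read in 8 parallel tasks:
# df = spark.read.format("jdbc").options(**partition_options).load()
```

Note that rows outside [lowerBound, upperBound] are still read; they simply all land in the first or last partition.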

What would happen if I don't specify these?

  • Does only one worker read the whole dataset?
  • If the read is still parallel, how is the data partitioned?

Answer 1:


If you don't specify either {partitionColumn, lowerBound, upperBound, numPartitions} or {predicates} Spark will use a single executor and create a single non-empty partition. All data will be processed using a single transaction and reads will be neither distributed nor parallelized.
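The predicates alternative mentioned above lets you hand Spark an explicit WHERE clause per partition instead of a numeric range. A minimal sketch, assuming a hypothetical employees table with a hire_date column:

```python
# One WHERE clause per partition; Spark runs one query per predicate,
# so the resulting DataFrame has len(predicates) partitions.
# Table and column names here are hypothetical.
predicates = [
    "hire_date >= '2020-01-01' AND hire_date < '2021-01-01'",
    "hire_date >= '2021-01-01' AND hire_date < '2022-01-01'",
    "hire_date >= '2022-01-01' AND hire_date < '2023-01-01'",
]

# With an active SparkSession:
# df = spark.read.jdbc(
#     url=jdbc_url,
#     table="employees",
#     predicates=predicates,
#     properties={"user": "...", "password": "..."},
# )
# df.rdd.getNumPartitions() would equal len(predicates);
# without predicates or the partitioning options it would be 1.
```

Make sure the predicates are mutually exclusive and jointly cover the table, otherwise rows are duplicated or dropped.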

See also:

  • How to optimize partitioning when migrating data from JDBC source?
  • How to improve performance for slow Spark jobs using DataFrame and JDBC connection?


Source: https://stackoverflow.com/questions/43150694/partitioning-in-spark-while-reading-from-rdbms-via-jdbc
