Bulk data migration through Spark SQL


In your solution, Spark will read the entire table contents into a single partition before it starts writing. One way to avoid that is to partition the reading part, but it requires a numeric, sequential column in the source data:

Dataset<Row> ds = spark.read()
  .format("jdbc")
  .option("url", url)
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "bigdatatable")
  .option("user", "root")
  .option("password", "foobar")
  // Split the read into 64 parallel JDBC queries by slicing NUMERIC_COL
  // into ranges between lowerBound and upperBound.
  .option("partitionColumn", "NUMERIC_COL")
  .option("lowerBound", "1")
  .option("upperBound", "10000")
  .option("numPartitions", "64")
  .load();

In the example above, the column "NUMERIC_COL" must exist in the data and, ideally, should vary uniformly from 1 to 10000. Those are strong requirements and such a column will probably not exist, so you will likely have to either create a view in the database that adds one, or generate it directly in the query (note that I used generic SQL syntax; you will have to adapt it for your DBMS):

String query = "(select mod(row_number() over (), 64) as NUMERIC_COL, t.* from bigdatatable t) as foo";

Dataset<Row> ds = spark.read()
  .format("jdbc")
  .option("url", url)
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", query)
  .option("user", "root")
  .option("password", "foobar")
  // The generated NUMERIC_COL takes values 0..63, so these bounds give
  // exactly one value per partition.
  .option("partitionColumn", "NUMERIC_COL")
  .option("lowerBound", "0")
  .option("upperBound", "63")
  .option("numPartitions", "64")
  .load();
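
With the read partitioned, the migration itself is just writing the Dataset back out over JDBC; each partition is written in parallel. Here is a minimal sketch of that step, assuming a hypothetical target connection string targetUrl and destination table bigdatatable_copy (both names are placeholders, not from the original answer):

import org.apache.spark.sql.SaveMode;

ds.write()
  .format("jdbc")
  .option("url", targetUrl)
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "bigdatatable_copy")
  .option("user", "root")
  .option("password", "foobar")
  // Rows per JDBC batch insert; larger values usually speed up bulk loads.
  .option("batchsize", "10000")
  .mode(SaveMode.Append)
  .save();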