Does Apache Spark SQL support MERGE clause?

Submitted by 柔情痞子 on 2020-06-16 04:34:21

Question


Does Apache Spark SQL support a MERGE clause similar to Oracle's MERGE SQL clause?

MERGE INTO <table> t
USING (select * from <table1>) s
ON (t.<key> = s.<key>)
WHEN MATCHED THEN UPDATE ...
    DELETE WHERE ...
WHEN NOT MATCHED THEN INSERT ...

Answer 1:


It does, with Delta Lake as the storage format: df.write.format("delta").save("/data/events"). The upsert is then expressed through the DeltaTable API:

import io.delta.tables._

// Target Delta table, aliased as "events"
DeltaTable.forPath(spark, "/data/events/")
  .as("events")
  // Join condition between the target and the updates DataFrame
  .merge(
    updatesDF.as("updates"),
    "events.eventId = updates.eventId")
  // Matching rows: update the data column
  .whenMatched
  .updateExpr(
    Map("data" -> "updates.data"))
  // Non-matching rows: insert a new record
  .whenNotMatched
  .insertExpr(
    Map(
      "date" -> "updates.date",
      "eventId" -> "updates.eventId",
      "data" -> "updates.data"))
  .execute()
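
For reference, newer Delta Lake releases (0.7.0 and later, on Spark 3.x) also accept the same upsert as a SQL MERGE INTO statement. A minimal sketch, assuming the events table and an updates view are already registered:

spark.sql("""
  MERGE INTO events
  USING updates
  ON events.eventId = updates.eventId
  WHEN MATCHED THEN UPDATE SET data = updates.data
  WHEN NOT MATCHED THEN INSERT (date, eventId, data)
    VALUES (updates.date, updates.eventId, updates.data)
""")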

You also need the delta package:

<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-core_2.11</artifactId>
  <version>xxxx</version>
</dependency>

See https://docs.delta.io/0.4.0/delta-update.html for more details.




Answer 2:


It does not. As of now (this might change in the future), Spark doesn't support UPDATE, DELETE, or any other form of record modification.

It can only overwrite existing storage (with different implementations depending on the source) or append with a plain INSERT.
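
A common workaround under that constraint is to emulate the merge yourself: read the current data, union it with the updates, keep the newest row per key, and overwrite. A minimal sketch, assuming hypothetical paths and a schema with an "id" key and a "ts" timestamp column:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Hypothetical inputs: both sides share the same schema
val current = spark.read.parquet("/data/current")
val updates = spark.read.parquet("/data/updates")

// Keep only the most recent row for each id
val w = Window.partitionBy("id").orderBy(col("ts").desc)
val merged = current.union(updates)
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")

// Write to a new location and swap afterwards; overwriting the
// same path you are still reading from is unsafe
merged.write.mode("overwrite").parquet("/data/current_new")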




Answer 3:


If you are working with Spark, this answer may help you deal with the merge problem using DataFrames.

In any case, according to Hortonworks documentation, the MERGE statement is supported in Apache Hive 0.14 and later.
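
For illustration, Hive's MERGE follows the same shape as Oracle's. A minimal sketch, assuming hypothetical target and source tables keyed on id (note that the target must be a transactional/ACID table):

MERGE INTO target t
USING source s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET name = s.name
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.name);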




Answer 4:


You can also write your own custom code. The code below does a plain JDBC insert; you can edit it to perform a merge instead (see the note after the code). Be aware that this is a computation-heavy operation, but it gets the job done.

import java.sql.{Connection, DriverManager, PreparedStatement}

df.rdd.coalesce(2).foreachPartition(partition => {
  // brConnect is a broadcast variable holding the JDBC connection properties
  val connectionProperties = brConnect.value
  val jdbcUrl = connectionProperties.getProperty("jdbcurl")
  val user = connectionProperties.getProperty("user")
  val password = connectionProperties.getProperty("password")
  val driver = connectionProperties.getProperty("Driver")
  Class.forName(driver)

  val dbc: Connection = DriverManager.getConnection(jdbcUrl, user, password)
  dbc.setAutoCommit(false) // commit once per batch instead of per row
  val dbBatchSize = 1000

  // Prepare the statement once and reuse it for every batch
  val sqlString = "INSERT INTO employee (id, fname, lname, userid) VALUES (?, ?, ?, ?)"
  val pstmt: PreparedStatement = dbc.prepareStatement(sqlString)

  partition.grouped(dbBatchSize).foreach { batch =>
    batch.foreach { row =>
      // Adjust the getAs types to match your actual schema
      pstmt.setLong(1, row.getAs[Long]("id"))
      pstmt.setString(2, row.getAs[String]("fname"))
      pstmt.setString(3, row.getAs[String]("lname"))
      pstmt.setString(4, row.getAs[String]("userid"))
      pstmt.addBatch()
    }
    // Send the whole batch in one round trip, then commit it
    pstmt.executeBatch()
    dbc.commit()
  }
  pstmt.close()
  dbc.close()
})
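
To turn the insert into a merge, swap sqlString for your database's upsert form. For example, a hypothetical MySQL variant:

// Hypothetical MySQL upsert; other databases use MERGE or INSERT ... ON CONFLICT
val sqlString =
  "INSERT INTO employee (id, fname, lname, userid) VALUES (?, ?, ?, ?) " +
  "ON DUPLICATE KEY UPDATE fname = VALUES(fname), lname = VALUES(lname), userid = VALUES(userid)"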


Source: https://stackoverflow.com/questions/46613907/does-apache-spark-sql-support-merge-clause
