Does Apache Spark SQL support MERGE clause?

Submitted by 柔情痞子 on 2020-06-16 04:34:21

Question


Does Apache Spark SQL support a MERGE clause similar to Oracle's MERGE SQL clause?

MERGE INTO <table> t
USING (select * from <table1>) s
ON (t.<key> = s.<key>)
WHEN MATCHED THEN UPDATE ...
    DELETE WHERE ...
WHEN NOT MATCHED THEN INSERT ...

Answer 1:


It does, with Delta Lake as the storage format: df.write.format("delta").save("/data/events"). The upsert is then expressed through the DeltaTable API:

import io.delta.tables._

// Target Delta table, aliased as "events"
DeltaTable.forPath(spark, "/data/events/")
  .as("events")
  // Join condition between the target and the updates DataFrame
  .merge(
    updatesDF.as("updates"),
    "events.eventId = updates.eventId")
  // Matching rows: update the data column
  .whenMatched
  .updateExpr(
    Map("data" -> "updates.data"))
  // Non-matching rows: insert a new record
  .whenNotMatched
  .insertExpr(
    Map(
      "date" -> "updates.date",
      "eventId" -> "updates.eventId",
      "data" -> "updates.data"))
  .execute()
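
For reference, newer Delta Lake releases (0.7.0 and later, on Spark 3.x) also accept the same upsert as a SQL MERGE INTO statement. A minimal sketch, assuming the events table and an updates view are already registered:

spark.sql("""
  MERGE INTO events
  USING updates
  ON events.eventId = updates.eventId
  WHEN MATCHED THEN UPDATE SET data = updates.data
  WHEN NOT MATCHED THEN INSERT (date, eventId, data)
    VALUES (updates.date, updates.eventId, updates.data)
""")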

You also need the delta package:

<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-core_2.11</artifactId>
  <version>xxxx</version>
</dependency>

See https://docs.delta.io/0.4.0/delta-update.html for more details.




Answer 2:


It does not. As of now (this might change in the future), Spark doesn't support UPDATE, DELETE, or any other form of record modification.

It can only overwrite existing storage (with different implementations depending on the source) or append with a plain INSERT.
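
A common workaround under that constraint is to emulate the merge yourself: read the current data, union it with the updates, keep the newest row per key, and overwrite. A minimal sketch, assuming hypothetical paths and a schema with an "id" key and a "ts" timestamp column:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Hypothetical inputs: both sides share the same schema
val current = spark.read.parquet("/data/current")
val updates = spark.read.parquet("/data/updates")

// Keep only the most recent row for each id
val w = Window.partitionBy("id").orderBy(col("ts").desc)
val merged = current.union(updates)
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")

// Write to a new location and swap afterwards; overwriting the
// same path you are still reading from is unsafe
merged.write.mode("overwrite").parquet("/data/current_new")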




Answer 3:


If you are working with Spark, this answer may help you deal with the merge problem using DataFrames.

In any case, according to Hortonworks documentation, the MERGE statement is supported in Apache Hive 0.14 and later.
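
For illustration, Hive's MERGE follows the same shape as Oracle's. A minimal sketch, assuming hypothetical target and source tables keyed on id (note that the target must be a transactional/ACID table):

MERGE INTO target t
USING source s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET name = s.name
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.name);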




Answer 4:


You can also write your own custom code. The code below does a plain JDBC insert; you can edit it to perform a merge instead (see the note after the code). Be aware that this is a computation-heavy operation, but it gets the job done.

import java.sql.{Connection, DriverManager, PreparedStatement}

df.rdd.coalesce(2).foreachPartition(partition => {
  // brConnect is a broadcast variable holding the JDBC connection properties
  val connectionProperties = brConnect.value
  val jdbcUrl = connectionProperties.getProperty("jdbcurl")
  val user = connectionProperties.getProperty("user")
  val password = connectionProperties.getProperty("password")
  val driver = connectionProperties.getProperty("Driver")
  Class.forName(driver)

  val dbc: Connection = DriverManager.getConnection(jdbcUrl, user, password)
  dbc.setAutoCommit(false) // commit once per batch instead of per row
  val dbBatchSize = 1000

  // Prepare the statement once and reuse it for every batch
  val sqlString = "INSERT INTO employee (id, fname, lname, userid) VALUES (?, ?, ?, ?)"
  val pstmt: PreparedStatement = dbc.prepareStatement(sqlString)

  partition.grouped(dbBatchSize).foreach { batch =>
    batch.foreach { row =>
      // Adjust the getAs types to match your actual schema
      pstmt.setLong(1, row.getAs[Long]("id"))
      pstmt.setString(2, row.getAs[String]("fname"))
      pstmt.setString(3, row.getAs[String]("lname"))
      pstmt.setString(4, row.getAs[String]("userid"))
      pstmt.addBatch()
    }
    // Send the whole batch in one round trip, then commit it
    pstmt.executeBatch()
    dbc.commit()
  }
  pstmt.close()
  dbc.close()
})
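
To turn the insert into a merge, swap sqlString for your database's upsert form. For example, a hypothetical MySQL variant:

// Hypothetical MySQL upsert; other databases use MERGE or INSERT ... ON CONFLICT
val sqlString =
  "INSERT INTO employee (id, fname, lname, userid) VALUES (?, ?, ?, ?) " +
  "ON DUPLICATE KEY UPDATE fname = VALUES(fname), lname = VALUES(lname), userid = VALUES(userid)"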


Source: https://stackoverflow.com/questions/46613907/does-apache-spark-sql-support-merge-clause
