Question
Does Apache Spark SQL support a MERGE clause similar to Oracle's MERGE SQL clause?
MERGE INTO <table> USING (
  SELECT * FROM <table1>
) ON (<join condition>)
WHEN MATCHED THEN UPDATE ...
  DELETE WHERE ...
WHEN NOT MATCHED THEN INSERT ...
Answer 1:
It does, with Delta Lake as the storage format: first write the table as Delta, e.g. df.write.format("delta").save("/data/events"), then merge into it:

import io.delta.tables._

DeltaTable.forPath(spark, "/data/events/")
  .as("events")
  .merge(
    updatesDF.as("updates"),
    "events.eventId = updates.eventId")
  .whenMatched
  .updateExpr(
    Map("data" -> "updates.data"))
  .whenNotMatched
  .insertExpr(
    Map(
      "date" -> "updates.date",
      "eventId" -> "updates.eventId",
      "data" -> "updates.data"))
  .execute()
You also need the Delta package:

<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-core_2.11</artifactId>
  <version>xxxx</version>
</dependency>
See https://docs.delta.io/0.4.0/delta-update.html for more details.
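For reference, newer Delta Lake releases (0.7.0+ on Spark 3.x, with the Delta SQL extension configured) also accept the same operation as a SQL MERGE INTO statement. A minimal sketch, reusing the path and column names from the snippet above:

// Sketch: the same upsert expressed as Delta's SQL MERGE INTO.
// Assumes Delta Lake 0.7.0+ and spark.sql.extensions set to
// io.delta.sql.DeltaSparkSessionExtension.
updatesDF.createOrReplaceTempView("updates")

spark.sql("""
  MERGE INTO delta.`/data/events/` AS events
  USING updates
  ON events.eventId = updates.eventId
  WHEN MATCHED THEN
    UPDATE SET events.data = updates.data
  WHEN NOT MATCHED THEN
    INSERT (date, eventId, data) VALUES (updates.date, updates.eventId, updates.data)
""")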
Answer 2:
It does not. As of now (though this may change in the future) Spark doesn't support UPDATE, DELETE, or any other variant of record modification. It can only overwrite existing storage (with a different implementation depending on the source) or append with a plain INSERT.
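In plain Spark, then, the closest you can get to a merge is to rebuild the dataset yourself and overwrite it. A minimal sketch of that workaround; the paths and the eventId key below are invented for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("overwrite-upsert").getOrCreate()

// Emulate an upsert with the only tools plain Spark offers: read, rebuild, overwrite.
val existing = spark.read.parquet("/data/events_plain")
val updates  = spark.read.parquet("/data/events_updates")

// Keep every existing row that has no update, then add the updated/new rows.
val merged = existing
  .join(updates, Seq("eventId"), "left_anti")
  .unionByName(updates)

// Write to a new location: Spark cannot safely overwrite a path it is reading from.
merged.write.mode("overwrite").parquet("/data/events_plain_new")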
Answer 3:
If you are working with Spark, maybe this answer could help you deal with the merge issue using DataFrames.
Anyway, according to some Hortonworks documentation, the MERGE statement is supported in Apache Hive 0.14 and later.
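For illustration, Hive's MERGE runs against ACID (transactional) tables and can be submitted from Scala over Hive JDBC. A hedged sketch; the connection URL, table names, and columns below are all invented for the example:

import java.sql.DriverManager

// Assumes the hive-jdbc driver is on the classpath and that `employee`
// is a transactional (ACID) table, which Hive's MERGE requires.
val conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default")
val stmt = conn.createStatement()
stmt.execute(
  """MERGE INTO employee AS t
    |USING employee_updates AS s
    |ON t.id = s.id
    |WHEN MATCHED THEN UPDATE SET fname = s.fname, lname = s.lname
    |WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.fname, s.lname, s.userid)""".stripMargin)
stmt.close()
conn.close()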
Answer 4:
You can write your own custom code: the code below can be edited to perform a merge instead of a plain insert. Be aware that this is a computation-heavy operation.
import java.sql.{Connection, DriverManager, PreparedStatement}

df.rdd.coalesce(2).foreachPartition(partition => {
  val connectionProperties = brConnect.value
  val jdbcUrl  = connectionProperties.getProperty("jdbcurl")
  val user     = connectionProperties.getProperty("user")
  val password = connectionProperties.getProperty("password")
  val driver   = connectionProperties.getProperty("Driver")
  Class.forName(driver)

  val dbc: Connection = DriverManager.getConnection(jdbcUrl, user, password)
  dbc.setAutoCommit(false) // commit once per batch rather than per row
  val db_batchsize = 1000

  // Prepare the statement once per partition, not once per row.
  val sqlString = "INSERT INTO employee (id, fname, lname, userid) VALUES (?, ?, ?, ?)"
  val pstmt: PreparedStatement = dbc.prepareStatement(sqlString)

  partition.grouped(db_batchsize).foreach(batch => {
    batch.foreach { row =>
      pstmt.setLong(1, row.getAs[Long]("id"))
      pstmt.setString(2, row.getAs[String]("fname"))
      pstmt.setString(3, row.getAs[String]("lname"))
      pstmt.setString(4, row.getAs[String]("userid"))
      pstmt.addBatch()
    }
    pstmt.executeBatch() // send the whole batch in one round trip
    dbc.commit()
  })

  pstmt.close()
  dbc.close()
})
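To turn the insert above into a true merge, swap in your database's upsert dialect; the exact statement depends on the target engine. A hypothetical MySQL variant, assuming id is the primary key of employee:

// Hypothetical MySQL upsert; PostgreSQL would instead use
// INSERT ... ON CONFLICT (id) DO UPDATE SET ...
val upsertSql =
  """INSERT INTO employee (id, fname, lname, userid)
    |VALUES (?, ?, ?, ?)
    |ON DUPLICATE KEY UPDATE
    |  fname  = VALUES(fname),
    |  lname  = VALUES(lname),
    |  userid = VALUES(userid)""".stripMargin
val upsertStmt = dbc.prepareStatement(upsertSql) // reuse the batching loop above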
Source: https://stackoverflow.com/questions/46613907/does-apache-spark-sql-support-merge-clause