Question
I have a use-case in Spark where I have to read data from an S3 bucket that uses client-side encryption, process it, and write it back using only server-side encryption. I'm wondering if there's a way to do this in Spark?
Currently, I have these options set:
spark.hadoop.fs.s3.cse.enabled=true
spark.hadoop.fs.s3.enableServerSideEncryption=true
spark.hadoop.fs.s3.serverSideEncryption.kms.keyId=<kms id here>
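For context, here is a minimal sketch of how these options might be wired up when building the session (the app name is a placeholder and the KMS key id is left as in the question):

import org.apache.spark.sql.SparkSession

// Sketch: setting the CSE/SSE options from above on the session builder.
// "cse-read-sse-write" is a placeholder app name.
val spark = SparkSession.builder()
  .appName("cse-read-sse-write")
  .config("spark.hadoop.fs.s3.cse.enabled", "true")
  .config("spark.hadoop.fs.s3.enableServerSideEncryption", "true")
  .config("spark.hadoop.fs.s3.serverSideEncryption.kms.keyId", "<kms id here>")
  .getOrCreate()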
But obviously, this ends up using both CSE and SSE while writing the data. So I'm wondering if it's possible to somehow set spark.hadoop.fs.s3.cse.enabled to true only while reading and then to false for the write, or whether there's another alternative.
Thanks for the help.
Answer 1:
One option is to use programmatic configuration to define a second S3 filesystem:
spark.hadoop.fs.s3.cse.enabled=true
spark.hadoop.fs.s3sse.impl=foo.bar.S3SseFilesystem
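These two properties can equally be set programmatically on a running session's Hadoop configuration (a sketch, assuming spark is already in scope; note the spark.hadoop prefix is dropped when setting Hadoop properties directly):

spark.sparkContext.hadoopConfiguration.set("fs.s3.cse.enabled", "true")
spark.sparkContext.hadoopConfiguration.set("fs.s3sse.impl", "foo.bar.S3SseFilesystem")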
and then add a custom implementation for s3sse:
package foo.bar

import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.s3a.S3AFileSystem

class S3SseFilesystem extends S3AFileSystem {

  override def initialize(name: URI, originalConf: Configuration): Unit = {
    // Start from a fresh Hadoop configuration so the CSE flag injected
    // into originalConf for the default fs.s3 scheme is not carried over.
    val conf = new Configuration()
    // NOTE: no spark.hadoop prefix here; these are plain Hadoop properties
    conf.set("fs.s3.enableServerSideEncryption", "true")
    conf.set("fs.s3.serverSideEncryption.kms.keyId", "<kms id here>")
    super.initialize(name, conf)
  }
}
After this, the custom file system can be used with Spark's read method:
spark.read.json("s3sse://bucket/prefix")
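Putting the pieces together, the original use-case then becomes a read through the default s3 scheme (with CSE enabled) followed by a write through the custom s3sse scheme (SSE only). A sketch with hypothetical bucket and prefix names:

// Read client-side-encrypted data via the default s3 scheme
val df = spark.read.json("s3://source-bucket/encrypted-input")

// ...process the data...

// Write back via the custom scheme, which applies only server-side encryption
df.write.json("s3sse://dest-bucket/output")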
Source: https://stackoverflow.com/questions/62869519/spark-emrfs-s3-is-there-a-way-to-read-client-side-encrypted-data-and-write-i