Question
I have a use-case in Spark where I have to read data from an S3 bucket that uses client-side encryption, process it, and write it back using only server-side encryption. I'm wondering if there's a way to do this in Spark?
Currently, I have these options set:
spark.hadoop.fs.s3.cse.enabled=true
spark.hadoop.fs.s3.enableServerSideEncryption=true
spark.hadoop.fs.s3.serverSideEncryption.kms.keyId=<kms id here>
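For context, here is a minimal sketch of how these options might be wired up when building the session (the app name is a placeholder and the KMS key id is left as in the question):

import org.apache.spark.sql.SparkSession

// Sketch: setting the CSE/SSE options from above on the session builder.
// "cse-read-sse-write" is a placeholder app name.
val spark = SparkSession.builder()
  .appName("cse-read-sse-write")
  .config("spark.hadoop.fs.s3.cse.enabled", "true")
  .config("spark.hadoop.fs.s3.enableServerSideEncryption", "true")
  .config("spark.hadoop.fs.s3.serverSideEncryption.kms.keyId", "<kms id here>")
  .getOrCreate()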
But obviously, this ends up using both CSE and SSE while writing the data. So I'm wondering if it's possible to somehow set spark.hadoop.fs.s3.cse.enabled to true only while reading and then to false for the write, or whether there's another alternative.
Thanks for the help.
Answer 1:
One option is to use programmatic configuration to define a second S3 filesystem:
spark.hadoop.fs.s3.cse.enabled=true
spark.hadoop.fs.s3sse.impl=foo.bar.S3SseFilesystem
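These two properties can equally be set programmatically on a running session's Hadoop configuration (a sketch, assuming spark is already in scope; note the spark.hadoop prefix is dropped when setting Hadoop properties directly):

spark.sparkContext.hadoopConfiguration.set("fs.s3.cse.enabled", "true")
spark.sparkContext.hadoopConfiguration.set("fs.s3sse.impl", "foo.bar.S3SseFilesystem")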
and then add a custom implementation for s3sse:
package foo.bar

import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.s3a.S3AFileSystem

class S3SseFilesystem extends S3AFileSystem {

  override def initialize(name: URI, originalConf: Configuration): Unit = {
    // Start from a fresh Hadoop configuration so the CSE flag injected
    // into originalConf for the default fs.s3 scheme is not carried over.
    val conf = new Configuration()
    // NOTE: no spark.hadoop prefix here; these are plain Hadoop properties
    conf.set("fs.s3.enableServerSideEncryption", "true")
    conf.set("fs.s3.serverSideEncryption.kms.keyId", "<kms id here>")
    super.initialize(name, conf)
  }
}
After this, the custom file system can be used with Spark's read method:
spark.read.json("s3sse://bucket/prefix")
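Putting the pieces together, the original use-case then becomes a read through the default s3 scheme (with CSE enabled) followed by a write through the custom s3sse scheme (SSE only). A sketch with hypothetical bucket and prefix names:

// Read client-side-encrypted data via the default s3 scheme
val df = spark.read.json("s3://source-bucket/encrypted-input")

// ...process the data...

// Write back via the custom scheme, which applies only server-side encryption
df.write.json("s3sse://dest-bucket/output")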
Source: https://stackoverflow.com/questions/62869519/spark-emrfs-s3-is-there-a-way-to-read-client-side-encrypted-data-and-write-i