Trying to read and write parquet files from s3 with local spark


I'm going to have to correct the post by himanshuIIITian slightly (sorry).

  1. Use the s3a connector, not the older, obsolete, unmaintained s3n. S3A is faster, works with the newer S3 regions (Seoul, Frankfurt, London, ...), and scales better; a configuration sketch follows this list. S3N has fundamental performance issues which have only been fixed in the latest version of Hadoop by deleting that connector entirely. Move on.

  2. You cannot safely use S3 as the direct destination of a Spark query, not with the classic "FileSystem" committers available today. Write to your local file:// filesystem and then copy the data up afterwards using the AWS CLI. You'll get better performance as well as the guarantees of reliable writing which you would normally expect from IO.
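
To make point 1 concrete, here is a minimal sketch of an s3a-based setup, assuming the hadoop-aws and AWS SDK jars discussed below are on the classpath; the bucket name, path, and credential placeholders are illustrative-

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local").appName("s3a example").getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", "yourAccessKey")    // placeholder; the default AWS credential chain also works
hadoopConf.set("fs.s3a.secret.key", "yourSecretKey")    // placeholder

// Read through the s3a connector instead of s3n
val df = spark.read.parquet("s3a://your-bucket/abc.parquet")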

To read and write Parquet files from S3 with local Spark, you need to add the following two dependencies to your sbt project-

"com.amazonaws" % "aws-java-sdk" % "1.7.4"
"org.apache.hadoop" % "hadoop-aws" % "2.7.3"

I am assuming it's an sbt project. If it's a Maven project, then add the following dependencies-

<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk</artifactId>
    <version>1.7.4</version>
</dependency>

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>2.7.3</version>
</dependency>

Then you need to set the S3 credentials in the SparkSession's Hadoop configuration, like this-

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder.master("local").appName("spark session example").getOrCreate()
// s3AccessKey and s3SecretKey are placeholders for your own credentials
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "s3AccessKey")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "s3SecretKey")
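
Spark also copies any property prefixed with spark.hadoop. into the Hadoop configuration, so as a hedged alternative (same s3n keys and placeholder credentials as above) you can supply the credentials while building the session-

val sparkSession = SparkSession.builder
  .master("local")
  .appName("spark session example")
  .config("spark.hadoop.fs.s3n.awsAccessKeyId", "s3AccessKey")        // placeholder
  .config("spark.hadoop.fs.s3n.awsSecretAccessKey", "s3SecretKey")    // placeholder
  .getOrCreate()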

And it's done. Now you can read/write Parquet files to S3. For example:

val df = sparkSession.read.parquet("s3n://bucket/abc.parquet")    // Read
df.write.parquet("s3n://bucket/xyz.parquet")    // Write
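
If you take the advice in point 2 at the top, a hedged sketch of the safer write path would be to write to the local filesystem first and then upload with the AWS CLI (the local directory is illustrative)-

// Write to the local filesystem first...
df.write.parquet("file:///tmp/spark-output")
// ...then copy the result up with the AWS CLI, e.g. from a shell:
//   aws s3 cp --recursive /tmp/spark-output s3://bucket/xyz.parquet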

I hope it helps!
