How to read multiple gzipped files from S3 into a single RDD?

时光说笑 · 2020-12-09 17:38

I have many gzipped files stored on S3, organized by project and by hour per day. The file paths follow this pattern:

    s3:///proj

3 Answers
  •  情书的邮戳 · 2020-12-09 17:57

    Note: Under Spark 1.2, the proper format would be as follows:

    val rdd = sc.textFile("s3n://<bucket>/<dir>/bar.*.gz")
    

    That's s3n://, not s3://
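
    For completeness, here's a minimal self-contained sketch of pulling many gzipped files into one RDD. The bucket name (my-bucket) and directory layout are hypothetical stand-ins for the question's project/date structure:

        import org.apache.spark.{SparkConf, SparkContext}

        object ReadGzippedS3 {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(new SparkConf().setAppName("read-gzipped-s3"))

            // textFile accepts glob patterns and comma-separated path lists;
            // every matched file lands in the same RDD, and .gz inputs are
            // decompressed transparently (each gzip file becomes one
            // unsplittable partition).
            val rdd = sc.textFile(
              "s3n://my-bucket/project1/2014120*/*.gz," + // hypothetical layout: glob over days
              "s3n://my-bucket/project2/20141201/*.gz")   // add more projects with commas

            println(rdd.count())
            sc.stop()
          }
        }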

    You'll also want to put your credentials in conf/spark-env.sh as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
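
    If editing conf/spark-env.sh isn't convenient, here is a sketch of the programmatic alternative: setting the standard Hadoop s3n credential properties on the context. Reading them from the environment, as below, is just one option:

        // Assumes an existing SparkContext `sc`; fs.s3n.* are the standard
        // Hadoop property names for s3n credentials.
        sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
        sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))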
