How to read multiple gzipped files from S3 into a single RDD?

时光说笑 2020-12-09 17:38

I have many gzipped files stored on S3, organized by project and by hour per day. The file paths follow this pattern:

s3:///proj
3 Answers
  •  感动是毒
    2020-12-09 18:08

    Using AWS EMR with Spark 2.0.0 and SparkR in RStudio, I managed to read the gzip-compressed Wikipedia pagecount files stored on S3 with the command below:

    df <- read.text("s3:///pagecounts-20110101-000000.gz")
    

    Similarly, to read all files for January 2011 in one call, use the same command with a glob pattern:

    df <- read.text("s3:///pagecounts-201101??-*.gz")
    

    See the SparkR API docs for more options: https://spark.apache.org/docs/latest/api/R/read.text.html
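    To clarify the wildcard syntax used in the path above: `?` matches exactly one character and `*` matches any run of characters, so `pagecounts-201101??-*.gz` selects every day and hour in January 2011. Spark expands these patterns itself when listing S3 keys; the short Python sketch below only demonstrates the matching rules against some hypothetical file names, using the standard-library `fnmatch` module.

    ```python
    # Illustrative only: shows how the glob wildcards in the SparkR path above
    # select files. '?' matches exactly one character, '*' matches any run.
    # The file names here are hypothetical examples, not real S3 keys.
    from fnmatch import fnmatch

    pattern = "pagecounts-201101??-*.gz"

    names = [
        "pagecounts-20110101-000000.gz",   # Jan 1, hour 00 -> matches
        "pagecounts-20110131-230000.gz",   # Jan 31, hour 23 -> matches
        "pagecounts-20110201-000000.gz",   # Feb 1 -> does not match
        "pagecounts-20110101-000000.txt",  # wrong extension -> does not match
    ]

    matched = [n for n in names if fnmatch(n, pattern)]
    print(matched)
    # -> ['pagecounts-20110101-000000.gz', 'pagecounts-20110131-230000.gz']
    ```

    The same matching rules apply whether you pass the pattern to SparkR's `read.text` or to `sc.textFile` in other Spark APIs; Spark also decompresses `.gz` files transparently when reading them as text.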
