Reading multiple files from S3 in parallel (Spark, Java)

天涯浪人 2021-02-04 10:52

I saw a few discussions on this but couldn't quite understand the right solution: I want to load a couple hundred files from S3 into an RDD. Here is how I'm doing it now:

3 Answers
  •  青春惊慌失措
    2021-02-04 11:42

    You can use sc.textFile to read multiple files.

    You can pass multiple file URLs as a single argument, separated by commas.

    You can specify whole directories, use wildcards, and even give a comma-separated list of directories and wildcards.

    Ex:

    sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")
    

    Reference: from this answer
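    To make the comma-separated form above concrete in Java, here is a minimal sketch. The bucket names and prefixes are hypothetical; the runnable part just builds the single path string that sc.textFile accepts, and the Spark call itself (which needs a live JavaSparkContext) is shown in a comment:

    ```java
    import java.util.Arrays;
    import java.util.List;

    public class S3Paths {
        // Join several path patterns into the single comma-separated
        // argument that sc.textFile accepts.
        static String joinPaths(List<String> paths) {
            return String.join(",", paths);
        }

        public static void main(String[] args) {
            // Hypothetical bucket and prefixes, for illustration only.
            List<String> paths = Arrays.asList(
                "s3a://my-bucket/logs/2021/01/*",
                "s3a://my-bucket/logs/2021/02/part-00[0-5]*",
                "s3a://my-bucket/extra/specific-file.txt");

            String arg = joinPaths(paths);
            System.out.println(arg);

            // With a JavaSparkContext sc already set up for S3 access:
            //   JavaRDD<String> lines = sc.textFile(arg);
            // Spark expands the wildcards and reads the matched files
            // in parallel, one or more partitions per file split.
        }
    }
    ```

    Each comma-separated entry may be a file, a directory, or a glob; Spark resolves them all into one RDD.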
