I saw a few discussions on this but couldn't quite understand the right solution: I want to load a couple hundred files from S3 into an RDD. Here is how I'm doing it now:
You can use sc.textFile to read multiple files.
Its path argument can name more than one file: you can specify whole directories, use wildcards, and even pass a comma-separated list of directories and wildcards.
Example:
sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")
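If the list of files is built in code rather than typed by hand, one way to get the same effect is to join the paths with commas before calling sc.textFile. The following is only a sketch in Python: the helper names join_paths and load_text_files, the s3a:// scheme, and the bucket and key names are all made up for illustration, and sc is assumed to be a SparkContext you have already created.

```python
def join_paths(paths):
    """Build the single comma-separated string passed to sc.textFile.

    Blank entries are dropped and surrounding whitespace is trimmed.
    """
    cleaned = [p.strip() for p in paths if p and p.strip()]
    if not cleaned:
        raise ValueError("no input paths given")
    return ",".join(cleaned)


def load_text_files(sc, paths):
    """Return one RDD of lines covering every path in `paths`.

    `sc` is assumed to be an existing SparkContext; it is passed in so
    this sketch does not need to create one itself.
    """
    return sc.textFile(join_paths(paths))


# Hypothetical bucket and keys, for illustration only.
example_paths = [
    "s3a://my-bucket/logs/2015-01-01/*",
    "s3a://my-bucket/logs/2015-01-02/part-00[0-5]*",
]
print(join_paths(example_paths))
```

Since the comma is the separator, this approach is not suitable for a path that itself contains a comma.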
Reference: taken from another answer.