Spark: How to generate file paths to read from S3 with Scala


Question


How do I generate and load multiple S3 file paths in Scala so that I can use:

   sqlContext.read.json("s3://..../*/*/*")

I know I can use wildcards to read multiple files, but is there any way to generate the paths? For example, my file structure looks like this: BucketName/year/month/day/files

       s3://testBucket/2016/10/16/part00000

These files are all JSON. The issue is that I need to load only a specific duration of files. For example, with a 16-day duration and a start day of Oct 16, I need to load files from Oct 1 to Oct 16.

With a 28-day duration for the same start day, I would like to read from Sep 18.

Can someone tell me how to do this?


Answer 1:


You can take a look at this answer. You can specify whole directories, use wildcards, and even a comma-separated list of directories and wildcards. E.g.:

sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")

Or you can use the AWS API to get the list of file locations and read those files using Spark.

You can look at this answer about AWS S3 file search.
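A minimal sketch of the listing approach, assuming the AWS SDK for Java (v1, aws-java-sdk-s3) is on the classpath and Spark 2.x, where DataFrameReader.json accepts multiple paths; the bucket name and prefix are placeholders taken from the question:

    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import scala.collection.JavaConverters._

    // Hypothetical bucket and key prefix; substitute your own values.
    val bucket = "testBucket"
    val prefix = "2016/10/"

    val s3 = AmazonS3ClientBuilder.defaultClient()

    // List object keys under the prefix and turn them into s3:// paths.
    // Note: listObjects returns at most one page (1000 keys); paginate for larger listings.
    val paths = s3.listObjects(bucket, prefix)
      .getObjectSummaries.asScala
      .map(summary => s"s3://$bucket/${summary.getKey}")

    // Read all listed files at once.
    val df = sqlContext.read.json(paths: _*)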




Answer 2:


You can generate a comma-separated list of paths: sqlContext.read.json("s3://testBucket/2016/10/16/,s3://testBucket/2016/10/15/,...");
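A minimal sketch of generating such a path list for a date range, assuming the s3://bucket/yyyy/MM/dd/ layout from the question and Spark 2.x (DataFrameReader.json takes varargs paths); startDay and numDays are hypothetical parameters:

    import java.time.LocalDate
    import java.time.format.DateTimeFormatter

    // Hypothetical parameters: the last day to read and how many days back to include.
    val startDay = LocalDate.of(2016, 10, 16)
    val numDays  = 16

    // Assumes zero-padded month/day directories, e.g. 2016/10/01; adjust if your layout differs.
    val fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd")

    // One path per day: s3://testBucket/2016/10/01/ ... s3://testBucket/2016/10/16/
    val paths = (0 until numDays).map { i =>
      s"s3://testBucket/${startDay.minusDays(i).format(fmt)}/"
    }

    // Read all generated paths at once.
    val df = sqlContext.read.json(paths: _*)

With numDays = 28 and the same startDay, the same code reads back to Sep 19 (28 days ending on Oct 16), so only the two parameters change.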



Source: https://stackoverflow.com/questions/40068011/spark-how-to-generate-file-path-to-read-from-s3-with-scala
