Question
I have a number of files on Amazon S3, each segregated by date (date=yyyymmdd). The files go back 6 months, but I would like to restrict my script to use only the last 3 months of data. I am unsure whether I can use regular expressions to do something like sc.textFile("s3://path_to_dir/yyyy[m1,m2,m3]*"),
where m1, m2, m3 represent the 3 months counting back from the current date that I would like to use.
One discussion also suggested using something like sc.textFile("s3://path_to_dir/yyyym1*","s3://path_to_dir/yyyym2*","s3://path_to_dir/yyyym3*"),
but that doesn't seem to work for me.
Does sc.textFile() take regular expressions? I know you can use glob expressions, but I am unsure how to represent the above case as a glob expression.
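To make the m1, m2, m3 prefixes concrete, here is a minimal sketch of how they could be derived from the current date (the helper name is purely illustrative):

from datetime import date

def last_three_month_prefixes(today=None):
    # Return yyyymm strings for the current month and the two before it,
    # e.g. ['201507', '201506', '201505'] when run in July 2015.
    today = today or date.today()
    year, month = today.year, today.month
    prefixes = []
    for _ in range(3):
        prefixes.append("%04d%02d" % (year, month))
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return prefixes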
Answer 1:
For your first option, use curly braces:
sc.textFile("s3://path_to_dir/yyyy{m1,m2,m3}*")
For your second option, you can read each glob into its own RDD and then union those RDDs into one:
m1 = sc.textFile("s3://path_to_dir/yyyym1*")
m2 = sc.textFile("s3://path_to_dir/yyyym2*")
m3 = sc.textFile("s3://path_to_dir/yyyym3*")
all_months = m1.union(m2).union(m3)  # 'all' would shadow a Python builtin, so use a different name
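If the month list is computed at runtime, the same idea can be written more compactly with SparkContext.union, reusing the hypothetical prefix helper from the question (again only a sketch):

months = last_three_month_prefixes()
rdds = [sc.textFile("s3://path_to_dir/date=%s*" % m) for m in months]
all_months = sc.union(rdds)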
You can use globs with sc.textFile, but not full regular expressions.
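As a side note, sc.textFile expects a single path string, which is probably why passing several separate string arguments did not work; in my experience you can join multiple globs into one comma-separated string instead, since the underlying Hadoop input handling accepts that form (sketch):

recent = sc.textFile(",".join("s3://path_to_dir/date=%s*" % m for m in months))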
Source: https://stackoverflow.com/questions/31543766/pyspark-select-subset-of-files-using-regex-glob-from-s3