pyspark select subset of files using regex/glob from s3

旧巷少年郎 asked 2020-12-12 01:40

I have a number of files on Amazon S3, each partitioned by date (date=yyyymmdd). The files go back 6 months, but I would like to restrict my script to only use the last 3 months of data.

1 Answer
  • 2020-12-12 02:03

    For your first option, use curly braces:

    sc.textFile("s3://path_to_dir/yyyy{m1,m2,m3}*")
    
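A brace pattern like the one above can also be built programmatically for a rolling "last N months" window. This is a minimal sketch in plain Python, assuming the `s3://path_to_dir` prefix and the `date=yyyymmdd` partition layout from the question; the helper name is hypothetical:

```python
from datetime import date

def last_months_glob(base, today, n=3):
    # Collect the last n calendar months as yyyymm strings,
    # walking backwards from `today`.
    months = []
    y, m = today.year, today.month
    for _ in range(n):
        months.append(f"{y:04d}{m:02d}")
        m -= 1
        if m == 0:
            y, m = y - 1, 12
    # Join them into one curly-brace alternation glob.
    return f"{base}/date={{{','.join(months)}}}*"

pattern = last_months_glob("s3://path_to_dir", date(2020, 12, 12))
# pattern == "s3://path_to_dir/date={202012,202011,202010}*"
```

The resulting string can then be passed straight to `sc.textFile(pattern)`.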

    For your second option, you can read each single glob into an RDD and then union those RDDs into a single one:

    m1 = sc.textFile("s3://path_to_dir/yyyym1*")
    m2 = sc.textFile("s3://path_to_dir/yyyym2*")
    m3 = sc.textFile("s3://path_to_dir/yyyym3*")
    all = m1.union(m2).union(m3)
    
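When the list of months grows, chaining pairwise `.union` calls gets unwieldy; `SparkContext.union` accepts a list of RDDs in one call. A sketch, again assuming the placeholder `s3://path_to_dir` prefix and `date=` layout from the question (the Spark calls are shown as comments since they need a live `SparkContext`):

```python
months = ["202010", "202011", "202012"]
paths = [f"s3://path_to_dir/date={m}*" for m in months]
# paths == ["s3://path_to_dir/date=202010*",
#           "s3://path_to_dir/date=202011*",
#           "s3://path_to_dir/date=202012*"]

# Inside a Spark job with a SparkContext `sc`:
# rdds = [sc.textFile(p) for p in paths]
# all_data = sc.union(rdds)
```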

    You can use globs with sc.textFile but not full regular expressions.
