Recursively fetch file contents from subdirectories using sc.textFile

Asked by 没有蜡笔的小新 on 2020-12-05 05:21

It seems that SparkContext textFile expects only files to be present in the given directory location - it does not either

  • (a) recurse, or
  • (b) even …
2 Answers
  • 2020-12-05 05:43

    I have found that these parameters must be set in the following way:

    .set("spark.hive.mapred.supports.subdirectories","true")
    .set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true")
    
  • 2020-12-05 05:47

I was looking at an old version of FileInputFormat.

BEFORE setting the recursive config mapreduce.input.fileinputformat.input.dir.recursive:

    scala> sc.textFile("dev/*").count
         java.io.IOException: Not a file: file:/shared/sparkup/dev/audit-release/blank_maven_build
    

The default is null (not set), which is evaluated as "false":

    scala> sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.input.dir.recursive")
    res1: String = null
    

    AFTER:

Now set the value:

    sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
    

    Now retry the recursive operation:

scala> sc.textFile("dev/*/*").count
    
    ..
    res5: Long = 3481
    
    So it works.
    

Update: added /* to the path for full recursion, per the comment by @Ben.
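
    Putting the steps together, a minimal self-contained sketch (the dev/*/* path is this answer's example; the app name and local master are placeholders - substitute your own):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: stand up a local context, enable recursive input listing,
    // then read every file two levels under dev/ (per the example above).
    val sc = new SparkContext(
      new SparkConf().setAppName("recursive-read-demo").setMaster("local[*]"))
    sc.hadoopConfiguration.set(
      "mapreduce.input.fileinputformat.input.dir.recursive", "true")
    val total = sc.textFile("dev/*/*").count()   // no more "Not a file" IOException
    println(s"lines under dev/: $total")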
