Recursively fetch file contents from subdirectories using sc.textFile

Asked by 没有蜡笔的小新 on 2020-12-05 05:21

It seems that SparkContext textFile expects only files to be present in the given directory location - it does not either

  • (a) recurse, or
  • (b) even …
2 Answers
  • 2020-12-05 05:43

    I have found that these parameters must be set in the following way:

    .set("spark.hive.mapred.supports.subdirectories","true")
    .set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true")
    
  • 2020-12-05 05:47

I was looking at an old version of FileInputFormat.

BEFORE setting the recursive config mapreduce.input.fileinputformat.input.dir.recursive:

    scala> sc.textFile("dev/*").count
         java.io.IOException: Not a file: file:/shared/sparkup/dev/audit-release/blank_maven_build
    

The default is null (not set), which is evaluated as "false":

    scala> sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.input.dir.recursive")
    res1: String = null
    

    AFTER:

Now set the value:

    sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
    

    Now retry the recursive operation:

scala> sc.textFile("dev/*/*").count
    
    ..
    res5: Long = 3481
    
    So it works.
    

Update: added /* to the path for full recursion, per the comment by @Ben.
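
    Putting the steps together, a minimal self-contained sketch (the dev/*/* path is this answer's example; the app name and local master are placeholders - substitute your own):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: stand up a local context, enable recursive input listing,
    // then read every file two levels under dev/ (per the example above).
    val sc = new SparkContext(
      new SparkConf().setAppName("recursive-read-demo").setMaster("local[*]"))
    sc.hadoopConfiguration.set(
      "mapreduce.input.fileinputformat.input.dir.recursive", "true")
    val total = sc.textFile("dev/*/*").count()   // no more "Not a file" IOException
    println(s"lines under dev/: $total")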
