Spark Streaming: getting the file name from a DStream RDD

轮回少年 2020-12-16 20:12

Spark Streaming's textFileStream and fileStream can monitor a directory and process the new files as RDDs in a DStream.

How can I get the names of the files being processed?

3 answers

误落风尘 (original poster)

2020-12-16 20:48

Alternatively, modify FileInputDStream so that, rather than loading the contents of the files into the RDD, it simply creates an RDD from the filenames.

This gives a performance boost if you don't actually want to read the data itself into the RDD, or want to pass filenames to an external command as one of your steps.

Simply change filesToRDD(..) so that it makes an RDD of the filenames, rather than loading the data into the RDD.

See: https://github.com/HASTE-project/bin-packing-paper/blob/master/spark/spark-scala-cellprofiler/src/main/scala/FileInputDStream2.scala#L278
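
For a rough idea of what "an RDD of filenames" looks like without copying the whole FileInputDStream source, here is a minimal sketch. It is not the code from the link above: it is a simplified stand-in that subclasses the public InputDStream directly. The class name FileNameInputDStream and the directory path are made up for illustration, and it skips what the real FileInputDStream does around modification-time windows, filters, and checkpoint recovery.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

import scala.collection.mutable

// Hypothetical, simplified stand-in for a modified FileInputDStream:
// each batch yields the paths of newly seen files, not their contents.
class FileNameInputDStream(streamingContext: StreamingContext, directory: String)
    extends InputDStream[String](streamingContext) {

  // Paths already emitted in earlier batches (kept on the driver).
  private val seen = mutable.HashSet.empty[String]

  override def start(): Unit = {}
  override def stop(): Unit = {}

  // Called once per batch interval on the driver.
  override def compute(validTime: Time): Option[RDD[String]] = {
    val dir = new Path(directory)
    val fs = dir.getFileSystem(context.sparkContext.hadoopConfiguration)
    val newFiles = fs.listStatus(dir)
      .filter(_.isFile)
      .map(_.getPath.toString)
      .filterNot(seen.contains)
      .toSeq
    seen ++= newFiles
    // The RDD holds only the filenames; no file data is read here.
    Some(context.sparkContext.parallelize(newFiles))
  }
}
```

Assuming an existing StreamingContext named ssc, something like `new FileNameInputDStream(ssc, "hdfs:///incoming").foreachRDD(rdd => rdd.collect().foreach(println))` would print the new paths for each batch. Note that this sketch would also pick up files that are still being written, so it is only suitable where files are moved into the directory atomically.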
