Spark Streaming: how to get the file name from a DStream RDD

轮回少年 2020-12-16 20:12

Spark Streaming's textFileStream and fileStream can monitor a directory and process the new files as RDDs in a DStream.

How can I get the names of the files being processed?

3 Answers
  •  误落风尘
    2020-12-16 20:48

    Alternatively, you can modify FileInputDStream so that, rather than loading the contents of the files into the RDD, it simply creates an RDD of the filenames.

    This avoids reading the file contents at all, which improves performance if you don't need the data itself in the RDD, or if you only want to pass the filenames to an external command as one of your steps.

    Simply change filesToRDD(..) so that it makes an RDD of the filenames, rather than loading the data into the RDD.

    See: https://github.com/HASTE-project/bin-packing-paper/blob/master/spark/spark-scala-cellprofiler/src/main/scala/FileInputDStream2.scala#L278
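
The change to filesToRDD described above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked file: the object name FileNamesRddSketch, the standalone filesToRDD signature that takes a SparkContext, and the example paths are all assumptions. In stock Spark, filesToRDD is a private method of FileInputDStream that builds one Hadoop RDD per new file and unions them, so in practice the edit would live in your own copy of that class.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object FileNamesRddSketch {
  // Hypothetical stand-in for the modified filesToRDD: instead of opening
  // each file, it only distributes the path strings as the RDD's elements.
  def filesToRDD(sc: SparkContext, files: Seq[String]): RDD[String] =
    sc.parallelize(files)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("file-names-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up paths standing in for the new files found in one batch interval.
      val newFiles = Seq("hdfs:///incoming/a.tif", "hdfs:///incoming/b.tif")
      val names: RDD[String] = filesToRDD(sc, newFiles)
      // Each element is a file path; no file contents have been read.
      names.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Because each element is just a path, a later step could hand the paths to an external program, for example with RDD.pipe, which matches the external-command use case mentioned above.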
