How to run a streaming query on updated lines in a CSV file?

Backend · open · 2 answers · 824 views

甜味超标 2020-12-11 23:19

I have one CSV file in a folder that keeps being updated continuously. I need to take inputs from this CSV file and produce some transactions. How can I take data from a CSV file that keeps updating, let's say every 5 minutes?

2 Answers
  • 2020-12-11 23:41

    Firstly, I'm not sure how you arrived at this setup, because a CSV file should be written sequentially, which gives better I/O. So my recommendation is to make it an append-only file and consume the new lines as a stream, much like reading from a binlog, as sketched below.
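
    A minimal sketch of that tailing approach in plain Scala, assuming the file is only ever appended to (the path and poll interval are placeholders, not from the original answer):

    import java.io.RandomAccessFile

    // Tail an append-only file, binlog-style: remember the byte offset consumed
    // so far and read only what was appended since the last poll.
    val raf = new RandomAccessFile("/path/to/transactions.csv", "r")
    var offset = raf.length()  // start at the current end; use 0L to replay history

    while (true) {
      if (raf.length() > offset) {
        raf.seek(offset)
        var line = raf.readLine()
        while (line != null) {
          println(line)  // each appended CSV line arrives here exactly once
          line = raf.readLine()
        }
        offset = raf.getFilePointer
      }
      Thread.sleep(1000)  // poll interval
    }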

    However, if you have to do this, I think StreamingContext may help you:

    val ssc = new StreamingContext(new SparkConf(), Durations.milliseconds(1))
    val fileStream = ssc
      .fileStream[LongWritable, Text, TextInputFormat]("/tmp", (x: Path) => true, newFilesOnly = false)
      .map(_._2.toString)
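
    For completeness, a self-contained version of that snippet with the imports and lifecycle calls it needs to run (the app name, master, directory, and one-second batch interval are placeholder assumptions):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Durations, StreamingContext}

    val conf = new SparkConf().setAppName("CsvFileStream").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Durations.seconds(1))  // micro-batch interval

    // newFilesOnly = false also picks up files already present at start-up.
    val lines = ssc
      .fileStream[LongWritable, Text, TextInputFormat]("/tmp", (_: Path) => true, newFilesOnly = false)
      .map(_._2.toString)

    lines.print()  // replace with your transaction-producing logic

    ssc.start()
    ssc.awaitTermination()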
    
  • 2020-12-11 23:53

    "I have one CSV file in a folder that keeps being updated. I need to take inputs from this CSV file and produce some transactions. How can I take data from a CSV file that keeps updating, let's say every 5 minutes?"

    tl;dr It won't work.

    By default, Spark Structured Streaming monitors a directory and triggers a computation for every new file that appears in it. Once a file has been processed, it will never be processed again, so appends to an already-processed file are silently ignored. That's the default implementation.
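
    To make the default behaviour concrete, here is a minimal sketch of the built-in file source (the schema, paths, and console sink are placeholder assumptions, not from the original answer):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

    val spark = SparkSession.builder().appName("CsvFileSource").master("local[2]").getOrCreate()

    // Streaming file sources require an explicit schema.
    val schema = new StructType()
      .add("id", StringType)
      .add("amount", DoubleType)

    val transactions = spark.readStream
      .schema(schema)
      .option("header", "true")
      .csv("/path/to/folder")  // each *new* file is processed exactly once

    val query = transactions.writeStream
      .format("console")
      .start()

    query.awaitTermination()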

    You could write your own streaming source that monitors a single file for changes, but that means custom source development (doable, though in most cases not worth the effort). A simpler fallback is sketched below.
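
    If a custom source is overkill, one pragmatic fallback (my suggestion, not part of the original answer) is to skip streaming altogether and re-read the file as a plain batch on a schedule, matching the "every 5 minutes" from the question; the path and processing logic are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("CsvPoll").master("local[2]").getOrCreate()

    while (true) {
      // Batch read: takes a fresh snapshot of the file, appended lines included.
      val snapshot = spark.read.option("header", "true").csv("/path/to/file.csv")
      snapshot.show()  // replace with your transaction-producing logic
      Thread.sleep(5 * 60 * 1000L)  // poll every 5 minutes
    }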
