How to run a streaming query on updated lines in a CSV file?

Backend · open · 2 answers · 824 views

甜味超标 2020-12-11 23:19

I have one CSV file in a folder that keeps being updated continuously. I need to take inputs from this CSV file and produce some transactions. How can I take data from a CSV file that keeps updating, let's say every 5 minutes?

2 Answers
  • 2020-12-11 23:41

    Firstly, I'm not sure how you arrived at this setup, because a CSV file should be written sequentially, which gives better I/O. So my recommendation is to make it an append-only file and consume the new lines as a stream, much like reading from a binlog, as sketched below.
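
    A minimal sketch of that tailing approach in plain Scala, assuming the file is only ever appended to (the path and poll interval are placeholders, not from the original answer):

    import java.io.RandomAccessFile

    // Tail an append-only file, binlog-style: remember the byte offset consumed
    // so far and read only what was appended since the last poll.
    val raf = new RandomAccessFile("/path/to/transactions.csv", "r")
    var offset = raf.length()  // start at the current end; use 0L to replay history

    while (true) {
      if (raf.length() > offset) {
        raf.seek(offset)
        var line = raf.readLine()
        while (line != null) {
          println(line)  // each appended CSV line arrives here exactly once
          line = raf.readLine()
        }
        offset = raf.getFilePointer
      }
      Thread.sleep(1000)  // poll interval
    }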

    However, if you have to do this, I think StreamingContext may help you:

    val ssc = new StreamingContext(new SparkConf(), Durations.milliseconds(1))
    val fileStream = ssc
      .fileStream[LongWritable, Text, TextInputFormat]("/tmp", (x: Path) => true, newFilesOnly = false)
      .map(_._2.toString)
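
    For completeness, a self-contained version of that snippet with the imports and lifecycle calls it needs to run (the app name, master, directory, and one-second batch interval are placeholder assumptions):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Durations, StreamingContext}

    val conf = new SparkConf().setAppName("CsvFileStream").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Durations.seconds(1))  // micro-batch interval

    // newFilesOnly = false also picks up files already present at start-up.
    val lines = ssc
      .fileStream[LongWritable, Text, TextInputFormat]("/tmp", (_: Path) => true, newFilesOnly = false)
      .map(_._2.toString)

    lines.print()  // replace with your transaction-producing logic

    ssc.start()
    ssc.awaitTermination()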
    
  • 2020-12-11 23:53

    "I have one CSV file in a folder that keeps being updated. I need to take inputs from this CSV file and produce some transactions. How can I take data from a CSV file that keeps updating, let's say every 5 minutes?"

    tl;dr It won't work.

    By default, Spark Structured Streaming monitors a directory and triggers a computation for every new file that appears in it. Once a file has been processed, it will never be processed again, so appends to an already-processed file are silently ignored. That's the default implementation.
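
    To make the default behaviour concrete, here is a minimal sketch of the built-in file source (the schema, paths, and console sink are placeholder assumptions, not from the original answer):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

    val spark = SparkSession.builder().appName("CsvFileSource").master("local[2]").getOrCreate()

    // Streaming file sources require an explicit schema.
    val schema = new StructType()
      .add("id", StringType)
      .add("amount", DoubleType)

    val transactions = spark.readStream
      .schema(schema)
      .option("header", "true")
      .csv("/path/to/folder")  // each *new* file is processed exactly once

    val query = transactions.writeStream
      .format("console")
      .start()

    query.awaitTermination()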

    You could write your own streaming source that monitors a single file for changes, but that means custom source development (doable, though in most cases not worth the effort). A simpler fallback is sketched below.
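
    If a custom source is overkill, one pragmatic fallback (my suggestion, not part of the original answer) is to skip streaming altogether and re-read the file as a plain batch on a schedule, matching the "every 5 minutes" from the question; the path and processing logic are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("CsvPoll").master("local[2]").getOrCreate()

    while (true) {
      // Batch read: takes a fresh snapshot of the file, appended lines included.
      val snapshot = spark.read.option("header", "true").csv("/path/to/file.csv")
      snapshot.show()  // replace with your transaction-producing logic
      Thread.sleep(5 * 60 * 1000L)  // poll every 5 minutes
    }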
