How to load historical data when starting a Spark Streaming process, and calculate running aggregations

柔情痞子, posted 2019-12-05 19:17:58

RDDs are immutable: once one is created you cannot add data to it, so you cannot update the revenue in place as new events arrive.

What you can do is union the existing data with the new events to create a new RDD, which you can then use as the current total. For example:

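// Seed the running total with the historical data (RDD is org.apache.spark.rdd.RDD)
var currentTotal: RDD[(Key, Value)] = ... // read from ElasticSearch
// messages is assumed here to be a DStream[(Key, Value)]
messages.foreachRDD { rdd =>
    // union returns a new RDD; point the var at it
    currentTotal = currentTotal.union(rdd)
}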

In this case we make currentTotal a var since it will be replaced by the reference to the new RDD when it gets unioned with the incoming data.

After the union you may want to perform some further operations such as reducing the values which belong to the same Key, but you get the picture.
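The "reduce the values which belong to the same Key" step could look roughly like the sketch below. It is only an illustration under assumptions: keys are taken to be String and values Double, "reducing" is taken to mean summing, and mergeBatch is a hypothetical helper name, not something from the original answer. Small in-memory RDDs stand in for the ElasticSearch history and for one micro-batch.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RunningTotalSketch {
  // Hypothetical helper: fold one batch into the running total,
  // summing the values that share a key.
  def mergeBatch(total: RDD[(String, Double)],
                 batch: RDD[(String, Double)]): RDD[(String, Double)] =
    total.union(batch).reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("running-total-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for the history that would be read from ElasticSearch.
      val history = sc.parallelize(Seq("a" -> 10.0, "b" -> 5.0))
      // Stand-in for one micro-batch of new events.
      val batch = sc.parallelize(Seq("a" -> 2.5, "c" -> 1.0))
      // After merging, "a" maps to 12.5, "b" to 5.0 and "c" to 1.0.
      val total = mergeBatch(history, batch).collect().toMap
      println(total)
    } finally {
      sc.stop()
    }
  }
}
```

Inside a streaming job the same helper would be called from foreachRDD, e.g. currentTotal = mergeBatch(currentTotal, rdd), so the total keeps one row per key instead of growing with every event.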

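If you use this technique, note that the lineage of your RDDs will grow, as each newly created RDD references its parent. After enough batches, a very long lineage can cause a StackOverflowError. To avoid this you can call checkpoint() on the RDD periodically, which saves the data and truncates the lineage. A checkpoint directory has to be set first with SparkContext.setCheckpointDir.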
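A minimal sketch of periodic checkpointing follows, again under assumptions: maybeCheckpoint, the interval of 10 batches, and the temporary checkpoint directory are all illustrative choices, and a plain loop stands in for the stream of micro-batches. The RDD is cached before checkpointing so that writing the checkpoint does not recompute it.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CheckpointSketch {
  // Hypothetical helper: every `interval` batches, checkpoint the RDD
  // and run an action so the checkpoint is actually written.
  // Returns true when a checkpoint was taken.
  def maybeCheckpoint[T](rdd: RDD[T], batchIndex: Long, interval: Long): Boolean =
    if (batchIndex % interval == 0) {
      rdd.cache()      // avoid recomputing the RDD when the checkpoint is saved
      rdd.checkpoint() // must be called before any job has run on this RDD
      rdd.count()      // the action triggers the checkpoint and cuts the lineage
      true
    } else false

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Assumption: a local temp directory; a cluster would need a shared path such as HDFS.
    sc.setCheckpointDir(Files.createTempDirectory("ckpt").toString)
    try {
      var total: RDD[(String, Double)] = sc.parallelize(Seq("a" -> 0.0))
      for (i <- 1L to 20L) {
        val batch = sc.parallelize(Seq("a" -> 1.0)) // stand-in for a micro-batch
        total = total.union(batch).reduceByKey(_ + _)
        maybeCheckpoint(total, i, interval = 10L)
      }
      println(total.collect().toMap) // "a" has accumulated 20.0
    } finally {
      sc.stop()
    }
  }
}
```

In a streaming job the batch index could be a counter kept on the driver and incremented inside foreachRDD. The right interval depends on batch size and how expensive the checkpoint write is.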
