Detecting repeating consecutive values in large datasets with Spark

廉价感情. Submitted on 2021-02-10 23:46:17

Question


Cheers,

Recently I have been trying out Spark, and so far I have observed quite interesting results, but currently I am stuck on the famous groupByKey OOM problem. Basically, the job searches large datasets for periods where the measured value increases consecutively at least N times. I managed to get rid of the problem by writing the results to disk, but the application now runs much slower (which is expected, given the disk I/O). Now the question: is there any other memory-efficient strategy where I can run over sorted data and check whether adjacent values (for the same key) are increasing in at least N consecutive observations, without resorting to the groupByKey method?

I have designed an algorithm to do it with reduceByKey, but there is one problem: reduce seems to ignore data ordering and yields completely wrong results at the end.

Any ideas appreciated, thanks.


Answer 1:


There are a few ways you can approach this problem:

  • repartitionAndSortWithinPartitions with custom partitioner and ordering:

    • keyBy (name, timestamp) pairs
    • create custom partitioner which considers only the name
    • repartitionAndSortWithinPartitions using custom partitioner
    • use mapPartitions to iterate over data and yield matching sequences
  • sortBy(Key) - this is similar to the first solution but provides higher granularity at the cost of additional post-processing.

    • keyBy (name, timestamp) pairs
    • sortByKey
    • process individual partitions using mapPartitionsWithIndex keeping track of leading / trailing patterns for each partition
    • adjust final results to include patterns which span more than one partition
  • create fixed sized windows over sorted data using sliding from mllib.rdd.RDDFunctions.

    • sortBy (name, timestamp)
    • create sliding RDD and filter windows which cover multiple names
    • check if any window contains desired pattern.
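
To make the first option more concrete, here is a minimal Python sketch of the per-partition scan that could be handed to mapPartitions. It assumes records shaped as ((name, timestamp), value) and reads "at least N times" as "a strictly increasing run of at least n observations"; both are my assumptions, as are the names increasing_runs and find_runs_with_spark. The scan itself is plain Python and keeps only the current run in memory. The Spark wiring in find_runs_with_spark needs a live SparkContext and is shown only to indicate where the pieces would go.

```python
def increasing_runs(records, n):
    """Yield (name, start_ts, end_ts, length) for each maximal strictly
    increasing run of at least n observations.

    `records` is an iterator of ((name, timestamp), value) pairs already
    sorted by (name, timestamp), which is how a partition should look after
    repartitionAndSortWithinPartitions with a name-only partitioner.
    """
    current = None  # (name, start_ts, last_ts, last_value, length)
    for (name, ts), value in records:
        if current is not None and current[0] == name and value > current[3]:
            # Same name and still increasing: extend the current run.
            current = (name, current[1], ts, value, current[4] + 1)
        else:
            # Run ended (new name or non-increasing value): emit if long enough.
            if current is not None and current[4] >= n:
                yield (current[0], current[1], current[2], current[4])
            current = (name, ts, ts, value, 1)
    if current is not None and current[4] >= n:
        yield (current[0], current[1], current[2], current[4])


def find_runs_with_spark(sc, rows, n, num_partitions=4):
    """Hypothetical wiring; `rows` is a list of (name, timestamp, value)."""
    from pyspark.rdd import portable_hash  # imported lazily, needs PySpark

    keyed = sc.parallelize(rows).map(lambda r: ((r[0], r[1]), r[2]))
    return (keyed
            .repartitionAndSortWithinPartitions(
                numPartitions=num_partitions,
                # Partition on the name only, sort on the full (name, ts) key.
                partitionFunc=lambda key: portable_hash(key[0]))
            .mapPartitions(lambda part: increasing_runs(part, n))
            .collect())


# Local check of the scan on one already-sorted "partition".
sample = [
    (("a", 1), 1.0), (("a", 2), 2.0), (("a", 3), 3.0),
    (("a", 4), 2.5), (("a", 5), 2.6),
    (("b", 1), 5.0), (("b", 2), 6.0), (("b", 3), 7.0), (("b", 4), 8.0),
]
runs = list(increasing_runs(iter(sample), 3))
```

Because all records for a name end up in the same partition, no run can cross a partition boundary, so no post-processing is needed. The trade-off is that a very hot name still lands on a single partition.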

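For the second option (sortByKey plus mapPartitionsWithIndex), the extra work is stitching runs that cross partition boundaries. The sketch below is one way this could look, not a definitive implementation: partition_runs summarizes a sorted partition, always keeping its first and last run because they may continue in a neighbour, and stitch merges those boundary runs on the driver. The function names, the run tuple layout, and the "at least n observations" reading are my assumptions. The find_runs_sorted wiring needs a live SparkContext.

```python
def partition_runs(records, n):
    """Summarize one sorted partition as a list of maximal increasing runs.

    Each run is (name, start_ts, end_ts, first_value, last_value, length).
    The first and last runs are always kept; interior runs only if length >= n.
    """
    runs = []

    def close(run, is_last):
        if not runs or is_last or run[5] >= n:
            runs.append(run)

    cur = None
    for (name, ts), value in records:
        if cur is not None and cur[0] == name and value > cur[4]:
            cur = (name, cur[1], ts, cur[3], value, cur[5] + 1)
        else:
            if cur is not None:
                close(cur, is_last=False)
            cur = (name, ts, ts, value, value, 1)
    if cur is not None:
        close(cur, is_last=True)
    return runs


def stitch(per_partition, n):
    """Merge boundary runs; `per_partition` is ordered by partition index."""
    result, pending = [], None
    for runs in per_partition:
        for i, run in enumerate(runs):
            # Only the head of a partition can continue the previous tail.
            joins = (i == 0 and pending is not None
                     and pending[0] == run[0] and run[3] > pending[4])
            if joins:
                pending = (pending[0], pending[1], run[2],
                           pending[3], run[4], pending[5] + run[5])
            else:
                if pending is not None and pending[5] >= n:
                    result.append(pending)
                pending = run
    if pending is not None and pending[5] >= n:
        result.append(pending)
    return [(r[0], r[1], r[2], r[5]) for r in result]


def find_runs_sorted(sc, rows, n):
    """Hypothetical wiring; `rows` is a list of (name, timestamp, value)."""
    keyed = (sc.parallelize(rows)
             .keyBy(lambda r: (r[0], r[1]))
             .mapValues(lambda r: r[2]))
    indexed = (keyed.sortByKey()
               .mapPartitionsWithIndex(
                   lambda idx, part: [(idx, partition_runs(part, n))])
               .collect())
    ordered = sorted(indexed, key=lambda pair: pair[0])
    return stitch([runs for _idx, runs in ordered], n)


# Local check: one long run for "a" deliberately split over three partitions.
parts = [
    [(("a", 1), 1), (("a", 2), 2)],
    [(("a", 3), 3), (("a", 4), 4)],
    [(("a", 5), 5), (("a", 6), 0), (("b", 1), 1)],
    [(("b", 2), 2), (("b", 3), 3)],
]
stitched = stitch([partition_runs(iter(p), 3) for p in parts], 3)
```

Only the run summaries travel to the driver, not the raw records, which is what keeps this lighter than groupByKey.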

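For the third option, the per-window check is simple enough to sketch in plain Python. The sliding helper below is a local stand-in for the windows a sliding RDD would produce over the sorted data; it is not the mllib API itself. The record shape ((name, timestamp), value) and the names sliding, window_matches, and matching_windows are assumptions for illustration.

```python
from collections import deque


def sliding(iterable, size):
    """Yield consecutive fixed-size windows (as tuples) over `iterable`."""
    window = deque(maxlen=size)
    for item in iterable:
        window.append(item)
        if len(window) == size:
            yield tuple(window)


def window_matches(window):
    """True if the window covers a single name and its values strictly increase."""
    names = {name for (name, _ts), _value in window}
    if len(names) != 1:
        return False  # window straddles two names, so discard it
    values = [value for _key, value in window]
    return all(a < b for a, b in zip(values, values[1:]))


def matching_windows(sorted_records, n):
    """Return every window of n sorted records that matches the pattern."""
    return [w for w in sliding(sorted_records, n) if window_matches(w)]


records = [
    (("a", 1), 1.0), (("a", 2), 2.0), (("a", 3), 3.0),
    (("a", 4), 2.5), (("a", 5), 2.6),
    (("b", 1), 5.0), (("b", 2), 6.0), (("b", 3), 7.0), (("b", 4), 8.0),
]
hits = matching_windows(records, 3)
```

Note that a run longer than n shows up as several overlapping windows, so if you need maximal runs rather than "does the pattern occur", the windows still have to be merged afterwards.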
Source: https://stackoverflow.com/questions/35579619/detecting-repeating-consecutive-values-in-large-datasets-with-spark
