Suppose we want a system that keeps the top k most frequent words appearing in tweets over the last hour. How would we design it?
I can come up with hashmap, heap, log or MapReduce but I c
This class of problems is handled by data stream algorithms. In your particular case there are two that fit: "Lossy Counting" and "Sticky Sampling". This is the paper that explains them, or this one, with pictures. This is a more simplified introduction.
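To make Lossy Counting concrete, here is a minimal sketch (simplified from the paper; the class name, `epsilon` default, and pruning-by-bucket bookkeeping are my own illustrative choices). Reported counts undercount true counts by at most `epsilon * N`, where `N` is the number of items seen so far:

```python
# Minimal Lossy Counting sketch. Not production code: a single dict stands in
# for the bucketed data structure described in the paper.

class LossyCounter:
    def __init__(self, epsilon=0.001):
        self.epsilon = epsilon
        self.width = int(1 / epsilon)   # bucket width: prune every `width` items
        self.n = 0                      # total items seen
        self.counts = {}                # word -> (count, max possible undercount)
        self.bucket = 1                 # id of the current bucket

    def add(self, word):
        self.n += 1
        if word in self.counts:
            count, delta = self.counts[word]
            self.counts[word] = (count + 1, delta)
        else:
            # delta = bucket - 1 bounds how many occurrences we may have missed
            self.counts[word] = (1, self.bucket - 1)
        if self.n % self.width == 0:
            # drop entries whose upper bound falls at or below the bucket id
            self.counts = {w: (c, d) for w, (c, d) in self.counts.items()
                           if c + d > self.bucket}
            self.bucket += 1

    def top_k(self, k):
        # rank by the lower-bound count
        return sorted(self.counts.items(), key=lambda kv: -kv[1][0])[:k]
```

Memory stays bounded by roughly `1/epsilon` entries regardless of stream length, which is the whole point.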
Edit: (too long to fit into a comment)
Although these streaming algos do not discount expired data per se, one can run, for instance, 60 sliding windows, one for each minute of the hour, and then delete one and create a new one every minute. The window on top is used for querying, the others for updates only. This gives you a 1-minute resolution.
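The minute-bucket rotation above can be sketched as follows (my own illustrative structure: each bucket is a plain `Counter` here for clarity, but in practice each could be a Lossy Counting instance, and queries merge the live buckets):

```python
# Sketch of the 60-bucket rotation: one bucket per minute, oldest expires.
from collections import Counter, deque

class HourlyTopK:
    def __init__(self, minutes=60):
        # a bounded deque drops the oldest bucket automatically on append
        self.buckets = deque(maxlen=minutes)
        self.buckets.append(Counter())

    def rotate(self):
        # call once per minute (e.g. from a timer): expire the oldest minute
        self.buckets.append(Counter())

    def add(self, word):
        self.buckets[-1][word] += 1   # updates go to the current minute only

    def top_k(self, k):
        # queries merge all live minute buckets
        total = Counter()
        for b in self.buckets:
            total.update(b)
        return total.most_common(k)
```

Merging on every query is the simple version; maintaining a pre-merged view that is rebuilt once per minute, as the answer suggests, trades a little staleness for cheaper reads.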
Critics say that streaming algos are probabilistic and will not give you an exact count. While this is true, compare with, for instance, Rici's algo here: one does control the error bound, and can make it as low as desired. As your stream grows you would want to set it as a percentage of the stream size rather than as an absolute value.
Streaming algos are very memory efficient, which is the most important thing when crunching large streams in real time. Compare with Rici's exact algo, which requires a single host to keep all the data for the current sliding window in memory. It might not scale well: increase the rate from 100/s to 100k/s, or the window size from 1h to 7d, and you will run out of memory on a single host.
Hash tables, which are an essential part of Rici's algo, require one contiguous memory blob, which becomes more and more problematic as they grow.