Realtime tracking of top 100 twitter words per min/hour/day

Submitted by 烈酒焚心 on 2019-12-31 15:47:12

Question


I recently came across this interview question:

Given a continuous twitter feed, design an algorithm to return the 100 most
frequent words used at this minute, this hour and this day.

I was thinking of a system with a hash map of word -> count linked to 3 min-heaps for the current min, hour and day.

Every incoming message is tokenized and sanitized, and the word counts are updated in the hash map (with an increase-key in the heaps if the word already exists in them).

If a word doesn't exist in a heap (and the heap size == 100), check whether its frequency is greater than the heap's minimum; if so, extract-min and insert the word.

Are there better ways of doing this?
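For concreteness, a minimal sketch of what I had in mind (names like `WordTracker` are just illustrative; Python's `heapq` has no increase-key, so this sketch recomputes the top words with `nlargest`, which keeps a size-k min-heap internally, instead of updating the heaps in place):

```python
import heapq
import re
from collections import Counter

class WordTracker:
    """Counts words from a feed and reports the current top-k words."""

    def __init__(self, k=100):
        self.k = k
        self.counts = Counter()

    def ingest(self, message):
        # Tokenize and sanitize: lowercase, keep alphanumeric tokens.
        for word in re.findall(r"[a-z0-9']+", message.lower()):
            self.counts[word] += 1

    def top_k(self):
        # nlargest maintains a size-k min-heap over the items: O(n log k).
        return heapq.nlargest(self.k, self.counts.items(),
                              key=lambda kv: kv[1])

t = WordTracker(k=3)
t.ingest("the cat sat on the mat the cat")
print(t.top_k())  # highest counts first: ('the', 3), ('cat', 2), ...
```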


Answer 1:


Your algorithm is a good start, but it is not going to produce correct results. The problem is that the hash tables, as you describe them, are a one-way street: once a word gets added, it stays counted forever.

You need an array of 1440 (24*60) word+count hash maps organized the way that you describe; these are your minute-by-minute counts. You need two additional hash maps - for the rolling total of the hour and the day.

Define two operations on hash maps - add and subtract, with the semantic of merging counts of identical words, and removing words when their count drops to zero.
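These two operations might be sketched as follows (plain dicts, illustrative names):

```python
def add_counts(total, delta):
    """Merge delta's counts into total."""
    for word, n in delta.items():
        total[word] = total.get(word, 0) + n

def subtract_counts(total, delta):
    """Subtract delta's counts from total; remove words that drop to zero."""
    for word, n in delta.items():
        remaining = total[word] - n
        if remaining:
            total[word] = remaining
        else:
            del total[word]
```

Removing zero-count words matters: without it the rolling totals grow without bound as vocabulary churns.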

Each minute you start a new hash map and update counts from the feed. At the end of the minute, place that hash map into the array slot for the current minute and add it to the rolling totals for the hour and the day. Then subtract the hash map from an hour ago from the hourly running total, and subtract the hash map from 24 hours ago from the daily running total.
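The rollover step might look like the sketch below (a `deque` stands in for the 1440-slot array; `collections.Counter` supplies the add/subtract semantics, since its in-place `-=` drops words whose count reaches zero):

```python
from collections import Counter, deque

MINUTES_PER_HOUR = 60
MINUTES_PER_DAY = 24 * 60

class RollingCounts:
    """Per-minute word counts plus rolling hour and day totals."""

    def __init__(self):
        self.minutes = deque()       # finished minute maps, oldest first
        self.hour_total = Counter()
        self.day_total = Counter()
        self.current = Counter()     # counts for the minute in progress

    def ingest(self, word):
        self.current[word] += 1

    def end_of_minute(self):
        finished = self.current
        self.current = Counter()
        self.minutes.append(finished)
        self.hour_total += finished          # add to both rolling totals
        self.day_total += finished
        # Subtract the minute that just fell out of each window.
        if len(self.minutes) > MINUTES_PER_HOUR:
            expired = self.minutes[-MINUTES_PER_HOUR - 1]
            self.hour_total -= expired       # Counter '-=' drops zeros
        if len(self.minutes) > MINUTES_PER_DAY:
            self.day_total -= self.minutes.popleft()
```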

Finally, you need a way to produce the top 100 words given a hash map. This should be a trivial task - add items to an array of word+count entries, sort on the count, and keep the top 100.
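That final step, as described, is a few lines (a sketch using a full sort; see the other answer for a faster selection-based variant):

```python
def top_words(counts, k=100):
    """Return the k most frequent (word, count) pairs from a hash map."""
    entries = list(counts.items())                     # word+count array
    entries.sort(key=lambda kv: kv[1], reverse=True)   # O(n log n)
    return entries[:k]
```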




Answer 2:


dasblinkenlight makes a good point about the mistake of never removing items from your hash map.

There is one more thing to add, though: to actually compute the top K words for a given minute/hour/day, it is faster to use partitioning (O(n)) rather than sorting (O(n log n)):

  1. Dump the hash map of a minute/hour/day's word counts into an array: O(n)
  2. Use median-of-medians selection to find the K-th element: O(n)
  3. Partition the array around the K-th element: O(n)

HTH.



Source: https://stackoverflow.com/questions/10189685/realtime-tracking-of-top-100-twitter-words-per-min-hour-day
