Parsing one terabyte of text and efficiently counting the number of occurrences of each word

野趣味 2020-11-30 17:21

Recently I came across an interview question to create an algorithm, in any language, which should do the following:

  1. Read 1 terabyte of content
  2. Make a count of how many times each word occurs in that content
  3. Return the top K most frequently occurring words
16 Answers
  •  暗喜
     2020-11-30 17:45

    You can try a map-reduce approach for this task. The advantage of map-reduce is scalability: the same approach works for 1TB, 10TB, or 1PB, and you will not need to do much work to adapt your algorithm to the new scale. The framework will also take care of distributing the work among all the machines (and cores) in your cluster.

    First, create the (word, occurrences) pairs.
    The pseudocode for this will look something like this:

    map(document):
      for each word w:
         EmitIntermediate(w,"1")
    
    reduce(word,list):
       Emit(word,size(list))
    
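    A minimal single-machine sketch of this map/reduce step in Python (the file name `input.txt` is hypothetical; on a real cluster the framework handles the shuffle/grouping between map and reduce for you):

    from collections import defaultdict

    def map_phase(lines):
        """Emit (word, 1) pairs, mirroring EmitIntermediate(w, "1")."""
        for line in lines:
            for word in line.split():
                yield word, 1

    def shuffle(pairs):
        """Group intermediate pairs by key, as the framework does between map and reduce."""
        grouped = defaultdict(list)
        for word, count in pairs:
            grouped[word].append(count)
        return grouped

    def reduce_phase(grouped):
        """Emit (word, occurrences), mirroring Emit(word, size(list))."""
        return {word: len(counts) for word, counts in grouped.items()}

    with open("input.txt") as f:  # hypothetical input file
        counts = reduce_phase(shuffle(map_phase(f)))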

    Second, you can easily find the ones with the top K highest occurrences with a single iteration over the pairs; this thread explains the concept. The main idea is to hold a min-heap of the top K elements, and while iterating, make sure the heap always contains the top K elements seen so far. When you are done, the heap contains the top K elements.
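
    A short sketch of that min-heap pass, assuming `counts` is the (word, occurrences) mapping produced by the reduce step above:

    import heapq

    def top_k(counts, k):
        heap = []                                # min-heap of (occurrences, word)
        for word, n in counts.items():
            if len(heap) < k:
                heapq.heappush(heap, (n, word))
            elif n > heap[0][0]:                 # beats the smallest of the current top K
                heapq.heapreplace(heap, (n, word))
        return sorted(heap, reverse=True)        # most frequent first

    print(top_k(counts, 10))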

    A more scalable (though slower if you have only a few machines) alternative is to use the map-reduce sorting functionality: sort the data by occurrence count and just take the top K.
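
    Locally, that sort-then-take step amounts to a one-liner (again assuming the `counts` mapping from above); on a cluster the framework's sort phase does the equivalent across machines:

    top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]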
