Parsing one terabyte of text and efficiently counting the number of occurrences of each word

野趣味 2020-11-30 17:21

Recently I came across an interview question to create an algorithm, in any language, which should do the following:

  1. Read 1 terabyte of content
  2. Make a count of how many times each word occurs in that content
  3. Return the top K most frequently occurring words
16 Answers
  •  暗喜
     2020-11-30 17:45

    You can try a map-reduce approach for this task. The advantage of map-reduce is scalability: the same approach works for 1TB, 10TB, or 1PB, and you will not need to do much work to adapt your algorithm to the new scale. The framework will also take care of distributing the work among all the machines (and cores) in your cluster.

    First, create the (word, occurrences) pairs.
    The pseudocode for this will look something like this:

    map(document):
      for each word w:
         EmitIntermediate(w,"1")
    
    reduce(word,list):
       Emit(word,size(list))
    
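    A minimal single-machine sketch of this map/reduce step in Python (the file name `input.txt` is hypothetical; on a real cluster the framework handles the shuffle/grouping between map and reduce for you):

    from collections import defaultdict

    def map_phase(lines):
        """Emit (word, 1) pairs, mirroring EmitIntermediate(w, "1")."""
        for line in lines:
            for word in line.split():
                yield word, 1

    def shuffle(pairs):
        """Group intermediate pairs by key, as the framework does between map and reduce."""
        grouped = defaultdict(list)
        for word, count in pairs:
            grouped[word].append(count)
        return grouped

    def reduce_phase(grouped):
        """Emit (word, occurrences), mirroring Emit(word, size(list))."""
        return {word: len(counts) for word, counts in grouped.items()}

    with open("input.txt") as f:  # hypothetical input file
        counts = reduce_phase(shuffle(map_phase(f)))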

    Second, you can easily find the ones with the top K highest occurrences with a single iteration over the pairs; this thread explains the concept. The main idea is to hold a min-heap of the top K elements, and while iterating, make sure the heap always contains the top K elements seen so far. When you are done, the heap contains the top K elements.
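
    A short sketch of that min-heap pass, assuming `counts` is the (word, occurrences) mapping produced by the reduce step above:

    import heapq

    def top_k(counts, k):
        heap = []                                # min-heap of (occurrences, word)
        for word, n in counts.items():
            if len(heap) < k:
                heapq.heappush(heap, (n, word))
            elif n > heap[0][0]:                 # beats the smallest of the current top K
                heapq.heapreplace(heap, (n, word))
        return sorted(heap, reverse=True)        # most frequent first

    print(top_k(counts, 10))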

    A more scalable (though slower if you have only a few machines) alternative is to use the map-reduce sorting functionality: sort the data by occurrence count and just take the top K.
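
    Locally, that sort-then-take step amounts to a one-liner (again assuming the `counts` mapping from above); on a cluster the framework's sort phase does the equivalent across machines:

    top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]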
