Parsing one terabyte of text and efficiently counting the number of occurrences of each word

野趣味 2020-11-30 17:21

Recently I came across an interview question to create an algorithm, in any language, which should do the following:

  1. Read 1 terabyte of content
  2. Make a count for each recurring word
  3. Return the 10 most frequently occurring words
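
(For scale: the straightforward single-machine version is a streaming word count into one big hash table. Below is a minimal Python sketch; the file name and chunk size are illustrative assumptions, and the catch, as the answers discuss, is that the table of counts for a terabyte of text may itself outgrow memory.)

    from collections import Counter

    def count_words(path, chunk_size=1 << 20):
        # Stream the file in 1 MiB chunks so the text itself never has
        # to fit in memory; only the table of counts grows.
        counts = Counter()
        leftover = ""
        with open(path, encoding="utf-8", errors="replace") as f:
            while chunk := f.read(chunk_size):
                words = (leftover + chunk).split()
                # A chunk may end mid-word: hold the tail back until the
                # next chunk shows where the word really ends.
                leftover = words.pop() if words and not chunk[-1].isspace() else ""
                counts.update(words)
        if leftover:
            counts[leftover] += 1
        return counts

    print(count_words("corpus.txt").most_common(10))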
16 Answers
  •  醉酒成梦  2020-11-30 17:59

    Well, the first thought is to maintain a database, in the form of a hashtable/array or whatever, to store each word's occurrence count; but given the data size, I would rather do the following (a code sketch appears after this discussion):

    • Get the first 10 words found whose occurrence count is >= 2
    • Count how many times these words occur in the entire string, deleting them as you count
    • Start again; once you have two sets of 10 words each, keep the 10 most frequent words across both sets
    • Do the same for the rest of the string (which no longer contains these words).

    You can even try to be more efficient and start with the first 10 words found whose occurrence count is >= 5, for example, or higher; if 10 such words are not found, reduce this threshold until they are. This gives you a good chance of avoiding the memory-intensive storage of every word's occurrence count, which is a huge amount of data, and it can save scanning rounds (in a good case).

    But in the worst case you may need more rounds than a conventional algorithm would.
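
    A minimal Python sketch of this round-based scheme, assuming a hypothetical scan() helper that re-reads the corpus from disk and yields one word at a time (the helper, the starting threshold, and the run-to-exhaustion stopping rule are illustrative assumptions, not part of the answer above):

        from collections import Counter

        def top10_multipass(scan, threshold=5):
            excluded = set()    # words already counted exactly and "deleted"
            totals = Counter()  # exact counts gathered so far
            while True:
                # Step 1: scan until 10 not-yet-excluded words have each been
                # seen `threshold` times. `seen` is thrown away after every
                # round, which is where the hoped-for memory saving comes from.
                seen = Counter()
                candidates = set()
                for word in scan():
                    if word not in excluded:
                        seen[word] += 1
                        if seen[word] == threshold:
                            candidates.add(word)
                            if len(candidates) == 10:
                                break
                if not candidates:
                    if threshold > 1:
                        threshold -= 1  # the refinement: lower the bar, rescan
                        continue
                    break  # every remaining word has been counted; done
                # Step 2: count the candidates over the entire corpus, then
                # exclude ("delete") them from all later rounds.
                for word in scan():
                    if word in candidates:
                        totals[word] += 1
                excluded |= candidates
            return totals.most_common(10)

    Each round costs two full passes over the terabyte, which is exactly the worst case noted above: many rounds can end up far more expensive than a single conventional counting pass.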

    By the way, this is a problem I would try to solve with a functional programming language rather than OOP.
