Parsing one terabyte of text and efficiently counting the number of occurrences of each word

野趣味 2020-11-30 17:21

Recently I came across an interview question to create an algorithm, in any language, which should do the following:

  1. Read 1 terabyte of content
  2. Make a count for each recurring word
  3. Return the 10 most frequently occurring words
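
(For scale: the straightforward single-machine version is a streaming word count into one big hash table. Below is a minimal Python sketch; the file name and chunk size are illustrative assumptions, and the catch, as the answers discuss, is that the table of counts for a terabyte of text may itself outgrow memory.)

    from collections import Counter

    def count_words(path, chunk_size=1 << 20):
        # Stream the file in 1 MiB chunks so the text itself never has
        # to fit in memory; only the table of counts grows.
        counts = Counter()
        leftover = ""
        with open(path, encoding="utf-8", errors="replace") as f:
            while chunk := f.read(chunk_size):
                words = (leftover + chunk).split()
                # A chunk may end mid-word: hold the tail back until the
                # next chunk shows where the word really ends.
                leftover = words.pop() if words and not chunk[-1].isspace() else ""
                counts.update(words)
        if leftover:
            counts[leftover] += 1
        return counts

    print(count_words("corpus.txt").most_common(10))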
16 Answers
  •  醉酒成梦  2020-11-30 17:59

    Well, the first thought is to maintain a database, in the form of a hashtable/array or whatever, to store each word's occurrence count; but given the data size, I would rather do the following (a code sketch appears after this discussion):

    • Get the first 10 words found whose occurrence count is >= 2
    • Count how many times these words occur in the entire string, deleting them as you count
    • Start again; once you have two sets of 10 words each, keep the 10 most frequent words across both sets
    • Do the same for the rest of the string (which no longer contains these words).

    You can even try to be more efficient and start with the first 10 words found whose occurrence count is >= 5, for example, or higher; if 10 such words are not found, reduce this threshold until they are. This gives you a good chance of avoiding the memory-intensive storage of every word's occurrence count, which is a huge amount of data, and it can save scanning rounds (in a good case).

    But in the worst case you may need more rounds than a conventional algorithm would.
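
    A minimal Python sketch of this round-based scheme, assuming a hypothetical scan() helper that re-reads the corpus from disk and yields one word at a time (the helper, the starting threshold, and the run-to-exhaustion stopping rule are illustrative assumptions, not part of the answer above):

        from collections import Counter

        def top10_multipass(scan, threshold=5):
            excluded = set()    # words already counted exactly and "deleted"
            totals = Counter()  # exact counts gathered so far
            while True:
                # Step 1: scan until 10 not-yet-excluded words have each been
                # seen `threshold` times. `seen` is thrown away after every
                # round, which is where the hoped-for memory saving comes from.
                seen = Counter()
                candidates = set()
                for word in scan():
                    if word not in excluded:
                        seen[word] += 1
                        if seen[word] == threshold:
                            candidates.add(word)
                            if len(candidates) == 10:
                                break
                if not candidates:
                    if threshold > 1:
                        threshold -= 1  # the refinement: lower the bar, rescan
                        continue
                    break  # every remaining word has been counted; done
                # Step 2: count the candidates over the entire corpus, then
                # exclude ("delete") them from all later rounds.
                for word in scan():
                    if word in candidates:
                        totals[word] += 1
                excluded |= candidates
            return totals.most_common(10)

    Each round costs two full passes over the terabyte, which is exactly the worst case noted above: many rounds can end up far more expensive than a single conventional counting pass.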

    By the way, this is a problem I would try to solve with a functional programming language rather than OOP.
