Parsing one terabyte of text and efficiently counting the number of occurrences of each word

野趣味 2020-11-30 17:21

Recently I came across an interview question: create an algorithm, in any language, that does the following:

  1. Read 1 terabyte of content
  2. Make a co
16 answers
  •  庸人自扰
    2020-11-30 17:54

    Try to think of a specialized data structure for this kind of problem. In this case, a tree such as a trie stores strings by shared prefix, which is very efficient. A second way is to build your own word-counting solution. I guess this terabyte of data would be in English, and English has roughly 600,000 words in general, so it should be possible to store only the distinct words and count how often each one repeats. This solution will also need a regex to strip out special characters. I'm pretty sure the first solution will be faster.
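As a rough illustration of the second approach (a plain count table plus a regex), here is a minimal sketch. The class name `StreamingWordCounter` is hypothetical, and it assumes that a word is a run of ASCII letters, that case does not matter, and that the input contains line breaks so it can be read one line at a time. Counts are stored as `long` because a terabyte of text can repeat a word more than 2^31 times.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original answer.
class StreamingWordCounter {
    // Assumption: anything that is not an ASCII letter separates words.
    private static final Pattern NON_LETTERS = Pattern.compile("[^A-Za-z]+");

    // Reads one line at a time, so memory holds only the current line
    // and the table of distinct words, never the whole input.
    static Map<String, Long> count(Reader source) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String token : NON_LETTERS.split(line)) {
                    if (token.isEmpty()) {
                        continue; // split() can yield an empty first token
                    }
                    counts.merge(token.toLowerCase(Locale.ROOT), 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        Map<String, Long> counts = count(new StringReader("The cat, the hat.\nThe end!"));
        System.out.println(counts.get("the")); // prints 3
    }
}
```

For a real file, the `StringReader` would be replaced by a reader over the file. The memory needed then depends on the number of distinct words, not on the size of the input.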

    http://en.wikipedia.org/wiki/Trie

    Here is an implementation of a trie in Java:
    http://algs4.cs.princeton.edu/52trie/TrieST.java.html
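To show how a trie can hold the counts directly, here is a minimal sketch. It is not the `TrieST` class linked above; `WordCountTrie` and its methods are hypothetical names, and it assumes the words have already been split and cleaned.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal trie for illustration; not the linked TrieST class.
class WordCountTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        long count; // number of times the word ending at this node was added
    }

    private final Node root = new Node();

    // Walks one node per character, creating missing nodes,
    // then increments the count stored at the final node.
    void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.count++;
    }

    // Returns 0 for words never added, including bare prefixes of added words.
    long count(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }

    public static void main(String[] args) {
        WordCountTrie trie = new WordCountTrie();
        for (String w : "to be or not to be".split(" ")) {
            trie.add(w);
        }
        System.out.println(trie.count("to")); // prints 2
        System.out.println(trie.count("b"));  // prints 0
    }
}
```

Words that share a prefix share nodes, which is where the space saving comes from. Whether this is actually faster than a hash map depends on the data and would need measuring.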
