Parsing one terabyte of text and efficiently counting the number of occurrences of each word

野趣味 2020-11-30 17:21

Recently I came across an interview question: create an algorithm, in any language, that does the following:

  1. Read 1 terabyte of content
  2. Make a co
16 answers
  •  眼角桃花
    2020-11-30 17:47

    I'd be quite tempted to use a DAWG (Wikipedia, and a C# writeup with more details). It's simple enough to add a count field on the leaf nodes, and the structure is memory-efficient and performs very well for lookups.
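
To make the "count field on the nodes" idea concrete, here is a minimal sketch in Python. It uses a plain trie rather than a minimized DAWG (an assumption made for brevity: a real DAWG also merges shared suffixes, which takes extra care if each word needs its own counter). The names `TrieNode` and `WordTrie` are hypothetical and do not come from the linked writeup.

```python
class TrieNode:
    """One node per character; `count` is the number of words ending here."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # maps a character to the next TrieNode
        self.count = 0      # occurrences of the word that ends at this node


class WordTrie:
    """Plain trie with per-word counters (not a minimized DAWG)."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, word):
        # Walk (or create) the path for `word`, then bump its counter.
        node = self.root
        for ch in word:
            nxt = node.children.get(ch)
            if nxt is None:
                nxt = TrieNode()
                node.children[ch] = nxt
            node = nxt
        node.count += 1

    def count(self, word):
        # Follow the path; a missing edge means the word was never added.
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.count


trie = WordTrie()
for w in "the cat and the hat and the bat".split():
    trie.add(w)
```

Lookup cost is proportional to the length of the word, independent of how many words are stored. Prefixes that were never inserted as whole words (such as "th" here) report a count of zero.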

    EDIT: Have you tried simply using a Dictionary<string, int>, where <string, int> represents word and count? Perhaps you're trying to optimize too early.
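
The dictionary approach can be sketched as follows, with Python's `collections.Counter` standing in for the C# Dictionary. The function name `count_words`, the chunk size, the word pattern, and the lowercasing are all assumptions made for illustration; the question does not say how a "word" is defined. The input is read in fixed-size chunks so that the terabyte of text never has to fit in memory, and only the distinct words are stored.

```python
import io
import re
from collections import Counter

# Assumed definition of a word: a run of ASCII letters, digits, or apostrophes.
WORD_RE = re.compile(r"[A-Za-z0-9']+")
# A run of word characters touching the end of the buffer (possibly cut off).
TAIL_RE = re.compile(r"[A-Za-z0-9']+\Z")


def count_words(stream, chunk_size=1 << 20):
    """Count case-insensitive word occurrences from a text stream."""
    counts = Counter()
    tail = ""  # partial word carried over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        text = tail + chunk
        # Hold back a trailing word fragment: the next chunk may continue it.
        m = TAIL_RE.search(text)
        if m:
            tail = m.group(0)
            text = text[:m.start()]
        else:
            tail = ""
        counts.update(w.lower() for w in WORD_RE.findall(text))
    if tail:
        # Whatever is left at end of input is a complete word.
        counts[tail.lower()] += 1
    return counts


# Tiny in-memory example; a real run would pass an open text file instead.
sample = io.StringIO("The cat and the hat. The END")
counts = count_words(sample, chunk_size=4)
```

Memory use grows with the number of distinct words rather than with the size of the input, which is why a plain dictionary may well be enough before reaching for a more specialized structure.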

    Editor's note: This post originally linked to this Wikipedia article, which appears to be about another meaning of the term DAWG: a way of storing all substrings of one word for efficient approximate string matching.
