Parsing one terabyte of text and efficiently counting the number of occurrences of each word

野趣味 2020-11-30 17:21

Recently I came across an interview question asking for an algorithm, in any language, that does the following:

Read 1 terabyte of content
Make a co

16 answers

眼角桃花 (original poster)

2020-11-30 17:47

I'd be quite tempted to use a DAWG (Wikipedia, and a C# writeup with more details). It's simple enough to add a count field on the leaf nodes, it's memory-efficient, and it performs very well for lookups.
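To make the "count field on the nodes" idea concrete, here is a minimal sketch in Python. It is an assumption-laden illustration, not the answer's actual code: it uses a plain trie rather than a true DAWG (a DAWG would additionally merge shared suffixes), and the names `TrieNode`, `CountingTrie`, `add` and `count` are hypothetical.

```python
class TrieNode:
    """One node per character; `count` is the number of words ending here."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}   # maps a character to the child TrieNode
        self.count = 0       # occurrences of the word that ends at this node


class CountingTrie:
    """Plain trie with a per-word counter (a simplification of a DAWG)."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, word):
        """Record one occurrence of `word`."""
        node = self.root
        for ch in word:
            child = node.children.get(ch)
            if child is None:
                child = TrieNode()
                node.children[ch] = child
            node = child
        node.count += 1

    def count(self, word):
        """Return how many times `word` was added (0 if never seen)."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.count
```

Words that share a prefix, such as "car" and "cart", share the nodes for that prefix, which is where the memory saving comes from.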

EDIT: Though have you tried simply using a Dictionary<string, int>, where <string, int> represents the word and its count? Perhaps you're trying to optimize too early?
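The simpler approach, a plain dictionary mapping each word to its count, could look roughly like the sketch below. It is written in Python rather than C#, reads the input line by line so the full text never has to fit in memory, and assumes that a "word" is a run of ASCII letters or apostrophes, compared case-insensitively. The names `WORD_RE`, `count_words` and `count_words_in_file` are hypothetical.

```python
import re
from collections import Counter

# Assumption: a word is a run of ASCII letters or apostrophes.
WORD_RE = re.compile(r"[A-Za-z']+")


def count_words(lines):
    """Count word occurrences from any iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Lowercase so "The" and "the" are counted as the same word.
        counts.update(WORD_RE.findall(line.lower()))
    return counts


def count_words_in_file(path, encoding="utf-8"):
    """Stream a file line by line; only the counts are held in memory."""
    with open(path, encoding=encoding, errors="replace") as f:
        return count_words(f)
```

Memory use here grows with the number of distinct words, not with the size of the input, so whether this is enough for a terabyte depends on the vocabulary of the text.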
Editor's note: This post originally linked to this Wikipedia article, which appears to be about another meaning of the term DAWG: a way of storing all substrings of one word, for efficient approximate string matching.