Parsing one terabyte of text and efficiently counting the number of occurrences of each word

野趣味 2020-11-30 17:21

Recently I came across an interview question: create an algorithm, in any language, that does the following:

  1. Read 1 terabyte of content
  2. Make a count for each word occurring in that content
  3. Display the 10 most frequently occurring words
16 Answers
  心在旅途 2020-11-30 18:05

    Very interesting question. It is more a problem of logical analysis than of coding. If we assume English text made up of valid sentences, it becomes easier.

    You don't have to count every word, only those whose length is at most the average word length of the language in question (about 5.1 characters for English). With that filter the dictionary stays small, so you will not have memory problems.
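
    Below is a minimal sketch of that length-filter heuristic in Python. The 5-character cap and the sample input are my own illustrative assumptions, not part of the original answer:

        from collections import Counter

        MAX_LEN = 5  # heuristic cap near the average English word length

        def count_short_words(tokens):
            # Only short words enter the dictionary, which bounds its size.
            return Counter(w for w in tokens if len(w) <= MAX_LEN)

        print(count_short_words("the cat sat on the extraordinary mat".split()))
        # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})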

    As for reading the file, choose a parallel approach: read chunks of a size of your choice, adjusting the boundaries to fall on whitespace. If you read 1 MB chunks, for example, every chunk except the first should be widened slightly (+22 bytes on the left and +22 bytes on the right, where 22 is the length of the longest common English word, if I remember correctly) so that a word straddling a boundary is seen whole. For the parallel processing you will need either a concurrent dictionary or per-worker local collections that you merge at the end; a sketch follows below.
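
    Here is a minimal sketch of that chunked, parallel scheme in Python. The file name, chunk size, padding, and tokenizing regex are illustrative assumptions; also, instead of widening every chunk on both sides, this version pads only past the right boundary and assigns each word to the chunk in which it starts, which avoids counting a straddling word twice:

        import os
        import re
        from collections import Counter
        from multiprocessing import Pool

        PATH = "corpus.txt"   # hypothetical input file
        CHUNK = 1 << 20       # 1 MB chunks, as suggested above
        PAD = 32              # bytes read past the boundary; must exceed the
                              # longest expected word (22 for common English)

        WORD = re.compile(rb"[A-Za-z']+")

        def count_chunk(start):
            # Count only the words that *start* inside [start, start + CHUNK).
            lead = 1 if start > 0 else 0
            with open(PATH, "rb") as f:
                f.seek(start - lead)          # one extra byte to spot fragments
                data = f.read(lead + CHUNK + PAD)
            counts = Counter()
            limit = lead + CHUNK              # words must begin before this offset
            for m in WORD.finditer(data):
                if m.start() == 0 and lead:   # tail of the previous chunk's word
                    continue
                if m.start() >= limit:        # owned by the next chunk
                    break
                counts[m.group().lower().decode("ascii")] += 1
            return counts

        if __name__ == "__main__":
            size = os.path.getsize(PATH)
            total = Counter()
            with Pool() as pool:              # per-worker counts, merged below
                for part in pool.imap_unordered(count_chunk, range(0, size, CHUNK)):
                    total.update(part)
            print(total.most_common(10))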

    Keep in mind that with ordinary content the resulting top ten will normally be a subset of a standard stop-word list (which suggests a reverse approach, filtering stop words out before ranking, that is equally valid as long as the file's content is ordinary prose).
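
    As a quick sketch of that observation: since the raw top ten of ordinary prose is dominated by stop words, you can drop a stop-word list before ranking. The tiny list here is only an illustrative sample:

        from collections import Counter

        STOP = {"the", "of", "and", "a", "to", "in", "is", "it", "that", "for"}

        def top_content_words(counts, n=10):
            # Rank everything, drop stop words, keep the first n survivors.
            return [(w, c) for w, c in counts.most_common() if w not in STOP][:n]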
