Parsing one terabyte of text and efficiently counting the number of occurrences of each word

野趣味 2020-11-30 17:21

Recently I came across an interview question: create an algorithm, in any language, that does the following:

  1. Read 1 terabyte of content
  2. Make a count of the occurrences of each word
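A single-machine baseline for step 2 is a streaming count that never loads the whole input into memory at once; this is a minimal sketch, not tuned for a terabyte:

```python
from collections import Counter

def count_words(lines):
    """Stream lines one at a time and tally word occurrences."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# Small in-memory sample; a real run would iterate over a file object.
sample = ["the quick brown fox", "the lazy dog", "the fox"]
print(count_words(sample).most_common(2))  # -> [('the', 3), ('fox', 2)]
```

At terabyte scale the bottleneck shifts to I/O and to the size of the vocabulary, which is what motivates the distributed answers below the question.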
16 Answers
  •  生来不讨喜
    2020-11-30 17:57

    Storm is the technology to look at. It separates the role of data input (spouts) from processing (bolts). Chapter 2 of the Storm book solves your exact problem and describes the system architecture very well - http://www.amazon.com/Getting-Started-Storm-Jonathan-Leibiusky/dp/1449324010

    Storm is more real-time processing, as opposed to batch processing with Hadoop. If your data is pre-existing, you can distribute the load across different spouts and spread it out to different bolts for processing.

    This approach will also support data growing beyond terabytes, since the data is analysed in real time as it arrives.
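    The spout/bolt split described above can be sketched in plain Python without Storm (the names `spout`, `route`, and `NUM_BOLTS` are illustrative, not Storm's API): a spout emits lines as they arrive, and each word is routed to a bolt by hash, so all counts for a given word land on the same partition and can be merged cheaply.

    ```python
    from collections import Counter

    NUM_BOLTS = 4  # illustrative partition count, not a Storm setting

    def spout(lines):
        """Data input: emit lines one at a time, as they arrive."""
        yield from lines

    def route(word, num_bolts=NUM_BOLTS):
        """Hash-partition so a given word always reaches the same bolt."""
        return hash(word) % num_bolts

    def run(lines):
        bolts = [Counter() for _ in range(NUM_BOLTS)]
        for line in spout(lines):
            for word in line.split():
                bolts[route(word)][word] += 1
        # Global totals are the disjoint union of the per-bolt counters.
        total = Counter()
        for bolt in bolts:
            total.update(bolt)
        return total

    print(run(["storm counts words", "storm streams words"])["storm"])  # -> 2
    ```

    In real Storm the routing step corresponds to a fields grouping on the word, which is what keeps the per-bolt counters disjoint and mergeable.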
