Parsing one terabyte of text and efficiently counting the number of occurrences of each word

野趣味 2020-11-30 17:21

Recently I came across an interview question: create an algorithm, in any language, that does the following

  1. Read 1 terabyte of content
  2. Make a count for each recurring word in that content
  3. Display the 10 most frequent words

16 Answers
  •  愿得一人
     2020-11-30 18:09

    As a quick general algorithm, I would do the following.

    Create a map whose keys are the words (the actual strings) and whose values are the counts for each word.
    
    for each word in content:
        if word is already a key in the map:
            increment the count associated with that key
        else:
            add a new key/value pair with the word as the key and a count of one
    done
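
    A minimal Python sketch of this counting step, assuming the input can be streamed line by line; the path content.txt and the helper count_words are placeholders I introduce for illustration, not part of the original answer:

        from collections import defaultdict

        def count_words(lines):
            """Count occurrences of each whitespace-separated word."""
            counts = defaultdict(int)      # word -> occurrence count
            for line in lines:
                for word in line.split():
                    counts[word] += 1      # missing keys start at 0, covering both branches above
            return counts

        # Stream the file line by line so the 1 TB input never has to fit in memory.
        # "content.txt" is a hypothetical path standing in for the real input.
        with open("content.txt", encoding="utf-8", errors="replace") as f:
            word_counts = count_words(f)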
    

    Then you could scan the map for the ten entries with the largest counts:

    
    create an array of size 10 holding (word, count) pairs
    
    for each entry in the map:
        if the entry's count is larger than the smallest count in the array:
            replace the pair with the smallest count by the current entry
    
    print all pairs in the array
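
    A matching Python sketch for this selection step, using heapq.nlargest in place of the hand-rolled size-10 array; word_counts below is a small stand-in for the map built in the previous sketch:

        import heapq

        # Stand-in for the word -> count map built in the counting step above.
        word_counts = {"the": 42, "cat": 7, "sat": 3}

        # Keep only the 10 entries with the largest counts; nlargest holds at most
        # 10 candidates at a time, mirroring the fixed-size array in the pseudocode.
        top_ten = heapq.nlargest(10, word_counts.items(), key=lambda pair: pair[1])

        for word, count in top_ten:
            print(f"{word}: {count}")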
    
