Mongo: count the number of word occurrences in a set of documents

北荒 2020-12-08 16:19

I have a set of documents in Mongo. Say:

[
    { summary: "This is good" },
    { summary: "This is bad" },
    { summary: "Something that is neither good


        
4 answers
  • 2020-12-08 17:03

    A basic MapReduce example

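    var m = function() {
        // Skip documents that have no summary string to split.
        if (typeof this.summary !== "string") {
            return;
        }
        var words = this.summary.split(" ");
        for (var i = 0; i < words.length; i++) {
            if (words[i]) {  // ignore empty strings produced by repeated spaces
                emit(words[i].toLowerCase(), 1);
            }
        }
    };

    // Sum the values instead of counting them: reduce may be re-invoked on
    // its own earlier output, so v.length would undercount.
    var r = function(k, v) {
        return Array.sum(v);
    };

    db.collection.mapReduce(
        m, r, { out: { merge: "words_count" } }
    );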
    

    This inserts the word counts into a collection named words_count, which you can sort (and index).

    Note that it doesn't use stemming, strip punctuation, handle stop words, etc.
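
    The normalization steps mentioned in the note above could be sketched in plain JavaScript roughly as follows. The helper name normalizeWords and the short stop-word list are hypothetical, chosen only for illustration, and stemming is still not attempted.

```javascript
// Hypothetical stop-word list; a real one would be much longer.
var STOP_WORDS = ["a", "an", "the", "is", "or", "that"];

// Hypothetical helper: lowercase, strip punctuation, drop stop words.
function normalizeWords(text) {
    return text
        .toLowerCase()
        .replace(/[^a-z0-9\s]/g, " ")   // blank out everything except ASCII letters, digits, whitespace
        .split(/\s+/)                   // split on runs of whitespace
        .filter(function(w) {
            // drop empty strings and stop words
            return w.length > 0 && STOP_WORDS.indexOf(w) === -1;
        });
}

var sample = normalizeWords("This is good, or bad!");
// sample is ["this", "good", "bad"]
```

    Since the map function runs on the server, logic like this would most likely have to be inlined into the map function body rather than called as a shell-side helper.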

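    Also note that you can optimize the map function by accumulating the repeated occurrences of each word within a document and emitting that count, rather than emitting 1 for every occurrence. The reduce function then has to sum the emitted values.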
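
    A minimal sketch of that optimization, assuming the same summary field and words_count output collection as above. The map function tallies words per document before emitting, and the reduce function adds the values up instead of counting them.

```javascript
// Map function that tallies words within one document before emitting,
// so each distinct word is emitted once per document with its local count.
var m = function() {
    if (typeof this.summary !== "string") {
        return;                             // skip documents without a summary
    }
    var words = this.summary.toLowerCase().split(" ");
    var counts = Object.create(null);       // no inherited keys such as "constructor"
    for (var i = 0; i < words.length; i++) {
        if (words[i]) {                     // ignore empty strings from repeated spaces
            counts[words[i]] = (counts[words[i]] || 0) + 1;
        }
    }
    for (var w in counts) {
        emit(w, counts[w]);
    }
};

// Reduce function that adds the per-document counts together.
var r = function(k, v) {
    var total = 0;
    for (var i = 0; i < v.length; i++) {
        total += v[i];
    }
    return total;
};

// In the mongo shell (not run here):
// db.collection.mapReduce(m, r, { out: { merge: "words_count" } });
```

    The saving only matters when words repeat inside a single summary; for short summaries the simpler version emits about the same number of pairs.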

  • 2020-12-08 17:04

    You can use $split (available since MongoDB 3.4). Try the query below:

    db.summary.aggregate([
    { $project : { summary : { $split: ["$summary", " "] } } },
    { $unwind : "$summary" },
    { $group : { _id:  "$summary" , total : { "$sum" : 1 } } },
    { $sort : { total : -1 } }
    ]);
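
    The query above is case-sensitive and counts the empty strings that repeated spaces produce. A possible variant, given only as a sketch, lowercases the text with $toLower before splitting and filters out empty tokens. The field name word is an arbitrary choice made for this example.

```javascript
// Variant of the pipeline above: lowercase before splitting and drop empty tokens.
// The collection name "summary" is taken from the query above.
var pipeline = [
    { $project: { word: { $split: [{ $toLower: "$summary" }, " "] } } },
    { $unwind: "$word" },
    { $match: { word: { $ne: "" } } },          // repeated spaces yield empty strings
    { $group: { _id: "$word", total: { $sum: 1 } } },
    { $sort: { total: -1 } }
];

// In the mongo shell (not run here):
// db.summary.aggregate(pipeline);
```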
    
  • 2020-12-08 17:12

    This is an old question, but since MongoDB 4.2 it can be done with $regexFindAll.

    db.summaries.aggregate([
      {$project: {
        occurences: {
          $regexFindAll: {
            input: '$summary',
            regex: /\b\w+\b/, // match words
          }
        }
      }},
      {$unwind: '$occurences'},
      {$group: {
        _id: '$occurences.match', // group by each word
        totalOccurences: {
          $sum: 1 // add up total occurences
        }
      }},
      {$sort: {
        totalOccurences: -1
      }}
    ]);
    

    This will output docs in the following format:

    {
      _id: "matchedwordstring",
      totalOccurences: number
    }
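
    To check what to expect without a server, the pipeline can be imitated in plain JavaScript. This is only a local approximation: countWords is a hypothetical helper name, and it assumes that the JavaScript regex engine and the one used by MongoDB agree on this simple pattern for ASCII text.

```javascript
// Local imitation of the pipeline above, for checking expectations without a server.
// countWords is a hypothetical helper name.
function countWords(docs) {
    var totals = Object.create(null);
    docs.forEach(function(doc) {
        // The g flag plays the role of $regexFindAll: return every match, not just the first.
        var matches = (doc.summary || "").match(/\b\w+\b/g) || [];
        matches.forEach(function(word) {
            totals[word] = (totals[word] || 0) + 1;
        });
    });
    return Object.keys(totals)
        .map(function(word) { return { _id: word, totalOccurences: totals[word] }; })
        .sort(function(a, b) { return b.totalOccurences - a.totalOccurences; });
}

var result = countWords([
    { summary: "This is good" },
    { summary: "This is bad" }
]);
// "This" and "is" each get totalOccurences 2; "good" and "bad" get 1.
```

    Matching is case-sensitive, so "This" and "this" would be counted as different words unless the input is lowercased first.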
    
  • 2020-12-08 17:18

    MapReduce might be a good fit: it can process the documents on the server without any manipulation on the client (there isn't a feature to split a string on the DB server; see the open issue).

    Start with the map function. In the example below (which likely needs to be more robust), each document is passed to the map function (as this). The code looks for the summary field and if it's there, lowercases it, splits on a space, and then emits a 1 for each word found.

    var map = function() {  
        var summary = this.summary;
        if (summary) { 
            // quick lowercase to normalize per your requirements
            summary = summary.toLowerCase().split(" "); 
            for (var i = summary.length - 1; i >= 0; i--) {
                // might want to remove punctuation, etc. here
                if (summary[i])  {      // make sure there's something
                   emit(summary[i], 1); // store a 1 for each word
                }
            }
        }
    };
    

    The reduce function then sums the values emitted by the map function and returns a single total for each word.

    var reduce = function( key, values ) {    
        var count = 0;    
        values.forEach(function(v) {            
            count +=v;    
        });
        return count;
    }
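
    Summing matters here because MongoDB may call reduce more than once for the same key and feed an earlier result back in as one of the values. The sketch below imitates that locally with plain arrays. The names sumReduce and lengthReduce are made up for the comparison, and the length-based version appears only as a counter-example.

```javascript
// Sum-based reduce, same logic as the function above.
var sumReduce = function(key, values) {
    var count = 0;
    values.forEach(function(v) { count += v; });
    return count;
};

// Length-based reduce, shown only as a counter-example.
var lengthReduce = function(key, values) {
    return values.length;
};

var ones = [1, 1, 1, 1, 1];   // five emits for one word

// One pass over all values.
var direct = sumReduce("is", ones);                                          // 5

// Two passes: reduce a partial batch, then feed the result back in.
var partial = sumReduce("is", ones.slice(0, 3));                             // 3
var rereduced = sumReduce("is", [partial].concat(ones.slice(3)));            // 5

// The same two-pass scheme with the length-based version loses counts.
var badPartial = lengthReduce("is", ones.slice(0, 3));                       // 3
var badRereduced = lengthReduce("is", [badPartial].concat(ones.slice(3)));   // 3, not 5
```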
    

    Finally, execute the mapReduce:

    > db.so.mapReduce(map, reduce, {out: "word_count"})
    

    The results with your sample data:

    > db.word_count.find().sort({value:-1})
    { "_id" : "is", "value" : 3 }
    { "_id" : "bad", "value" : 2 }
    { "_id" : "good", "value" : 2 }
    { "_id" : "this", "value" : 2 }
    { "_id" : "neither", "value" : 1 }
    { "_id" : "or", "value" : 1 }
    { "_id" : "something", "value" : 1 }
    { "_id" : "that", "value" : 1 }
    