How does MongoDB handle document length in a text index and text score?

情深已故 2020-12-18 12:01

I have a collection whose documents contain widely varying amounts of text, and it appears that documents with more text get significantly higher textScores. Of course, the mor

2 Answers
  • Scoring is based on the number of stemmed matches, but there is also a built-in coefficient which adjusts the score for matches relative to total field length (with stopwords removed). If your longer text includes more relevant words to a query, this will add to the score. Longer text which does not match a query will reduce the score.

    Snippet from MongoDB 3.2 source code on GitHub (src/mongo/db/fts/fts_spec.cpp):

        for (ScoreHelperMap::const_iterator i = terms.begin(); i != terms.end(); ++i) {
            const string& term = i->first;
            const ScoreHelperStruct& data = i->second;
    
            // in order to adjust weights as a function of term count as it
            // relates to total field length. ie. is this the only word or
            // a frequently occuring term? or does it only show up once in
            // a long block of text?
    
            double coeff = (0.5 * data.count / numTokens) + 0.5;
    
            // if term is identical to the raw form of the
            // field (untokenized) give it a small boost.
            double adjustment = 1;
            if (raw.size() == term.length() && raw.equalCaseInsensitive(term))
                adjustment += 0.1;
    
            double& score = (*docScores)[term];
            score += (weight * data.freq * coeff * adjustment);
            verify(score <= MAX_WEIGHT);
        }
    }
    

    Setting up some test data to see the effect of the length coefficient on a very simple example:

    db.articles.insert([
        { headline: "Rock" },
        { headline: "Rocks" },
        { headline: "Rock paper" },
        { headline: "Rock paper scissors" },
    ])
    
    db.articles.createIndex({ "headline": "text"})
    
    db.articles.find(
        { $text: { $search: "rock" }},
        { _id:0, headline:1, score: { $meta: "textScore" }}
    ).sort({ score: { $meta: "textScore" }})
    

    Annotated results:

    // Exact match of raw term to indexed field
    // Coefficient is 1, plus 0.1 bonus for identical match of raw term
    {
      "headline": "Rock",
      "score": 1.1
    }
    
    // Match of stemmed term to indexed field ("rocks" stems to "rock")
    // Coefficient is 1
    {
      "headline": "Rocks",
      "score": 1
    }
    
    // Two terms, one matching
    // Coefficient is 0.75: (0.5 * 1 match / 2 terms) + 0.5
    {
      "headline": "Rock paper",
      "score": 0.75
    }
    
    // Three terms, one matching
    // Coefficient is approximately 0.67: (0.5 * 1 match / 3 terms) + 0.5
    {
      "headline": "Rock paper scissors",
      "score": 0.6666666666666666
    }
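
The arithmetic in the annotated results can be reproduced outside the database. The following is a minimal sketch in plain JavaScript, assuming a single-field index with the default weight of 1 and a query term that occurs once in the field, so that `freq` is taken as 1 (how `data.freq` is derived is not shown in the quoted snippet). The function name `sketchTermScore` and its parameter names are hypothetical; they are not part of MongoDB.

```javascript
// Hypothetical helper mirroring the loop body quoted from fts_spec.cpp.
// Not a MongoDB API. freq defaults to 1 as an assumption (single occurrence).
function sketchTermScore({ count, numTokens, weight = 1, freq = 1, exactRawMatch = false }) {
  // Length coefficient: above 0.5 for a rare term in a long field, 1 when
  // every token in the field is the matching term.
  const coeff = (0.5 * count) / numTokens + 0.5;
  // Small boost when the whole raw field equals the term (case-insensitive).
  const adjustment = exactRawMatch ? 1 + 0.1 : 1;
  return weight * freq * coeff * adjustment;
}

// Token counts are read off the four test headlines (no stop words involved).
const examples = [
  { headline: "Rock", count: 1, numTokens: 1, exactRawMatch: true },
  { headline: "Rocks", count: 1, numTokens: 1 },
  { headline: "Rock paper", count: 1, numTokens: 2 },
  { headline: "Rock paper scissors", count: 1, numTokens: 3 },
];

const scores = examples.map((e) => ({ headline: e.headline, score: sketchTermScore(e) }));
for (const { headline, score } of scores) {
  console.log(headline, score);
}
```

With these inputs the sketch yields the same ordering as the query results: the exact match first, then the stemmed match, then the two-token and three-token headlines.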
    
  • 2020-12-18 12:43

    MongoDB counts every occurrence of a search term in a document, and the score is derived from those counts.

    To change how much each field contributes, you can assign weights to the indexed fields. The following example is taken from the MongoDB documentation:

    db.blog.createIndex(
       {
         content: "text",
         keywords: "text",
         about: "text"
       },
       {
         weights: {
           content: 10,
           keywords: 5
         },
         name: "TextIndex"
       }
    )
    

    The text index has the following fields and weights:

    content has a weight of 10, keywords has a weight of 5, and about has the default weight of 1.

    These weights denote the relative significance of the indexed fields to each other. For instance, a term match in the content field has 2 times (i.e. 10:5) the impact of a term match in the keywords field and 10 times (i.e. 10:1) the impact of a term match in the about field.
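
To see how these weights interact with the per-term formula quoted in the first answer, here is a small sketch in plain JavaScript. It assumes the same term occurs exactly once in a four-token field for each of the three fields, takes `freq` as 1, and ignores the exact-match boost. The helper name `sketchFieldContribution` is hypothetical and not a MongoDB API.

```javascript
// Hypothetical helper: contribution of one matching term in one field,
// following score += weight * freq * coeff from the quoted source snippet.
// freq = 1 is an assumption (the term occurs once in the field).
function sketchFieldContribution(weight, matchCount, numTokens, freq = 1) {
  const coeff = (0.5 * matchCount) / numTokens + 0.5;
  return weight * freq * coeff;
}

// Weights from the createIndex example; "about" uses the default weight of 1.
const weights = { content: 10, keywords: 5, about: 1 };

// Assume one match in a four-token field for every field.
const contributions = Object.fromEntries(
  Object.entries(weights).map(([field, w]) => [field, sketchFieldContribution(w, 1, 4)])
);

console.log(contributions);
console.log("content : keywords =", contributions.content / contributions.keywords);
console.log("content : about    =", contributions.content / contributions.about);
```

Under these assumptions the contributions come out as 6.25, 3.125, and 0.625, which preserves the 10:5:1 ratio. With fields of different lengths the length coefficient would shift these values, so the weights set relative importance rather than fixed scores.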
