I have a collection that has documents of widely varying amounts of text and it appears that documents with more text get significantly higher textScores. Of course, the mor
Scoring is based on the number of stemmed matches, but there is also a built-in coefficient which adjusts the score for matches relative to total field length (with stopwords removed). If your longer text includes more relevant words to a query, this will add to the score. Longer text which does not match a query will reduce the score.
Snippet from MongoDB 3.2 source code on GitHub (src/mongo/db/fts/fts_spec.cpp):
    for (ScoreHelperMap::const_iterator i = terms.begin(); i != terms.end(); ++i) {
        const string& term = i->first;
        const ScoreHelperStruct& data = i->second;

        // in order to adjust weights as a function of term count as it
        // relates to total field length. ie. is this the only word or
        // a frequently occuring term? or does it only show up once in
        // a long block of text?
        double coeff = (0.5 * data.count / numTokens) + 0.5;

        // if term is identical to the raw form of the
        // field (untokenized) give it a small boost.
        double adjustment = 1;
        if (raw.size() == term.length() && raw.equalCaseInsensitive(term))
            adjustment += 0.1;

        double& score = (*docScores)[term];
        score += (weight * data.freq * coeff * adjustment);
        verify(score <= MAX_WEIGHT);
    }
}
Setting up some test data to see the effect of the length coefficient on a very simple example:
db.articles.insert([
    { headline: "Rock" },
    { headline: "Rocks" },
    { headline: "Rock paper" },
    { headline: "Rock paper scissors" },
])
db.articles.createIndex({ "headline": "text" })
db.articles.find(
    { $text: { $search: "rock" }},
    { _id: 0, headline: 1, score: { $meta: "textScore" }}
).sort({ score: { $meta: "textScore" }})
Annotated results:
// Exact match of raw term to indexed field
// Coefficient is 1, plus 0.1 bonus for identical match of raw term
{
    "headline": "Rock",
    "score": 1.1
}
// Match of stemmed term to indexed field ("rocks" stems to "rock")
// Coefficient is 1
{
    "headline": "Rocks",
    "score": 1
}
// Two terms, one matching
// Coefficient is 0.75: (0.5 * 1 match / 2 terms) + 0.5
{
    "headline": "Rock paper",
    "score": 0.75
}
// Three terms, one matching
// Coefficient is approximately 0.67: (0.5 * 1 match / 3 terms) + 0.5
{
    "headline": "Rock paper scissors",
    "score": 0.6666666666666666
}
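The annotated scores above can be reproduced by hand with the arithmetic from the quoted C++ loop. The following is only a sketch in plain JavaScript, not MongoDB's implementation: `singleMatchScore` is a hypothetical helper, the token lists are written out by hand instead of being produced by MongoDB's tokenizer and stemmer, and `data.count` and `data.freq` are both assumed to be 1 because the search term occurs exactly once in each headline (the snippet does not show how they are accumulated for repeated terms).

```javascript
// Hypothetical helper mirroring the per-term arithmetic of the quoted loop.
// Only handles a term that occurs exactly once in the field.
function singleMatchScore(raw, stemmedTokens, term, weight = 1) {
  const occurrences = stemmedTokens.filter((t) => t === term).length;
  if (occurrences !== 1) {
    throw new Error("this sketch only covers a term that occurs exactly once");
  }
  const numTokens = stemmedTokens.length;
  const count = 1; // data.count for a single occurrence
  const freq = 1; // data.freq, assumed to be 1 for a single occurrence

  // Length coefficient: 1 when the term is the only token, approaching 0.5
  // as the field gets longer.
  const coeff = (0.5 * count) / numTokens + 0.5;

  // Small boost when the untokenized field is identical to the term.
  let adjustment = 1;
  if (raw.length === term.length && raw.toLowerCase() === term.toLowerCase()) {
    adjustment += 0.1;
  }
  return weight * freq * coeff * adjustment;
}

// Hand-written stemmed tokens for the four test headlines (an assumption).
const examples = [
  { headline: "Rock", tokens: ["rock"] },
  { headline: "Rocks", tokens: ["rock"] },
  { headline: "Rock paper", tokens: ["rock", "paper"] },
  { headline: "Rock paper scissors", tokens: ["rock", "paper", "scissor"] },
];

const scores = examples.map((e) => ({
  headline: e.headline,
  score: singleMatchScore(e.headline, e.tokens, "rock"),
}));

console.log(scores);
```

The printed scores should line up with the annotated results: the only thing that changes from one headline to the next is the number of tokens and whether the raw field equals the term.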
MongoDB counts every occurrence of a search term in the document, and the score is built from those counts.
To change how much each field contributes, you can assign weights to the indexed fields. The following example is from the MongoDB docs:
db.blog.createIndex(
    {
        content: "text",
        keywords: "text",
        about: "text"
    },
    {
        weights: {
            content: 10,
            keywords: 5
        },
        name: "TextIndex"
    }
)
The text index has the following fields and weights:
content has a weight of 10, keywords has a weight of 5, and about has the default weight of 1.
These weights denote the relative significance of the indexed fields to each other. For instance, a term match in the content field has 2 times (i.e. 10:5) the impact of a term match in the keywords field and 10 times (i.e. 10:1) the impact of a term match in the about field.
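To make the ratios concrete, here is a rough sketch in plain JavaScript that plugs these weights into the per-term formula from the MongoDB 3.2 source snippet quoted earlier in this thread. It is an illustration under assumptions, not MongoDB's code: `fieldScore` is a hypothetical helper, the field lengths are made up, the term is assumed to occur once per field (so `count` and `freq` are 1), and the 0.1 exact-match boost is left out.

```javascript
// Hypothetical helper: score contribution of one field for a term that
// occurs once in it, using weight * freq * coeff from the quoted loop.
function fieldScore(weight, numTokens) {
  const coeff = (0.5 * 1) / numTokens + 0.5; // count assumed to be 1
  return weight * 1 * coeff; // freq assumed to be 1
}

// Weights from the index definition; "about" uses the default weight of 1.
const weights = { content: 10, keywords: 5, about: 1 };

// Made-up document in which every field has 4 tokens after stop-word removal.
const perField = {};
for (const [field, weight] of Object.entries(weights)) {
  perField[field] = fieldScore(weight, 4);
}

console.log(perField);
console.log(perField.content / perField.keywords); // 10:5
console.log(perField.content / perField.about); // 10:1

// The same match in a much longer content field (200 tokens, also made up)
// scores lower, so the effective ratio against keywords drops below 2.
const longContent = fieldScore(weights.content, 200);
console.log(longContent, longContent / perField.keywords);
```

Under this formula the weight is a plain multiplier, so the 10:5 and 10:1 ratios hold exactly only when the matches are otherwise comparable (same field length, same number of occurrences). Raising the weight on a short field is one way to offset the advantage that long text gets from containing more matching terms.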