As I understand it, the field length of a given document is the number of terms indexed in that field of that document. However, it seems that the field length is never an integer. For instance, I've seen a document with two terms in its content field, but the content field length as calculated by Solr is 2.56, not 2 as I expected. How is field length actually calculated in Solr/Lucene?
I'm referring to the field length as it is used when calculating the score according to the BM25 similarity function, but I believe field lengths are calculated for other ranking schemes as well.
As I see in the code for BM25Similarity:
public final long computeNorm(FieldInvertState state) {
  final int numTerms = discountOverlaps ? state.getLength() - state.getNumOverlap() : state.getLength();
  return encodeNormValue(state.getBoost(), numTerms);
}
where state#getLength() is:
/**
 * Get total number of terms in this field.
 * @return the length
 */
public int getLength() {
  return length;
}
So it actually is an integer. Could you please tell where you see non-integer values? The Solr Admin UI? Where?
Now that you've posted the output, I found the place where it comes from: source
Take a look at this:
private Explanation explainTFNorm(int doc, Explanation freq, BM25Stats stats, NumericDocValues norms) {
  List<Explanation> subs = new ArrayList<>();
  subs.add(freq);
  subs.add(Explanation.match(k1, "parameter k1"));
  if (norms == null) {
    subs.add(Explanation.match(0, "parameter b (norms omitted for field)"));
    return Explanation.match(
        (freq.getValue() * (k1 + 1)) / (freq.getValue() + k1),
        "tfNorm, computed from:", subs);
  } else {
    float doclen = decodeNormValue((byte) norms.get(doc));
    subs.add(Explanation.match(b, "parameter b"));
    subs.add(Explanation.match(stats.avgdl, "avgFieldLength"));
    subs.add(Explanation.match(doclen, "fieldLength"));
    return Explanation.match(
        (freq.getValue() * (k1 + 1)) / (freq.getValue() + k1 * (1 - b + b * doclen / stats.avgdl)),
        "tfNorm, computed from:", subs);
  }
}
So, what they output as the field length is: float doclen = decodeNormValue((byte) norms.get(doc)); where decodeNormValue is:
/**
 * The default implementation returns <code>1 / f<sup>2</sup></code>
 * where <code>f</code> is {@link SmallFloat#byte315ToFloat(byte)}.
 */
protected float decodeNormValue(byte b) {
  return NORM_TABLE[b & 0xFF];
}

/** Cache of decoded bytes. */
private static final float[] NORM_TABLE = new float[256];

static {
  for (int i = 1; i < 256; i++) {
    float f = SmallFloat.byte315ToFloat((byte) i);
    NORM_TABLE[i] = 1.0f / (f * f);
  }
  NORM_TABLE[0] = 1.0f / NORM_TABLE[255]; // otherwise inf
}
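That decoding table is where the 2.56 in the question comes from. The encode side is not shown above; going by my reading of the Lucene 5.x sources (double-check against your version), BM25Similarity's encodeNormValue stores boost / sqrt(fieldLength) as a single byte via SmallFloat.floatToByte315. Round-tripping a two-term field through that byte makes the precision loss visible:

import org.apache.lucene.util.SmallFloat;

public class FieldLengthRoundTrip {
  public static void main(String[] args) {
    int fieldLength = 2;  // two terms in the content field
    float boost = 1.0f;   // no index-time boost

    // Encode: boost / sqrt(fieldLength), squeezed into a single byte
    // (this mirrors BM25Similarity#encodeNormValue in the 5.x sources).
    byte norm = SmallFloat.floatToByte315(boost / (float) Math.sqrt(fieldLength));

    // Decode: 1 / f^2, exactly as in the NORM_TABLE above.
    float f = SmallFloat.byte315ToFloat(norm);
    float doclen = 1.0f / (f * f);

    System.out.println(doclen); // prints 2.56, not 2
  }
}

So the index never stores the real length 2, only the one-byte norm; the explain output reconstructs an approximate length from that byte, which is why you see 2.56.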
In fact, looking at the Wikipedia article on Okapi BM25, this docLen should be:
|D| is the length of the document D in words
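For reference, the tfNorm computed in explainTFNorm above is exactly the document-length normalization from the Wikipedia BM25 formula, with the decoded doclen standing in for |D|:

tfNorm = f(q, D) * (k1 + 1) / (f(q, D) + k1 * (1 - b + b * |D| / avgdl))

where f(q, D) is the frequency of term q in document D, avgdl is the average field length over the index, and k1 and b are the BM25 parameters. The only difference from the textbook formula is that |D| here is the lossily decoded field length rather than the exact word count.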
Elaborating on the previous answer: "fieldLength" is calculated via a mathematical normalization (encoding/decoding) step in the class SmallFloat.java, which basically compresses 32-bit integers to 8 bits to save disk space while storing the data.
Here is the documentation describing the encoding/decoding behind decodeNormValue(), which produces the fieldLength in BM25:
Default scoring implementation which {@link #encodeNormValue(float) encodes} norm values as a single byte before being stored. At search time, the norm byte value is read from the index {@link org.apache.lucene.store.Directory directory} and {@link #decodeNormValue(long) decoded} back to a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss - it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.875.
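You can reproduce that documented precision loss directly with SmallFloat (a small sketch; floatToByte315 / byte315ToFloat are the encode/decode pair used for norms in these Lucene versions):

import org.apache.lucene.util.SmallFloat;

public class NormPrecisionLoss {
  public static void main(String[] args) {
    // Round-trip 0.89 through the single-byte norm encoding.
    byte encoded = SmallFloat.floatToByte315(0.89f);
    float decoded = SmallFloat.byte315ToFloat(encoded);

    System.out.println(decoded); // prints 0.875: decode(encode(0.89)) != 0.89
  }
}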
Hope this helps.