How is field length defined in Solr/Lucene?

Submitted anonymously (unverified) on 2019-12-03 01:33:01

Question:

As I understand it, the field length of a given document is the number of terms indexed in that field of the document. However, the field length never seems to be an integer. For instance, I've seen a document with two terms in its content field, but the content field length as reported by Solr is actually 2.56, not 2 as I expected. How is the field length really calculated in Solr/Lucene?

I'm referring to the field length as it is used when calculating the score under the BM25 similarity function, but I believe field lengths are calculated for other ranking schemes as well.

Answer 1:

Looking at the code of BM25Similarity:

```java
public final long computeNorm(FieldInvertState state) {
  final int numTerms = discountOverlaps ? state.getLength() - state.getNumOverlap() : state.getLength();
  return encodeNormValue(state.getBoost(), numTerms);
}
```

where state#getLength() is:

```java
/**
 * Get total number of terms in this field.
 * @return the length
 */
public int getLength() {
  return length;
}
```

So internally it is an integer. Could you tell where you are seeing non-integer values? The Solr Admin UI? Somewhere else?

Now that you have posted the output, I found the place it comes from: source

Take a look at this:

```java
private Explanation explainTFNorm(int doc, Explanation freq, BM25Stats stats, NumericDocValues norms) {
  List<Explanation> subs = new ArrayList<>();
  subs.add(freq);
  subs.add(Explanation.match(k1, "parameter k1"));
  if (norms == null) {
    subs.add(Explanation.match(0, "parameter b (norms omitted for field)"));
    return Explanation.match(
        (freq.getValue() * (k1 + 1)) / (freq.getValue() + k1),
        "tfNorm, computed from:", subs);
  } else {
    float doclen = decodeNormValue((byte) norms.get(doc));
    subs.add(Explanation.match(b, "parameter b"));
    subs.add(Explanation.match(stats.avgdl, "avgFieldLength"));
    subs.add(Explanation.match(doclen, "fieldLength"));
    return Explanation.match(
        (freq.getValue() * (k1 + 1)) / (freq.getValue() + k1 * (1 - b + b * doclen / stats.avgdl)),
        "tfNorm, computed from:", subs);
  }
}
```
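For reference, the tfNorm expression in the else branch above can be evaluated by hand. A minimal sketch using Lucene's default k1 = 1.2 and b = 0.75 (the freq and avgdl values here are hypothetical, chosen only for illustration):

```java
public class TfNormDemo {
  // Same expression as the else branch of explainTFNorm above
  static float tfNorm(float freq, float k1, float b, float doclen, float avgdl) {
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * doclen / avgdl));
  }

  public static void main(String[] args) {
    float k1 = 1.2f, b = 0.75f; // Lucene's BM25 defaults
    float freq = 1.0f;          // hypothetical term frequency in the document
    float doclen = 2.56f;       // decoded fieldLength, as in the question
    float avgdl = 4.0f;         // hypothetical average field length
    // Fields shorter than average get tfNorm > freq, longer ones get less
    System.out.println(tfNorm(freq, k1, b, doclen, avgdl)); // ~1.1727
  }
}
```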

So the field length they output is the decoded norm: float doclen = decodeNormValue((byte) norms.get(doc));

```java
/** The default implementation returns <code>1 / f<sup>2</sup></code>
 * where <code>f</code> is {@link SmallFloat#byte315ToFloat(byte)}. */
protected float decodeNormValue(byte b) {
  return NORM_TABLE[b & 0xFF];
}

/** Cache of decoded bytes. */
private static final float[] NORM_TABLE = new float[256];

static {
  for (int i = 1; i < 256; i++) {
    float f = SmallFloat.byte315ToFloat((byte) i);
    NORM_TABLE[i] = 1.0f / (f * f);
  }
  NORM_TABLE[0] = 1.0f / NORM_TABLE[255]; // otherwise inf
}
```
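Putting the pieces together: encodeNormValue stores boost / sqrt(numTerms) as a single byte via SmallFloat.floatToByte315, and decoding inverts it with 1 / f². A standalone sketch (reimplementing the SmallFloat 3-mantissa-bit conversion here so it compiles outside Lucene's class hierarchy) shows exactly why a 2-term field comes back as 2.56:

```java
// Sketch of Lucene's norm round trip for BM25: a field length of 2 is
// compressed to one byte and decodes back to 2.56, the value in the question.
public class NormRoundTrip {

  // Mirrors SmallFloat.floatToByte315: 8-bit float, 3 mantissa bits, zero-exponent 15
  static byte floatToByte315(float f) {
    int bits = Float.floatToRawIntBits(f);
    int smallfloat = bits >> (24 - 3);
    if (smallfloat <= ((63 - 15) << 3)) {
      return (bits <= 0) ? (byte) 0 : (byte) 1;
    }
    if (smallfloat >= ((63 - 15) << 3) + 0x100) {
      return -1;
    }
    return (byte) (smallfloat - ((63 - 15) << 3));
  }

  // Mirrors SmallFloat.byte315ToFloat: inverse of the above
  static float byte315ToFloat(byte b) {
    if (b == 0) return 0.0f;
    int bits = (b & 0xff) << (24 - 3);
    bits += (63 - 15) << 24;
    return Float.intBitsToFloat(bits);
  }

  // BM25Similarity stores boost / sqrt(fieldLength) in the norm byte
  static byte encodeNormValue(float boost, int fieldLength) {
    return floatToByte315(boost / (float) Math.sqrt(fieldLength));
  }

  // ... and recovers an approximate field length as 1 / f^2
  static float decodeNormValue(byte b) {
    float f = byte315ToFloat(b);
    return 1.0f / (f * f);
  }

  public static void main(String[] args) {
    // 1/sqrt(2) ~ 0.7071 is quantized down to 0.625 by the byte encoding,
    // and 1 / 0.625^2 = 2.56 -- not the original 2.
    System.out.println(decodeNormValue(encodeNormValue(1.0f, 2))); // prints 2.56
  }
}
```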

In fact, looking at the Wikipedia definition of BM25, this doclen should be:

|D|, the length of the document D in words



Answer 2:

Elaborating on the previous answer: "fieldLength" is calculated via a lossy normalization (encoding/decoding) scheme — essentially compressing a 32-bit float into 8 bits to save disk space when storing the norm — implemented in the class SmallFloat.java.

This is the description of the decodeNormValue() function, which recovers the fieldLength used in BM25:

Default scoring implementation which {@link encodeNormValue(float) encodes} norm values as a single byte before being stored. At search time, the norm byte value is read from the index {@link org.apache.lucene.store.Directory directory} and {@link decodeNormValue(long) decoded} back to a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss - it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.875
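The quoted precision loss can be checked directly with the two SmallFloat conversions. A small standalone sketch (the real methods live in org.apache.lucene.util.SmallFloat; they are reimplemented here so the example compiles without Lucene on the classpath):

```java
public class SmallFloatDemo {
  // Mirrors SmallFloat.floatToByte315: 8-bit float with 3 mantissa bits
  static byte floatToByte315(float f) {
    int bits = Float.floatToRawIntBits(f);
    int smallfloat = bits >> (24 - 3);
    if (smallfloat <= ((63 - 15) << 3)) return (bits <= 0) ? (byte) 0 : (byte) 1;
    if (smallfloat >= ((63 - 15) << 3) + 0x100) return -1;
    return (byte) (smallfloat - ((63 - 15) << 3));
  }

  // Mirrors SmallFloat.byte315ToFloat
  static float byte315ToFloat(byte b) {
    if (b == 0) return 0.0f;
    int bits = (b & 0xff) << (24 - 3);
    bits += (63 - 15) << 24;
    return Float.intBitsToFloat(bits);
  }

  public static void main(String[] args) {
    // decode(encode(x)) != x in general: only 3 mantissa bits survive
    System.out.println(byte315ToFloat(floatToByte315(0.89f))); // prints 0.875
  }
}
```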

Hope this helps.


