How does Lucene index documents?

予麋鹿 2020-12-04 05:10

I have read some documentation about Lucene, including the document at this link (http://lucene.sourceforge.net/talks/pisa).

I don't really understand how Lucene indexes documents.

4 answers

旧时难觅i (original poster)

2020-12-04 05:52

There's a fairly good article here: https://web.archive.org/web/20130904073403/http://www.ibm.com/developerworks/library/wa-lucene/

Edit 12/2014: Updated to an archived version because the original was deleted. Probably the best more recent alternative is http://lucene.apache.org/core/3_6_2/fileformats.html

There's an even more recent version at http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/codecs/lucene410/package-summary.html#package_description, but it seems to have less information in it than the older one.

In a nutshell, when Lucene indexes a document, it breaks it down into a number of terms. It then stores the terms in an index file, where each term is associated with the documents that contain it. This structure is known as an inverted index; you could think of it as a bit like a hashtable keyed by term.
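
To make the term-to-documents idea concrete, here is a minimal sketch in Python. It is not how Lucene stores its index on disk; the function name `build_inverted_index`, the sample documents, and the naive lowercase-and-split term extraction are all assumptions made for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive term extraction (an assumption): lowercase, split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Lucene indexes documents",
    2: "documents contain terms",
}
index = build_inverted_index(docs)
print(index["documents"])  # [1, 2]
print(index["lucene"])     # [1]
```

Looking up a term is then a single dictionary access, which is where the hashtable comparison comes from.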

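Terms are generated using an analyzer, which splits the text into tokens and normalizes them, for example by lowercasing and, depending on the analyzer, stemming each word to its root form. The most popular stemming algorithm for the English language is the Porter stemming algorithm: http://tartarus.org/~martin/PorterStemmer/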
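
As a rough illustration of what an analyzer does, the sketch below tokenizes, lowercases, drops a few stop words, and strips common suffixes. The suffix stripping is a toy stand-in and is not the Porter algorithm; the names `analyze`, `crude_stem`, and `STOP_WORDS` are hypothetical and do not correspond to Lucene classes.

```python
import re

# Tiny, hypothetical stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "of", "and"}

def crude_stem(token):
    """Toy suffix stripping; NOT the Porter stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        # Only strip if at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Turn raw text into a list of normalized terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(analyze("The indexing of documents"))  # ['index', 'document']
```

A real stemmer handles far more cases than this; the point is only that "indexing" and "indexes" end up as the same term.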

When a query is issued, it is processed through the same analyzer that was used to build the index, and the resulting terms are looked up in the index. That provides a list of documents that match the query.
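
The query side can be sketched the same way. This is a simplified model, not Lucene's API: `analyze`, `build_index`, and `search` are hypothetical names, the analyzer is a deliberately crude stand-in, and requiring every query term to match (AND semantics) is an assumption.

```python
from collections import defaultdict

def analyze(text):
    # Stand-in analyzer (an assumption): lowercase, split on whitespace,
    # and strip a trailing "s" from words longer than three characters.
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t
            for t in text.lower().split()]

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    # Use the same analyzer as at index time so query terms line up
    # with the terms stored in the index.
    terms = analyze(query)
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())  # AND semantics (an assumption)
    return sorted(result)

# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Lucene indexes documents",
    2: "Documents contain terms",
    3: "Lucene analyzers",
}
index = build_index(docs)
print(search(index, "DOCUMENT"))          # [1, 2]
print(search(index, "lucene documents"))  # [1]
```

If the query skipped the analyzer, "DOCUMENT" would never match the stored term "document", which is why both sides need the same processing.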
