Python: Whoosh seems to return incorrect results

Posted by 我的梦境 on 2019-12-11 03:22:51

Question


This code is straight from Whoosh's quickstart docs:

import os.path
from whoosh.index import create_in
from whoosh.fields import Schema, STORED, ID, KEYWORD, TEXT
from whoosh.index import open_dir
from whoosh.query import *
from whoosh.qparser import QueryParser

#establish schema to be used in the index
schema = Schema(title=TEXT(stored=True), content=TEXT,
                path=ID(stored=True), tags=KEYWORD, icon=STORED)

#create index directory
if not os.path.exists("index"):
    os.mkdir("index")

#create the index using the schema specified above
ix = create_in("index", schema)

#instantiate the writer object
writer = ix.writer()

#add the docs to the index
writer.add_document(title=u"My document", content=u"This is my document!",
                    path=u"/a", tags=u"first short", icon=u"/icons/star.png")
writer.add_document(title=u"Second try", content=u"This is the second example.",
                    path=u"/b", tags=u"second short", icon=u"/icons/sheep.png")
writer.add_document(title=u"Third time's the charm", content=u"Examples are many.",
                    path=u"/c", tags=u"short", icon=u"/icons/book.png")

#commit those changes
writer.commit()

#identify searcher
with ix.searcher() as searcher:

    #specify parser
    parser = QueryParser("content", ix.schema)

    #specify query -- try also "second"
    myquery = parser.parse("is")

    #search for results
    results = searcher.search(myquery)

    #identify the number of matching documents
    print(len(results))

I have merely passed a value--namely, the verb "is"--to the parser.parse() call. When I run this, however, I get results of length zero, rather than the expected results of length two. If I replace "is" with "second", I get one result, as expected. Why doesn't the search using "is" yield a match, though?

Edit

As @Philippe points out, the default Whoosh analyzer for TEXT fields removes stop words such as "is", hence the behavior described above. To retain stop words, specify the analyzer for a given field when defining the schema, and pass that analyzer a parameter that disables stop-word removal; e.g.:

from whoosh import analysis
schema = Schema(title=TEXT(stored=True, analyzer=analysis.StandardAnalyzer(stoplist=None)))

Answer 1:


A stop word filter is applied by the default text analyzer: https://bitbucket.org/mchaput/whoosh/src/999cd5fb0d110ca955fab8377d358e98ba426527/src/whoosh/analysis/filters.py?at=default#cl-41

See also the doc: http://whoosh.readthedocs.org/en/latest/api/analysis.html#whoosh.analysis.StopFilter



Source: https://stackoverflow.com/questions/25087290/python-whoosh-seems-to-return-incorrect-results
