Hibernate Search: Search any part of the field without losing field's content while indexing

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-13 03:13:05

Question


I would like to be able to find an entity based on any part of its indexed fields, and the fields must not lose any content during indexing.

Let's say I have the following sample entity class:

@Entity
public class E {
    private String f;
    // ...
}

And if the value of f in one entity is "This is a nice field!", I would like to be able to find it by any of these queries:

  • "this"
  • "a"
  • "IC"
  • "!"
  • "This is a nice field!"

The most obvious approach is to annotate the entity this way:

@Entity
@Indexed
@AnalyzerDef(name = "a",
        tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),
        filters = @TokenFilterDef(factory = LowerCaseFilterFactory.class)
)
@Analyzer(definition = "a")
public class E {
    @Field
    private String f;
    // ...
}

And then search the following way:

String queryString;
// ...
org.apache.lucene.search.Query query = queryBuilder
        .keyword()
        .wildcard()
        .onField("f")
        .matching("*" + queryString.toLowerCase() + "*")
        .createQuery();

However, the documentation recommends that, for performance reasons, a wildcard query should not start with either ? or *.

So, as I understand it, this approach is inefficient.

The other idea is to use n-grams like this:

@Entity
@Indexed
@AnalyzerDef(name = "a",
        tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),
        filters = {
                @TokenFilterDef(factory = LowerCaseFilterFactory.class),
                @TokenFilterDef(factory = NGramFilterFactory.class,
                        params = {
                                @Parameter(name = "minGramSize", value = "1"),
                                @Parameter(name = "maxGramSize", value = E.MAX_LENGTH)
                        })
        }
)
@Analyzer(definition = "a")
public class E {
    static final String MAX_LENGTH = "42";
    @Field
    private String f;
    // ...
}

And create queries this way:

String queryString;
// ...
org.apache.lucene.search.Query query = queryBuilder
                .keyword()
                .onField("f")
                .ignoreAnalyzer()
                .matching(queryString.toLowerCase())
                .createQuery();

This time no wildcard queries are used and the analyzer is ignored in the query. I'm not sure whether ignoring the analyzer is good or bad, but it works with the analyzer ignored.
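To make it clearer why this second approach finds arbitrary substrings without wildcards, here is a minimal plain-Java sketch of what a keyword tokenizer followed by lowercase and n-gram filters roughly produces. It does not use Lucene: KeywordNGramSketch and grams are hypothetical names, and the order of the grams may differ from what NGramFilterFactory emits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration only; this is not Lucene's NGramFilterFactory.
final class KeywordNGramSketch {

    // Returns every substring of the lowercased input whose length lies in [minGram, maxGram].
    static List<String> grams(String value, int minGram, int maxGram) {
        // Keyword tokenizer: the whole value is a single token.
        String token = value.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        for (int start = 0; start < token.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= token.length(); len++) {
                out.add(token.substring(start, start + len));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> indexed = grams("This is a nice field!", 1, 42);
        // Each query is lowercased by hand, as in the question, then looked up as an exact term.
        for (String q : new String[] {"this", "a", "IC", "!", "This is a nice field!"}) {
            System.out.println(q + " -> " + indexed.contains(q.toLowerCase(Locale.ROOT)));
        }
        System.out.println(indexed.size() + " grams indexed");
    }
}
```

Every substring is already present in the index as its own term, so the query becomes an exact term lookup. The cost is index size: the 21-character sample value alone yields 231 grams in this sketch.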

Another possible solution would be to use WhitespaceTokenizerFactory instead of KeywordTokenizerFactory when using n-grams, then split queryString on spaces and combine the searches for each substring using MUST. With this approach, as I understand it, far fewer n-grams are built when the string in f has length E.MAX_LENGTH, which should be good for performance. I would also be able to find the entity described above with a query such as "hi ield", which would be ideal.

So what would be the best way to deal with my problem? Or are all my ideas bad?

P.S. Should one ignore the analyzer in queries when using n-grams?


Answer 1:


> Another possible solution would be to use WhitespaceTokenizerFactory instead of KeywordTokenizerFactory when using n-grams, then split queryString on spaces and combine the searches for each substring using MUST. With this approach, as I understand it, far fewer n-grams are built when the string in f has length E.MAX_LENGTH, which should be good for performance. I would also be able to find the entity described above with a query such as "hi ield", which would be ideal.

This is more or less the ideal solution, except for one thing: you shouldn't ignore the analyzer when querying. What you should do is define another analyzer without the ngram filter, but with the tokenizer, lowercase filter, etc., and explicitly instruct Hibernate Search to use that analyzer at query time.
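The idea of two analyzers can be illustrated with a small plain-Java sketch. It is only a simulation under stated assumptions: QueryAnalyzerSketch, queryAnalyze, indexAnalyze and matches are hypothetical names, whitespace splitting stands in for WhitespaceTokenizerFactory, and the containsAll check stands in for combining per-token queries with MUST.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java simulation; no Hibernate Search or Lucene classes are involved.
final class QueryAnalyzerSketch {

    // Query-time analysis: split on whitespace, then lowercase. No n-gram step.
    static List<String> queryAnalyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Index-time analysis: same tokens, each additionally expanded into its n-grams.
    static Set<String> indexAnalyze(String text, int minGram, int maxGram) {
        Set<String> terms = new HashSet<>();
        for (String token : queryAnalyze(text)) {
            for (int start = 0; start < token.length(); start++) {
                for (int len = minGram; len <= maxGram && start + len <= token.length(); len++) {
                    terms.add(token.substring(start, start + len));
                }
            }
        }
        return terms;
    }

    // MUST semantics: every analyzed query token has to be an indexed term.
    static boolean matches(Set<String> indexedTerms, String query) {
        List<String> tokens = queryAnalyze(query);
        return !tokens.isEmpty() && indexedTerms.containsAll(tokens);
    }
}
```

With the sample value "This is a nice field!" indexed with gram sizes 1 to 42, matches returns true for "hi ield" and for "IC", because the query-side analysis lowercases and splits the input before the lookup. Looking up the raw string "IC" as a term would find nothing, which is why skipping analysis forces the caller to repeat the normalization by hand. Note that in this sketch neither word order nor adjacency is checked.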

The other solutions are too expensive, either in I/O and CPU at query time (first solution) or in storage space (second solution). Note that this third solution may still be rather expensive in storage space, depending on the value of E.MAX_LENGTH. It's generally recommended to only have a difference of one or two between minGramSize and maxGramSize, to avoid the indexing of too many grams.
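As a rough way to compare the storage cost, the following plain-Java sketch counts how many grams each configuration emits for a given value. GramCountSketch and its methods are hypothetical names, and the counts are for this simplified model (every substring of an allowed length, duplicates included), not measurements of a real index.

```java
// Plain-Java arithmetic only; it models gram counts, not actual index size.
final class GramCountSketch {

    // Number of n-grams emitted for one token of the given length.
    static int gramsForToken(int tokenLength, int minGram, int maxGram) {
        int count = 0;
        for (int len = minGram; len <= Math.min(maxGram, tokenLength); len++) {
            count += tokenLength - len + 1;
        }
        return count;
    }

    // Total n-grams when the text is first split on whitespace.
    static int gramsForWords(String text, int minGram, int maxGram) {
        int total = 0;
        for (String word : text.trim().split("\\s+")) {
            total += gramsForToken(word.length(), minGram, maxGram);
        }
        return total;
    }

    public static void main(String[] args) {
        String value = "This is a nice field!";
        System.out.println("keyword, 1..42:    " + gramsForToken(value.length(), 1, 42));
        System.out.println("whitespace, 1..42: " + gramsForWords(value, 1, 42));
        System.out.println("whitespace, 1..3:  " + gramsForWords(value, 1, 3));
    }
}
```

For the sample value this gives 231 grams with a keyword tokenizer, 45 with a whitespace tokenizer, and 37 with a whitespace tokenizer and gram sizes 1 to 3. A single 42-character token with gram sizes 1 to 42 would yield 903 grams. The trade-off of a narrow range, as far as I understand it, is that a query token longer than maxGramSize has no matching term in the index.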

Just define another analyzer, name it something like "ngram_query", and when you need to build the query, create the query builder like this:

QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory()
        .buildQueryBuilder()
        .forEntity(E.class)
        .overridesForField("f" /* name of the field */, "ngram_query")
        .get();

Then create your query as usual.
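For reference, a mapping with both analyzers might look like the sketch below. It assumes Hibernate Search 5.x with the Lucene 5 analysis factories on the classpath. The name "ngram_index" is made up for illustration, "ngram_query" is the name suggested above, and the gram sizes are copied from the question rather than being a recommendation.

```java
import javax.persistence.Entity;

import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.core.WhitespaceTokenizerFactory;
import org.apache.lucene.analysis.ngram.NGramFilterFactory;
import org.hibernate.search.annotations.Analyzer;
import org.hibernate.search.annotations.AnalyzerDef;
import org.hibernate.search.annotations.AnalyzerDefs;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.annotations.Parameter;
import org.hibernate.search.annotations.TokenFilterDef;
import org.hibernate.search.annotations.TokenizerDef;

@Entity
@Indexed
@AnalyzerDefs({
        // Index-time analyzer (hypothetical name): whitespace tokens, lowercased, expanded into n-grams.
        @AnalyzerDef(name = "ngram_index",
                tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
                filters = {
                        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
                        @TokenFilterDef(factory = NGramFilterFactory.class,
                                params = {
                                        @Parameter(name = "minGramSize", value = "1"),
                                        @Parameter(name = "maxGramSize", value = E.MAX_LENGTH)
                                })
                }),
        // Query-time analyzer: same tokenizer and lowercase filter, but no n-gram filter.
        @AnalyzerDef(name = "ngram_query",
                tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
                filters = @TokenFilterDef(factory = LowerCaseFilterFactory.class))
})
public class E {
    static final String MAX_LENGTH = "42";

    // The field is indexed with the n-gram analyzer; queries override it with "ngram_query".
    @Field(analyzer = @Analyzer(definition = "ngram_index"))
    private String f;
    // ...
}
```

With this mapping, a query builder created with overridesForField("f", "ngram_query") analyzes the user's input with the whitespace tokenizer and lowercase filter only, so calls such as ignoreAnalyzer() and a manual toLowerCase() should no longer be needed.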

Note that, if you rely on Hibernate Search to push the index schema and analyzers to Elasticsearch, you will have to use a hack in order for the query-only analyzer to be pushed: by default only the analyzers that are actually used during indexing are pushed. See https://discourse.hibernate.org/t/cannot-find-the-overridden-analyzer-when-using-overridesforfield/1043/4



Source: https://stackoverflow.com/questions/56083137/hibernate-search-search-any-part-of-the-field-without-losing-fields-content-wh
