Build Lucene Synonyms

Submitted by 为君一笑 on 2020-01-06 18:12:07

Question


I have the following code:

static class TaggerAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String s, Reader reader) {

        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        builder.add(new CharsRef("al"), new CharsRef("americanleague"), true);
        builder.add(new CharsRef("al"), new CharsRef("a.l."), true);
        builder.add(new CharsRef("nba"), new CharsRef("national" + SynonymMap.WORD_SEPARATOR + "basketball" + SynonymMap.WORD_SEPARATOR + "association"), true);

        SynonymMap mySynonymMap = null;
        try {
            mySynonymMap = builder.build();
        } catch (IOException e) {
            e.printStackTrace();
        }

        Tokenizer source = new ClassicTokenizer(Version.LUCENE_40, reader);
        TokenStream filter = new StandardFilter(Version.LUCENE_40, source);
        filter = new LowerCaseFilter(Version.LUCENE_40, filter);
        filter = new SynonymFilter(filter, mySynonymMap, true);
        return new TokenStreamComponents(source, filter);
    }
}

I'm running some tests, and everything worked until I hit this scenario:

    String title = "Very short title at a.l. bla bla";

    Assert.assertTrue(TagUtil.evaluate(memoryIndex,"americanleague"));
    Assert.assertTrue(TagUtil.evaluate(memoryIndex,"al"));

I expected both assertions to pass, but "americanleague" did not match "a.l.", even though "a.l." and "americanleague" are both synonyms of "al".

So what should I do? I don't want to add every combination to the map. Thanks.


Answer 1:


I believe you have your arguments to builder.add backwards. Try:

builder.add(new CharsRef("americanleague"), new CharsRef("al"), true);
builder.add(new CharsRef("a.l."), new CharsRef("al"), true);
builder.add(new CharsRef("national" + SynonymMap.WORD_SEPARATOR + "basketball" + SynonymMap.WORD_SEPARATOR + "association"), new CharsRef("nba"), true);

The SynonymFilter just maps from the first arg (input) to the second arg (output), rather than the other way around. So you have rules to translate "al" to two different synonyms, but none that do anything to inputs of "a.l." or "americanleague".
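To make the direction of the rules concrete, here is a minimal plain-Java sketch of a one-way synonym lookup. It is a toy model, not Lucene's implementation: the class SynonymDirectionSketch and its methods add, expand and matches are hypothetical names made up for illustration, the original token is always kept (as with includeOrig = true), and it assumes the query term is run through the same expansion as the indexed title.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of one-directional synonym rules (hypothetical, NOT the Lucene API).
final class SynonymDirectionSketch {
    // input term -> output terms; lookups only ever go in this direction
    private final Map<String, List<String>> rules = new HashMap<>();

    // Registers a rule "input -> output", mirroring the argument order of builder.add.
    void add(String input, String output) {
        rules.computeIfAbsent(input, k -> new ArrayList<>()).add(output);
    }

    // Keeps every original token and appends the outputs mapped from it.
    Set<String> expand(String... tokens) {
        Set<String> out = new LinkedHashSet<>();
        for (String token : tokens) {
            out.add(token);
            out.addAll(rules.getOrDefault(token, Collections.emptyList()));
        }
        return out;
    }

    // A query matches if any of its expanded terms was indexed.
    static boolean matches(Set<String> indexed, Set<String> query) {
        for (String term : query) {
            if (indexed.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] title = {"very", "short", "title", "at", "a.l.", "bla", "bla"};

        // Rules as written in the question: "al" is the input.
        SynonymDirectionSketch asked = new SynonymDirectionSketch();
        asked.add("al", "americanleague");
        asked.add("al", "a.l.");

        // Rules as suggested in the answer: "al" is the output.
        SynonymDirectionSketch reversed = new SynonymDirectionSketch();
        reversed.add("americanleague", "al");
        reversed.add("a.l.", "al");

        for (String query : new String[] {"americanleague", "al"}) {
            System.out.println(query + " with question's rules: "
                    + matches(asked.expand(title), asked.expand(query)));
            System.out.println(query + " with reversed rules: "
                    + matches(reversed.expand(title), reversed.expand(query)));
        }
    }
}
```

In this model, the question's rules never touch the "a.l." token in the title, so a query for "americanleague" has nothing to match against, while the reversed rules funnel both spellings into the shared term "al".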



Source: https://stackoverflow.com/questions/22078669/build-lucene-synonyms
