How to get frequently occurring phrases with Lucene

难免孤独 2020-12-16 05:54

I would like to get some frequently occurring phrases with Lucene. I am getting some information from TXT files, and I am losing a lot of context for not having information

3 answers
  •  眼角桃花
    2020-12-16 06:38

    Is it possible for you to post any code that you have written?

    Basically, a lot depends on how you create your fields and store documents in Lucene.

    Let's consider a case with two fields, ID and Comments. The ID field holds values like 'finding nemo', i.e. strings containing spaces, whereas Comments is a free-text field: it accepts anything I can type and Lucene can process.

    In a real-life scenario it makes no sense to split ID:'finding nemo' into two separately searchable strings, whereas I do want every word in Comments to be indexed.

    So I create a Document (org.apache.lucene.document.Document) object to take care of this, something like:

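    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    // Lucene 3.x field API; from Lucene 4 on, TextField and StringField play these two roles.
    Document doc = new Document();
    doc.add(new Field("comments", "Finding nemo was a very tough job for a clown fish ...", Field.Store.YES, Field.Index.ANALYZED)); // tokenized
    doc.add(new Field("id", "finding nemo", Field.Store.YES, Field.Index.NOT_ANALYZED)); // indexed as one term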
    

    So, essentially, I have created two fields:

    1. comments: analyzed (tokenized) at index time, via Field.Index.ANALYZED
    2. id: stored and indexed as a single term without analysis, via Field.Index.NOT_ANALYZED
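
    To make the difference concrete, here is a rough plain-Java sketch of which terms end up in the index for each mode. It is not Lucene code: the class FieldModes and its methods are made up for illustration, and a real analyzer does more (stop-word removal, for instance).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Illustration only: a hypothetical model of the two indexing modes, not the Lucene API.
class FieldModes {
    // Analyzed: the value is lowercased and split into individual word terms.
    static List<String> analyzed(String value) {
        List<String> terms = new ArrayList<>();
        for (String token : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Not analyzed: the whole value is kept as one exact term.
    static List<String> notAnalyzed(String value) {
        return Collections.singletonList(value);
    }
}
```

    With this model, a search for "nemo" can match the comments field, while the id field only matches the exact string "finding nemo".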

    This is how you control analysis per field with Lucene's default tokenizer and analyzer. Otherwise, you can write your own tokenizer and analyzer.
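
    For frequently occurring phrases, one idea a custom tokenizer could follow is to emit word n-grams instead of single words and count them. Below is a minimal plain-Java sketch of that idea. PhraseCounter and countPhrases are hypothetical names, this is not a Lucene Tokenizer subclass, and it naively lets phrases span sentence boundaries. Lucene itself ships a ShingleFilter for producing word n-grams, which may be the better starting point inside a real analyzer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustration only: counts word n-grams ("phrases") in a piece of text.
class PhraseCounter {
    static Map<String, Integer> countPhrases(String text, int n) {
        // Lowercase and split on anything that is not a letter or digit.
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        // Slide a window of n words over the text and tally each phrase.
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= words.size(); i++) {
            String phrase = String.join(" ", words.subList(i, i + n));
            counts.merge(phrase, 1, Integer::sum);
        }
        return counts;
    }
}
```

    Calling PhraseCounter.countPhrases(text, 2) on the contents of each TXT file and summing the maps would give phrase frequencies across the whole collection; sorting by count then surfaces the most frequent ones.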

    Link(s) http://darksleep.com/lucene/

    Hope this will help you... :)
