I would like to get some frequently occurring phrases with Lucene. I am getting some information from TXT files, and I am losing a lot of context for not having information
Is it possible for you to post any code that you have written?
Basically, a lot depends on how you create your fields and store documents in Lucene.
Let's consider a case where I have two fields, ID and Comments. In the ID field I allow values like 'finding nemo', i.e. strings containing spaces, whereas Comments is a free-text field, i.e. I allow anything my keyboard can produce and Lucene can understand.
In a real-life scenario it does not make sense to split ID:'finding nemo' into two separately searchable strings, whereas I do want every word in Comments to be indexed.
So I will create a document (org.apache.lucene.document.Document) object to take care of this, something like this:
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
Document doc = new Document();
doc.add(new Field("comments", "Finding nemo was a very tough job for a clown fish ...", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("id", "finding nemo", Field.Store.YES, Field.Index.NOT_ANALYZED));
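So, essentially, I have created two fields:
comments: indexed with Field.Index.ANALYZED, so the text is split into individual searchable terms
id: indexed with Field.Index.NOT_ANALYZED, so the whole value is kept as a single term
This is how you customize Lucene with the default tokenizer and analyzer. Otherwise, you can write your own tokenizers and analyzers.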
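To make the difference between the two index modes concrete, here is a minimal plain-Java sketch of what ends up as searchable terms in each case. It does not use Lucene at all: the class FieldIndexSketch and the methods analyzed and notAnalyzed are hypothetical names made up for illustration, and the lowercase-and-split step is only a rough stand-in for a real analyzer (which may also drop stop words such as "a" or "was").

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; this is not Lucene's API.
class FieldIndexSketch {

    // Rough stand-in for an analyzed field: lowercase the value and split it
    // on anything that is not a letter or digit, so each word is its own term.
    static List<String> analyzed(String value) {
        List<String> terms = new ArrayList<>();
        for (String token : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Rough stand-in for a not-analyzed field: the whole value, spaces
    // included, is kept as exactly one term.
    static List<String> notAnalyzed(String value) {
        return List.of(value);
    }

    public static void main(String[] args) {
        // Terms for the free-text "comments" field.
        System.out.println(analyzed("Finding nemo was a very tough job for a clown fish ..."));
        // Single term for the "id" field.
        System.out.println(notAnalyzed("finding nemo"));
    }
}
```

With this picture in mind, a search for the single word "nemo" can match the comments field, while the id field only matches the exact value 'finding nemo'.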
Link(s) http://darksleep.com/lucene/
Hope this will help you... :)