Lucene encoding, java

不羁的心 submitted on 2019-12-25 03:16:47

Question


I have some questions about encoding in Lucene (Java).

How does Lucene handle character encoding? What is the default, and how can I set it?

Or does Lucene not care about the encoding, so that it only matters how a string is added to a document in the indexing phase (Java code below) and then how the index is searched?

In other words, do I have to worry about whether the input text is in UTF-8 and whether the queries are also in UTF-8?

Document doc = new Document();
doc.add(new TextField(tagName, object.getName(), Field.Store.YES));

Thanks for any help


Answer 1:


Lucene stores terms in UTF-8 (see Lucene's BytesRef class). Java's String represents text as UTF-16. Lucene's BytesRef therefore provides a constructor that converts UTF-16 to UTF-8, so a Java String can be used without any issues.

For example, the TextField used in your code takes a String as the field value. If you use some other type of Field that takes a byte[], you need to make sure the bytes are UTF-8.

When querying, Lucene will always give you UTF-8 bytes; however, you can convert them to a Java String with a method provided in the same class (utf8ToString()). From that String you can re-encode the text in any other character set you need.
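The conversion described above can be illustrated with the JDK alone. This is a minimal sketch, not Lucene code: the class name Utf8RoundTrip and its two helper methods are hypothetical, and they only mirror the String-to-UTF-8 and UTF-8-to-String steps that BytesRef performs for you.

```java
import java.nio.charset.StandardCharsets;

// Stdlib-only illustration; no Lucene classes are used here.
class Utf8RoundTrip {

    // Encode a Java String (UTF-16 code units) as UTF-8 bytes,
    // the byte form in which Lucene stores terms.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode UTF-8 bytes back into a Java String.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the e, written as an escape
        // so the encoding of this source file does not matter.
        String original = "caf\u00e9";
        byte[] utf8 = toUtf8(original);
        System.out.println(original.length()); // 4 UTF-16 code units
        System.out.println(utf8.length);       // 5 bytes: the accented e takes two bytes in UTF-8
        System.out.println(fromUtf8(utf8).equals(original)); // true
    }
}
```

The point of the sketch is that the String itself carries no "encoding problem": as long as the characters in the String are right, the UTF-8 round trip is lossless.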

You have to take care of character encoding yourself when reading the data: as long as you get the characters right in the Java String, you should be fine. For example, if the data you are indexing comes from an XML file in a different character set, or from a database in a different character set, you have to make sure that these data sources are read properly in the JVM used for indexing.
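As a rough sketch of what "reading the data source properly" can mean, the example below decodes a byte stream with the charset the source actually uses before the text would be handed to Lucene. The class CharsetAwareReader and its readAll method are hypothetical names for illustration, and the ISO-8859-1 input is simulated with a byte array rather than a real file or database.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetAwareReader {

    // Read all text from the stream, decoding it with the given charset.
    // The caller is responsible for passing the charset the source really uses.
    static String readAll(InputStream in, Charset sourceCharset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, sourceCharset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Simulated ISO-8859-1 source: byte 0xE9 is an e with acute accent in that charset.
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9};

        String right = readAll(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1);
        String wrong = readAll(new ByteArrayInputStream(latin1), StandardCharsets.UTF_8);

        System.out.println(right.equals("caf\u00e9")); // true: decoded with the correct charset
        System.out.println(wrong.equals("caf\u00e9")); // false: a lone 0xE9 is malformed UTF-8
    }
}
```

Once the String has been decoded correctly, it can be passed to a field such as TextField as in the question. Decoding with the wrong charset silently corrupts the characters, and that corruption is then indexed as is.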



Source: https://stackoverflow.com/questions/23030329/lucene-encoding-java
