What is the best approach to implementing search over documents (PDF, XML, HTML, MS Word)?

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-23 02:32:35

Question


What would be a good way to implement document search in a Java web application?

Is 'tagged search' a good fit for this kind of search functionality?


Answer 1:


Why re-invent the wheel?

Check out Apache Lucene.

Also, search Stack Overflow for "full text search" and you'll find a lot of other very similar questions. Here's another one, for example: How do I implement Search Functionality in a website?
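
To make concrete what a full-text library such as Lucene does for you, here is a toy sketch of an inverted index in plain Java. It is not Lucene's API: the class name TinyInvertedIndex and its methods are made up for illustration, and it leaves out everything a real engine adds (ranking, stemming, phrase queries, on-disk storage).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: maps each term to the IDs of the documents containing it.
// Illustrative only; this is not Lucene's API.
class TinyInvertedIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Split text into lowercase terms on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Record every term of the document under the given ID.
    void add(String docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Return the sorted IDs of documents that contain all query terms (AND semantics).
    Set<String> search(String query) {
        Set<String> result = null;
        for (String term : tokenize(query)) {
            Set<String> docs = postings.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("report.pdf", "Quarterly sales report for the Java team");
        index.add("notes.html", "Meeting notes: Java web application search");
        System.out.println(index.search("java search")); // prints [notes.html]
    }
}
```

The text passed to add() is assumed to be already extracted from the PDF, Word, or HTML file; the index itself only ever sees plain strings.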




Answer 2:


You could use Solr, which sits on top of Lucene and is a complete search server, whereas Lucene itself is a library. However, neither Solr nor Lucene parses Word documents, PDFs, etc. on its own to extract text or metadata. You need to extract that content first and then index each document according to a predefined document schema.
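
Because Solr runs as a server, a Java web application talks to it over HTTP instead of linking it in as a library. A minimal sketch of building a query URL for Solr's select handler might look like this. The host, the core name "documents", and the field name "content" are assumptions for illustration; use whatever your own schema defines.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Builds a URL for Solr's HTTP select handler.
// The base URL, core name, and field name used in main are hypothetical placeholders.
class SolrQueryUrl {
    static String build(String baseUrl, String core, String query, int rows) {
        // URL-encode the query so characters such as ':' and spaces are safe in the URL.
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return baseUrl + "/" + core + "/select?q=" + q + "&rows=" + rows;
    }

    public static void main(String[] args) {
        String url = build("http://localhost:8983/solr", "documents", "content:lucene", 10);
        System.out.println(url);
    }
}
```

In a real application you would more likely use a Solr client library than hand-built URLs; the sketch only shows the shape of the request.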




Answer 3:


As for extracting the text content of Office documents (which you need to do before giving it to Lucene), there is the Apache Tika project, which supports quite a few file formats, including Microsoft's.




Answer 4:


Using Tika, the code to get the text from a file is quite simple:

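import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringWriter;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

// Exception handling not shown; `file` is the java.io.File to extract text from.
Parser parser = new AutoDetectParser();
StringWriter textBuffer = new StringWriter();
InputStream input = new FileInputStream(file);
Metadata md = new Metadata();
// The file name is a hint that helps Tika detect the format.
md.set(Metadata.RESOURCE_NAME_KEY, file.getName());
// Tika 0.3 signature; later releases also take a ParseContext argument.
parser.parse(input, new BodyContentHandler(textBuffer), md);
input.close();
String text = textBuffer.toString();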

So far, Tika 0.3 seems to work great. Just throw any file at it and it will give you back whatever makes the most sense for that format. I have been able to get indexable text out of everything I've thrown at it so far, including PDFs and the new MS Office files. If some formats are problematic, I believe the problems lie mainly in extracting formatted text rather than raw plain text.




Answer 5:


Just an update:

There is another alternative to Solr called Elasticsearch. It's a project with good capabilities, similar to Solr, but schemaless.

Both projects are built on top of Lucene.
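
To illustrate what "schemaless" means in practice, here is a rough sketch of the HTTP request that would index a document in Elasticsearch without declaring any fields beforehand. The host, the index name "documents", the document ID, and the JSON field names are all assumptions for illustration, and the /_doc/ path reflects recent Elasticsearch versions (older releases used a different URL layout). The sketch only builds the request; it does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Builds (but does not send) an HTTP request that would index one JSON document.
// Host, index name, ID, and field names are hypothetical placeholders.
class EsIndexRequest {
    static HttpRequest build(String baseUrl, String index, String id, String json) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index + "/_doc/" + id))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) {
        // Fields are not declared in advance; the server infers a mapping from the JSON.
        String json = "{\"title\":\"report.pdf\",\"content\":\"text extracted by Tika\"}";
        HttpRequest request = build("http://localhost:9200", "documents", "1", json);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending it would be a matter of passing the request to java.net.http.HttpClient, or of using an Elasticsearch client library instead.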



Source: https://stackoverflow.com/questions/831738/what-is-the-best-approach-to-implement-search-for-searching-documents-pdf-xml
