What is the best approach to implement search for searching documents (PDF, XML, HTML, MS Word)?

Question

What would be a good way to implement search over documents in a Java web application?

Is 'tagged search' a good fit for this kind of search functionality?

Answer 1:

Why re-invent the wheel?

Check out Apache Lucene.

Also, search Stack Overflow for "full text search" and you'll find a lot of other very similar questions. Here's another one, for example: How do I implement Search Functionality in a website?
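To make concrete what a full-text search library does for you, here is a toy inverted index in plain Java. It is only a sketch of the general idea (map each term to the documents that contain it), not Lucene's API or implementation; the class name `ToyInvertedIndex`, its methods, and the sample file names are made up for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index (hypothetical, not Lucene): maps each lowercase term to document ids. */
public class ToyInvertedIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Splits text on non-alphanumeric characters and records each term for docId. */
    public void add(String docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of documents that contain every term of the query (AND semantics). */
    public Set<String> search(String query) {
        Set<String> result = null;
        for (String term : query.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<String> hits = postings.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(hits);   // first term: start from its postings
            } else {
                result.retainAll(hits);         // later terms: intersect
            }
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("report.pdf", "Quarterly sales report for the search team");
        index.add("notes.docx", "Meeting notes about full text search");
        System.out.println(index.search("search"));      // [notes.docx, report.pdf]
        System.out.println(index.search("full search")); // [notes.docx]
    }
}
```

A real library adds the parts that are hard to get right: analyzers, relevance ranking, phrase queries, and on-disk index storage, which is why reusing one is usually the better choice.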

Answer 2:

You could use Solr, which sits on top of Lucene and is a complete web search engine application, whereas Lucene is a library. However, neither Solr nor Lucene parses Word documents, PDFs, etc. to extract metadata. You have to index each document according to a pre-defined document schema.
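As a rough illustration of indexing against a pre-defined schema, the sketch below checks that the fields extracted from a file match a fixed field list before the document is handed to an indexer. This is plain Java, not Solr's schema format or API; the field names `id`, `title`, and `content` and the class `SchemaSketch` are assumptions chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical schema check, not Solr code: every document must supply exactly these fields. */
public class SchemaSketch {
    /** Assumed field list; a real schema would also declare field types and options. */
    public static final List<String> FIELDS = List.of("id", "title", "content");

    /** Returns the fields in schema order, rejecting missing or unknown fields. */
    public static Map<String, String> toIndexable(Map<String, String> extracted) {
        for (String key : extracted.keySet()) {
            if (!FIELDS.contains(key)) {
                throw new IllegalArgumentException("Unknown field: " + key);
            }
        }
        Map<String, String> doc = new LinkedHashMap<>();
        for (String field : FIELDS) {
            String value = extracted.get(field);
            if (value == null) {
                throw new IllegalArgumentException("Missing field: " + field);
            }
            doc.put(field, value);
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> extracted = Map.of(
                "content", "Text extracted from the file",
                "id", "report.pdf",
                "title", "Quarterly report");
        // Prints the keys in schema order: [id, title, content]
        System.out.println(toIndexable(extracted).keySet());
    }
}
```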

Answer 3:

As for extracting the text content of Office documents (which you need to do before giving it to Lucene), there is the Apache Tika project, which supports quite a few file formats, including Microsoft's.

Answer 4:

Using Tika, the code to get the text from a file is quite simple:

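import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringWriter;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

// "file" is the java.io.File to index; exception handling not shown
Parser parser = new AutoDetectParser();
StringWriter textBuffer = new StringWriter();
InputStream input = new FileInputStream(file);
Metadata md = new Metadata();
// Pass the file name as a hint for format detection
md.set(Metadata.RESOURCE_NAME_KEY, file.getName());
// Parse the stream, writing the extracted plain text into textBuffer
parser.parse(input, new BodyContentHandler(textBuffer), md);
input.close();
String text = textBuffer.toString();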

So far, Tika 0.3 seems to work well. Give it any file and it returns what makes the most sense for that format. I have been able to extract indexable text from everything I've tried so far, including PDFs and the new MS Office files. If some formats cause problems, I believe they mainly affect extraction of formatted text rather than raw plain text.

Answer 5:

An update:

There is an alternative to Solr called Elasticsearch. It is a project with capabilities similar to Solr's, but it is schemaless.

Both projects are built on top of Lucene.

Source: https://stackoverflow.com/questions/831738/what-is-the-best-approach-to-implement-search-for-searching-documents-pdf-xml

Tags

java

pdf

ms-word