Question
What would be a good way to implement search functionality for documents in a Java web application?
Is "tagged search" a good fit for this kind of search functionality?
Answer 1:
Why re-invent the wheel?
Check out Apache Lucene.
Also, search Stack Overflow for "full text search" and you'll find a lot of other very similar questions. Here's one, for example: "How do I implement Search Functionality in a website?"
Answer 2:
You could use Solr, which sits on top of Lucene and is a real web search engine application, whereas Lucene is a library. However, neither Solr nor Lucene parses Word documents, PDFs, etc. to extract text or metadata. You have to extract that content yourself and then index each document according to a predefined document schema.
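To give a rough idea of what "indexing a document according to a schema" means, here is a toy sketch using only the JDK. It is not Lucene's or Solr's API; the class `ToyFieldIndex`, its methods, and the field names (`title`, `body`) are made up for illustration. Each field of a document is split into lowercase terms, and the index records which documents contain which term in which field. Under that assumption, a "tags" field could be indexed the same way, which is roughly how a tagged search would fit in.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy illustration only: this is NOT Lucene's or Solr's API.
// ToyFieldIndex and its method names are hypothetical.
class ToyFieldIndex {
    // field name -> (term -> ids of documents containing that term in that field)
    private final Map<String, Map<String, Set<String>>> postings = new HashMap<>();

    // Index one field of one document: lowercase the text, split on non-alphanumerics.
    void add(String docId, String field, String text) {
        Map<String, Set<String>> terms = postings.computeIfAbsent(field, f -> new HashMap<>());
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                terms.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Return the ids of documents whose given field contains the term (sorted, possibly empty).
    Set<String> search(String field, String term) {
        Map<String, Set<String>> terms = postings.getOrDefault(field, Collections.emptyMap());
        Set<String> hits = terms.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
        return Collections.unmodifiableSet(hits);
    }

    public static void main(String[] args) {
        ToyFieldIndex index = new ToyFieldIndex();
        index.add("report.pdf", "title", "Quarterly Report");
        index.add("report.pdf", "body", "Revenue grew in the third quarter.");
        index.add("notes.docx", "body", "Meeting notes about the quarterly report.");
        System.out.println(index.search("body", "report"));  // prints [notes.docx]
        System.out.println(index.search("title", "report")); // prints [report.pdf]
    }
}
```

A real engine adds a lot on top of this (analyzers, stemming, relevance ranking, on-disk storage), which is the reason to use Lucene or Solr rather than rolling your own.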
Answer 3:
As for extracting the text content of Office documents (which you need to do before giving it to Lucene), there is the Apache Tika project, which supports quite a few file formats, including Microsoft's.
Answer 4:
Using Tika, the code to get the text from a file is quite simple:
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
// exception handling not shown; `file` is the java.io.File to extract text from
Parser parser = new AutoDetectParser();
StringWriter textBuffer = new StringWriter();
Metadata md = new Metadata();
// the file name is a hint that helps Tika detect the format
md.set(Metadata.RESOURCE_NAME_KEY, file.getName());
// close the stream once parsing is done
try (InputStream input = new FileInputStream(file)) {
    parser.parse(input, new BodyContentHandler(textBuffer), md);
}
String text = textBuffer.toString();
So far, Tika 0.3 seems to work great. Just throw any file at it and it will give you back what makes the most sense for that format. I have been able to get indexable text out of everything I've thrown at it so far, including PDFs and the new MS Office files. If there are problems with some formats, I believe they mainly concern formatted text extraction rather than raw plain text.
Answer 5:
Just an update:
There is an alternative to Solr called "ElasticSearch". It's a project with good capabilities, similar to Solr, but schemaless.
Both projects are built on top of Lucene.
Source: https://stackoverflow.com/questions/831738/what-is-the-best-approach-to-implement-search-for-searching-documents-pdf-xml