How might I index PDF files using Lucene.Net?

I'm looking for some sample code demonstrating how to index PDF documents using Lucene.Net and C#. Google turned up a few examples, but none that I found helpful.

From my understanding, Lucene is limited to creating an index and searching that index. It's up to the application to handle opening files and extracting their contents for the index. So if you're looking to search PDF documents you'll want to use something like iTextSharp to open the file, pull out the contents, and pass it to Lucene for indexing. There are some good starting examples of using Lucene on the Dimecasts.net website.
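To make the "pass it to Lucene" step concrete, here is a minimal sketch of adding already-extracted text to an index. It assumes the Lucene.Net 3.0.x API (newer 4.8 releases use IndexWriterConfig and different field classes, so this will not compile there). The class name PdfIndexer, the method AddToIndex, its parameters, and the field names "path" and "contents" are made up for illustration; the text is expected to come from whatever PDF library you use.

```csharp
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

// Hypothetical helper: not part of Lucene.Net or iTextSharp.
public static class PdfIndexer
{
    // indexPath: folder that holds the index (assumed name)
    // pdfPath: path of the source PDF, stored so search hits can point back to it
    // extractedText: text already pulled out of the PDF by the application
    public static void AddToIndex(string indexPath, string pdfPath, string extractedText)
    {
        using (var directory = FSDirectory.Open(new DirectoryInfo(indexPath)))
        using (var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30))
        using (var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            var doc = new Document();

            // Stored verbatim and not tokenized, so it can be read back from a hit.
            doc.Add(new Field("path", pdfPath, Field.Store.YES, Field.Index.NOT_ANALYZED));

            // Tokenized for full-text search; not stored, to keep the index small.
            doc.Add(new Field("contents", extractedText, Field.Store.NO, Field.Index.ANALYZED));

            writer.AddDocument(doc);
            writer.Commit();
        }
    }
}
```

A call would look something like PdfIndexer.AddToIndex(@"C:\index", pdfPath, text), where both arguments after the index folder are supplied by your own extraction code. This IndexWriter constructor creates the index if it does not exist and appends to it otherwise, so the method can be called once per PDF; for large batches you would probably keep one writer open rather than reopening it per file.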

arachnode.net

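// Requires: using System.IO; using System.Text;
// using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
byte[] pdfBytes = File.ReadAllBytes(pdfPath); // pdfPath: placeholder for the path to the .pdf

StringBuilder stringBuilder = new StringBuilder();

PdfReader pdfReader = new PdfReader(pdfBytes);

// iTextSharp page numbers start at 1.
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    stringBuilder.Append(PdfTextExtractor.GetTextFromPage(pdfReader, page)).Append(' ');
}

pdfReader.Close();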

(using iTextSharp)

The rest isn't as succinctly illustrated.

The product demo on my site includes code that shows how to use Lucene.Net, but it is a little long to post here.

Here is the code as it pertains to my product: https://svn.arachnode.net/svn/arachnodenet/trunk/Plugins/CrawlActions/ManageLuceneDotNetIndexes.cs (Username/Password: Public)

Source: https://stackoverflow.com/questions/1275722/how-might-i-index-pdf-files-using-lucene-net

Tags: lucene.net, implementation
