How to index Word 2003, 2007 and 2010 documents using Lucene.NET

Submitted by 旧城冷巷雨未停 on 2019-12-04 09:40:01

Question


I am writing a custom Lucene.NET indexer to enable indexing of MS Word documents. The indexer must be capable of handling the last three releases of MS Word: 2010, 2007 and 2003.

The plan is to use VSTO interop assemblies that are installed as part of VS2010 to extract text content from the documents.

Is there a better way to implement Word document indexing? Does this mean I will have to install all three versions of Word on the server? Or just Word 2010?

Tools/Environment:

  • Lucene.NET 2.3.1.3
  • VS2010 / .NET 3.5
  • Windows 2008 / IIS 7

Note: For details on how to implement this, see Sitecore text search in PDF or Word documents


Answer 1:


You could use IFilter plugins to retrieve the contents of the documents and then index them. The interface was originally part of Microsoft Indexing Service but is generally available for indexing documents.

I looked into the technology a couple of years ago and seem to remember that the filters for Office documents were either built into Windows or could be installed separately from the complete Office package, but I may be wrong here.

More about the IFilter technology can be found at IFilter at Wikipedia and IFilter at MSDN. You will have to look into P/Invoke and might get some inspiration from IFilter at pinvoke.net.

A sample in C# can be found at MSDN Code Gallery.
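
Independent of that C# sample, the overall flow (extract plain text from each document, then feed it to the index) can be sketched in a few lines. This is only a rough illustration in Python, not Lucene.NET or IFilter code: `Extractor`, `build_index` and `fake_extract` are hypothetical names, the extractor returns canned text where a real one would wrap IFilter through P/Invoke, and the dictionary is a toy stand-in for a Lucene index.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, Set

# Assumption: an extractor takes a file path and yields plain-text chunks.
# In the real indexer this role would be played by an IFilter wrapper.
Extractor = Callable[[str], Iterable[str]]


def build_index(paths: Iterable[str], extract: Extractor) -> Dict[str, Set[str]]:
    """Map each lowercase term to the set of paths whose text contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for path in paths:
        for chunk in extract(path):
            # Toy analyzer: lowercase, then split on non-word characters.
            for term in re.findall(r"\w+", chunk.lower()):
                index[term].add(path)
    return dict(index)


def fake_extract(path: str) -> Iterable[str]:
    """Hypothetical stand-in that returns canned text instead of calling IFilter."""
    canned = {
        "report.doc": ["Quarterly report", "written in Word 2003"],
        "notes.docx": ["Meeting notes", "written in Word 2010"],
    }
    return canned.get(path, [])
```

The point of the split is that the indexing code never needs to know which Word version produced a file; only the extractor does, so swapping the fake for an IFilter-backed one should leave the rest unchanged.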



Source: https://stackoverflow.com/questions/4014337/how-to-index-word-2003-2007-and-2010-documents-using-lucene-net
