Text indexing algorithm

天命终不由人 2020-12-16 03:23

I am writing a C# WinForms application for an archiving system. The system has a huge database in which some tables hold more than 1.5 million records. What I need is an efficient way to index the text of these records so they can be searched quickly.

3 answers
  • 2020-12-16 03:45

    It looks like you need two things. First, you need a system that actually performs the indexing. For this you can go with Lucene, or with Apache Solr as Mikos mentioned. You might also want to check out Sphinx, another full-text search engine. Alternatively, you could use the full-text features built into your database: both SQL Server and MySQL have full-text indexing capabilities, as do many other databases.

    The second thing you need is a way to get the text out of the files. For things like txt and HTML files this is easy, because most full-text search engines will accept them as regular text. For more complicated binary documents like MS Word or PDF, you'll have to find another way to extract the text.
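To illustrate that second step, getting plain text out of files before indexing, here is a minimal Java sketch that strips tags from an HTML document with regular expressions. This is only a toy extractor for illustration; the class and method names are invented for the example, and real pipelines should use a proper HTML parser, or a toolkit such as Tika or the IFilter API for binary formats.

```java
import java.util.regex.Pattern;

public class TextExtract {
    // Drop script/style blocks first (their contents are not document text),
    // then strip all remaining tags. Naive, but enough to feed an indexer.
    static final Pattern SCRIPTS = Pattern.compile("(?is)<(script|style)[^>]*>.*?</\\1>");
    static final Pattern TAGS = Pattern.compile("<[^>]+>");

    static String extract(String html) {
        String noScripts = SCRIPTS.matcher(html).replaceAll(" ");
        String noTags = TAGS.matcher(noScripts).replaceAll(" ");
        // Collapse runs of whitespace left behind by the removed markup.
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p{color:red}</style></head>"
                    + "<body><p>Hello <b>archive</b> world</p></body></html>";
        System.out.println(extract(html)); // Hello archive world
    }
}
```

The extracted string can then be handed to whichever engine you choose (Lucene, Solr, Sphinx, or the database's full-text index) as ordinary text.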

  • 2020-12-16 03:53

    In my opinion: partition the table, index it on the IDs, and then perform the search.

  • 2020-12-16 03:58

    You need to create what is known as an inverted index, which is at the core of how search engines work (à la Google). Apache Lucene is arguably the best library for inverted indexing. You have two options:

    1. Lucene.net - a .NET port of the Java Lucene library.

    2. Apache Solr - a full-fledged search server built on the Lucene libraries, and easy to integrate into your .NET application because it exposes a RESTful API. It comes out of the box with features such as caching, scaling, and spell-checking. You can simplify your app-to-Solr interaction with the excellent SolrNet library.

    For getting the text out of your documents in the first place, Apache Tika offers a very extensive data/metadata extraction toolkit that works with PDF, HTML, MS Office documents, etc. A simpler option is to use the IFilter API. See this article for more details.
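To make the inverted-index concept concrete, here is a minimal Java sketch: each term maps to a "posting list" of document ids, and an AND query intersects those lists. This is only a conceptual toy with invented names; Lucene layers text analysis, relevance ranking, compression, and on-disk segment files on top of this same core idea.

```java
import java.util.*;

public class InvertedIndex {
    // term -> sorted set of document ids containing it (the posting list)
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    public void add(int docId, String text) {
        // Lowercase and split on non-word characters: a crude analyzer.
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    // AND query: return only documents that contain every query term.
    public Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String token : query.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            Set<Integer> docs = postings.getOrDefault(token, new TreeSet<>());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add(1, "archive scanned records");
        idx.add(2, "scanned invoice records");
        idx.add(3, "meeting notes");
        System.out.println(idx.search("scanned records")); // [1, 2]
    }
}
```

Because lookups touch only the posting lists of the query terms instead of scanning all rows, this structure stays fast even at the 1.5-million-record scale mentioned in the question.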
