Text indexing algorithm

天命终不由人 2020-12-16 03:23

I am writing a C# WinForms application for an archiving system. The system has a huge database in which some tables hold more than 1.5 million records. What I need is an efficient way to index the text of these records so they can be searched quickly.

3 answers
  • 2020-12-16 03:45

    It looks like you need two things. First, you need a system that actually performs the indexing. For this you can go with Lucene, or with Apache Solr as Mikos mentioned. You might also want to check out Sphinx, another full-text search engine. Alternatively, you could use the full-text features built into your database: both SQL Server and MySQL have full-text indexing capabilities, as do many other databases.

    The second thing you need is a way to get the text out of the files. For things like txt and HTML files this is easy, because most full-text search engines will accept them as regular text. For more complicated binary documents like MS Word or PDF, you'll have to find another way to extract the text.
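To illustrate that second step, getting plain text out of files before indexing, here is a minimal Java sketch that strips tags from an HTML document with regular expressions. This is only a toy extractor for illustration; the class and method names are invented for the example, and real pipelines should use a proper HTML parser, or a toolkit such as Tika or the IFilter API for binary formats.

```java
import java.util.regex.Pattern;

public class TextExtract {
    // Drop script/style blocks first (their contents are not document text),
    // then strip all remaining tags. Naive, but enough to feed an indexer.
    static final Pattern SCRIPTS = Pattern.compile("(?is)<(script|style)[^>]*>.*?</\\1>");
    static final Pattern TAGS = Pattern.compile("<[^>]+>");

    static String extract(String html) {
        String noScripts = SCRIPTS.matcher(html).replaceAll(" ");
        String noTags = TAGS.matcher(noScripts).replaceAll(" ");
        // Collapse runs of whitespace left behind by the removed markup.
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p{color:red}</style></head>"
                    + "<body><p>Hello <b>archive</b> world</p></body></html>";
        System.out.println(extract(html)); // Hello archive world
    }
}
```

The extracted string can then be handed to whichever engine you choose (Lucene, Solr, Sphinx, or the database's full-text index) as ordinary text.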

  • 2020-12-16 03:53

    In my opinion: partition the table, index it on the IDs, and then perform the search.

  • 2020-12-16 03:58

    You need to create what is known as an inverted index, which is at the core of how search engines work (à la Google). Apache Lucene is arguably the best library for inverted indexing. You have two options:

    1. Lucene.net - a .NET port of the Java Lucene library.

    2. Apache Solr - a full-fledged search server built on the Lucene libraries, and easy to integrate into your .NET application because it exposes a RESTful API. It comes out of the box with features such as caching, scaling, and spell-checking. You can simplify your app-to-Solr interaction with the excellent SolrNet library.

    For getting the text out of your documents in the first place, Apache Tika offers a very extensive data/metadata extraction toolkit that works with PDF, HTML, MS Office documents, etc. A simpler option is to use the IFilter API. See this article for more details.
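To make the inverted-index concept concrete, here is a minimal Java sketch: each term maps to a "posting list" of document ids, and an AND query intersects those lists. This is only a conceptual toy with invented names; Lucene layers text analysis, relevance ranking, compression, and on-disk segment files on top of this same core idea.

```java
import java.util.*;

public class InvertedIndex {
    // term -> sorted set of document ids containing it (the posting list)
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    public void add(int docId, String text) {
        // Lowercase and split on non-word characters: a crude analyzer.
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    // AND query: return only documents that contain every query term.
    public Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String token : query.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            Set<Integer> docs = postings.getOrDefault(token, new TreeSet<>());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add(1, "archive scanned records");
        idx.add(2, "scanned invoice records");
        idx.add(3, "meeting notes");
        System.out.println(idx.search("scanned records")); // [1, 2]
    }
}
```

Because lookups touch only the posting lists of the query terms instead of scanning all rows, this structure stays fast even at the 1.5-million-record scale mentioned in the question.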
