How to optimize MySQL Boolean Full-Text Search? (Or what to replace it with?) - C#


I have a table that contains approximately 22000 rows and I used a Boolean Full-Text Search in order to find what I'm interested in. My problem is that I created a 'dynamic

1 answer

情歌与酒

2020-12-15 14:58
First, you should realize that RDBMS support for full text indexing is a hack to force a technology designed to allow efficient access to structured data to deal with unstructured text. (Yes, that's just my opinion. If required, I can defend it as I understand both technologies extremely well. ;)

So, what can be done to improve search performance?

Option One - "The Best Tool For The Task"

The best way to handle full-text search within a corpus of documents is to use technology specifically designed to do so, such as Solr (Lucene) from Apache or Sphinx from, err, Sphinx.

For reasons that will become clear below, I strongly recommend this approach.

Option Two - Preload Your Results

When constructing text-based search solutions, the usual approach is to index all documents into a single searchable index and while this might be the most expedient, it is not the only approach.

Assuming what you're searching for can be easily quantified into a set of known rules, you could offer more of a "guided" style of search than simply unqualified full-text search. What I mean by this is, if your application might benefit from guiding users to results, you can preload various sets of results based on a known set of rules into their own tables, and thus reduce the bulk of data to be searched.

If you expect a majority of your users will benefit from a known set of search terms in a known order, you can construct your search UI to favor those terms.

So assuming a majority of users are looking for a variety of automobile, you might offer predefined searches based on model, year, condition, etc. Your search UI would be crafted as a series of dropdown menus to "guide" users to specific results.

Or if a majority of searches will be for a specific main topic (say 'automobiles') you could predefine a table of only those records you've previously identified as being related to automobiles.

Both of these approaches would reduce the number of records to be searched and so improve response times.
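As a rough sketch of the "predefined table" idea, here is one way it might look. Everything named here is an assumption for illustration: a hypothetical `records` table with `id` and `body` columns, a hypothetical `automobiles` table to hold the preloaded subset, and SQLite (via Python's built-in `sqlite3`) standing in for MySQL so the example runs on its own.

```python
import sqlite3

def preload_topic(conn, keywords):
    """Copy records mentioning any keyword into the smaller automobiles table."""
    conn.execute("DELETE FROM automobiles")  # rebuild the subset from scratch
    if not keywords:
        return
    # One LIKE test per keyword; the values are bound as parameters.
    where = " OR ".join("body LIKE ?" for _ in keywords)
    conn.execute(
        f"INSERT INTO automobiles (id, body) SELECT id, body FROM records WHERE {where}",
        [f"%{k}%" for k in keywords],
    )

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE automobiles (id INTEGER PRIMARY KEY, body TEXT);
""")
conn.executemany(
    "INSERT INTO records (id, body) VALUES (?, ?)",
    [(1, "used sedan, 2015, good condition"),
     (2, "vintage guitar for sale"),
     (3, "pickup truck, low mileage")],
)

preload_topic(conn, ["sedan", "truck"])
auto_ids = [row[0] for row in conn.execute("SELECT id FROM automobiles ORDER BY id")]
print(auto_ids)
```

Searches that are known to be about automobiles can then run against the smaller table instead of the full corpus. The preload step would be rerun (or maintained incrementally) whenever the source records change.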

Option Three - "Roll Your Own"

If you cannot integrate an external search technology into your project and preloading isn't an option, there are still ways to vastly improve search query response times, but they differ based on what you need to accomplish and how you expect searches to be carried out.

If you expect users to search using single keywords or phrases and boolean relationships between them, you might consider constructing your own 'inverted index' of your corpus. (This is what MySQL's Boolean Full-Text Search already does, but doing it yourself allows greater control over both the speed and accuracy of search.)

To build an inverted index from your existing data:

Step 1. Create three tables
```
    -- dict: a dictionary containing one row per unique word in corpus
    -- (varchar needs a length in MySQL; 64 is just an assumed value)
    create table dict (
      id int primary key,
      word varchar(64)
    );

    -- invert: an inverted index to map words to records in corpus
    create table invert (
      id int primary key,
      rec_id int,
      word_id int
    );

    -- stopwords: words to ignore when indexing (like a, an, the, etc.)
    create table stopwords (
      id int primary key,
      word varchar(64)
    );
```
Note: This is just a sketch. You'll want to add indexes and constraints, etc. when you actually create these tables.

The stopwords table is used to reduce the size of your index to only those words that matter to users' expected queries. For example, it's rarely useful to index English articles, like 'a', 'an', 'the', since they do not contribute useful meaning to keyword searches.

Typically, you'll require a stopword list specifically crafted to the needs of your application. If you never expect users to include the terms 'red', 'white' or 'blue' in their queries or if these terms appear in every searchable record, you would want to add them to your stopword list.

See the note at the end of this message for instructions on using your own stopwords list in MySQL.

See also:
- The current list of stopwords supported in MySQL
- A good starting stopword list in English

Step 2. Build the Inverted Index

To build an inverted index from your existing records, you'll need to (pseudo-code):
```
    foreach( word(w) in record(r) ) {
      if(w is not in stopwords) {
        if( w does not exist in dictionary) {
          insert w to dictionary at w.id
        }
        insert (r.id, w.id) into inverted_index
      }
    }
```
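The loop above could be sketched in runnable form roughly as follows. This is only an illustration under assumptions: SQLite (Python's built-in `sqlite3`) stands in for MySQL, the stopword list is passed in as a Python set rather than read from the `stopwords` table, words are split with a simple regular expression, and `build_index` is a name I made up. `INSERT OR IGNORE` is SQLite syntax; MySQL spells it `INSERT IGNORE`.

```python
import re
import sqlite3

def build_index(conn, records, stopwords):
    """Populate dict and invert from an iterable of (rec_id, text) pairs."""
    for rec_id, text in records:
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word in stopwords:
                continue
            # Add the word to the dictionary only if it is not already there.
            conn.execute("INSERT OR IGNORE INTO dict (word) VALUES (?)", (word,))
            word_id = conn.execute(
                "SELECT id FROM dict WHERE word = ?", (word,)
            ).fetchone()[0]
            # One posting per occurrence of the word in the record.
            conn.execute(
                "INSERT INTO invert (rec_id, word_id) VALUES (?, ?)", (rec_id, word_id)
            )

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dict (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE invert (id INTEGER PRIMARY KEY, rec_id INTEGER, word_id INTEGER);
""")

build_index(
    conn,
    [(1, "The red car"), (2, "A red bike and a car")],
    {"a", "an", "the", "and"},
)
postings = conn.execute(
    "SELECT d.word, i.rec_id FROM invert AS i "
    "JOIN dict AS d ON d.id = i.word_id ORDER BY d.word, i.rec_id"
).fetchall()
print(postings)
```

Note that this stores one row per occurrence, so a word that appears twice in a record produces two postings. Queries that count matches need to allow for that.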
More on stopwords:
Instead of using a specific stopword list, the 'if(w is not in stopwords)' test could make other decisions either instead of or as an adjunct to your list of unacceptable words.

Your application might wish to filter out all words less than 4 characters long or to only include words from a predefined set.
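A minimal sketch of such a test, combining a stopword list, a minimum length and an optional whitelist. The function name `should_index` and its parameters are made up for illustration; adjust the rules to whatever your application needs.

```python
def should_index(word, stopwords=frozenset(), min_len=1, allowed=None):
    """Return True if word should be added to the inverted index."""
    if len(word) < min_len:
        return False          # too short to be worth indexing
    if allowed is not None and word not in allowed:
        return False          # a whitelist is in use and the word is not on it
    return word not in stopwords

words = ["the", "red", "automobile", "is", "fast"]

# Drop words shorter than 4 characters, plus one application-specific stopword.
kept = [w for w in words if should_index(w, stopwords={"fast"}, min_len=4)]

# Only index words from a predefined set.
whitelisted = [w for w in words if should_index(w, allowed={"red", "fast"})]
print(kept, whitelisted)
```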

By creating your own inverted index, you gain far greater and finer-grained control over search.

Step 3. Query the Inverted Index Using SQL

This step really depends on how you expect queries to be submitted to your index.

If queries are to be 'hard-coded', you can simply write the SELECT statement yourself. If you need to support user-entered queries, you'll need to convert whatever query language you choose into an SQL statement (typically done using a simple parser).

Assuming you wish to retrieve all documents matching the logical query '(word1 AND word2) OR word3', a possible approach might be:
```
CREATE TEMPORARY TABLE temp_results AS
    -- (word1 AND word2): records containing both distinct words
    SELECT I.rec_id, COUNT(DISTINCT I.word_id) AS hits
      FROM invert AS I
      JOIN dict AS D ON I.word_id = D.id
      WHERE D.word = 'word1' OR D.word = 'word2'
      GROUP BY I.rec_id
      HAVING hits = 2
    UNION
    -- OR word3: records containing word3
    SELECT I.rec_id, 1 AS hits
      FROM invert AS I
      JOIN dict AS D ON I.word_id = D.id
      WHERE D.word = 'word3';

SELECT DISTINCT rec_id FROM temp_results;

DROP TEMPORARY TABLE temp_results;
```
NOTE: This is just a first pass off the top of my head. I am confident there are more efficient ways of converting a boolean query expression into an efficient SQL statement and welcome any and all suggestions for improvement.
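One possible alternative, sketched below, is a small recursive-descent parser that turns an AND/OR expression with parentheses into a single SELECT built from `rec_id IN (...)` subqueries, so no temporary table is needed. This is an illustration rather than a tuned solution: `to_sql` and `WORD_SQL` are names I made up, the operators must be written as uppercase AND and OR, search words are lowercased on the assumption that the dictionary stores lowercase words, and SQLite (Python's built-in `sqlite3`) stands in for MySQL for the demo.

```python
import re
import sqlite3

# Condition that is true for records containing one given word.
WORD_SQL = ("rec_id IN (SELECT i.rec_id FROM invert AS i "
            "JOIN dict AS d ON d.id = i.word_id WHERE d.word = ?)")

def to_sql(query):
    """Translate e.g. '(a AND b) OR c' into (sql, params)."""
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    params = []
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        if tok is None:
            raise ValueError("unexpected end of query")
        pos += 1
        return tok

    def factor():                      # factor := WORD | '(' expr ')'
        tok = take()
        if tok == "(":
            inner = expr()
            if take() != ")":
                raise ValueError("expected ')'")
            return inner
        if tok in (")", "AND", "OR"):
            raise ValueError(f"unexpected token {tok!r}")
        params.append(tok.lower())     # bound as a parameter, never interpolated
        return WORD_SQL

    def term():                        # term := factor ('AND' factor)*
        left = factor()
        while peek() == "AND":
            take()
            left = f"({left} AND {factor()})"
        return left

    def expr():                        # expr := term ('OR' term)*
        left = term()
        while peek() == "OR":
            take()
            left = f"({left} OR {term()})"
        return left

    where = expr()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return f"SELECT DISTINCT rec_id FROM invert WHERE {where}", params

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dict (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE invert (id INTEGER PRIMARY KEY, rec_id INTEGER, word_id INTEGER);
    INSERT INTO dict (id, word) VALUES (1, 'word1'), (2, 'word2'), (3, 'word3');
    INSERT INTO invert (rec_id, word_id)
        VALUES (10, 1), (10, 2), (11, 1), (12, 3), (13, 2);
""")

sql, params = to_sql("(word1 AND word2) OR word3")
matches = sorted(row[0] for row in conn.execute(sql, params))
print(matches)
```

Because AND binds tighter than OR in this grammar, `word1 AND word2 OR word3` parses the same way as the parenthesized form. Whether nested IN subqueries perform well enough will depend on your indexes and data, so measure before relying on it.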

To search for phrases, you'll need to add a field to the inverted index recording the position at which each word appeared within its record, and factor that into your SELECT.
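Here is a sketch of how that might work, assuming a hypothetical `pos` column on `invert` that holds each word's zero-based position within its record. The phrase is matched by self-joining `invert` once per word and requiring consecutive positions. `find_phrase` is an illustrative name, and SQLite (Python's built-in `sqlite3`) stands in for MySQL.

```python
import sqlite3

def find_phrase(conn, phrase):
    """Return ids of records containing the words of phrase at consecutive positions."""
    words = phrase.lower().split()
    if not words:
        return []
    sql = ["SELECT DISTINCT i0.rec_id FROM invert AS i0",
           "JOIN dict AS d0 ON d0.id = i0.word_id AND d0.word = ?"]
    for k in range(1, len(words)):
        # The k-th word must sit k positions after the first word, in the same record.
        sql.append(f"JOIN invert AS i{k} ON i{k}.rec_id = i0.rec_id "
                   f"AND i{k}.pos = i0.pos + {k}")
        sql.append(f"JOIN dict AS d{k} ON d{k}.id = i{k}.word_id AND d{k}.word = ?")
    sql.append("ORDER BY i0.rec_id")
    return [row[0] for row in conn.execute(" ".join(sql), words)]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dict (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE invert (id INTEGER PRIMARY KEY, rec_id INTEGER,
                         word_id INTEGER, pos INTEGER);
    INSERT INTO dict (id, word) VALUES (1, 'red'), (2, 'car'), (3, 'fast');
    -- record 1: "red car", record 2: "car red", record 3: "fast red car"
    INSERT INTO invert (rec_id, word_id, pos) VALUES
        (1, 1, 0), (1, 2, 1),
        (2, 2, 0), (2, 1, 1),
        (3, 3, 0), (3, 1, 1), (3, 2, 2);
""")

red_car = find_phrase(conn, "red car")
print(red_car)
```

If stopwords are skipped during indexing, you'd need to decide whether positions count the skipped words or not, since that changes which phrases are considered adjacent.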

And finally, you'll need to update your inverted index as you add new records or delete old ones.
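A minimal sketch of that maintenance, under the same assumptions as before: SQLite (Python's built-in `sqlite3`) standing in for MySQL, a simple regular-expression tokenizer, no stopword filtering, and made-up function names. Updating a record is treated as removing its old postings and indexing the new text in one transaction.

```python
import re
import sqlite3

def add_record(conn, rec_id, text):
    """Index the words of a new record."""
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        conn.execute("INSERT OR IGNORE INTO dict (word) VALUES (?)", (word,))
        conn.execute(
            "INSERT INTO invert (rec_id, word_id) SELECT ?, id FROM dict WHERE word = ?",
            (rec_id, word),
        )

def remove_record(conn, rec_id):
    """Delete a record's postings, then any dictionary words nothing refers to."""
    conn.execute("DELETE FROM invert WHERE rec_id = ?", (rec_id,))
    conn.execute("DELETE FROM dict WHERE id NOT IN (SELECT word_id FROM invert)")

def update_record(conn, rec_id, text):
    """Re-index a changed record; commits on success, rolls back on error."""
    with conn:
        remove_record(conn, rec_id)
        add_record(conn, rec_id, text)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dict (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE invert (id INTEGER PRIMARY KEY, rec_id INTEGER NOT NULL,
                         word_id INTEGER NOT NULL);
""")

add_record(conn, 1, "red car")
add_record(conn, 2, "red bike")
update_record(conn, 1, "blue car")

words = [row[0] for row in conn.execute("SELECT word FROM dict ORDER BY word")]
print(words)
```

Pruning unused dictionary words on every delete is optional. On a large index it may be cheaper to leave them in place and clean up periodically.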

Final Word

"Full text search" falls under a very large area of research known as "Information Retrieval" or IR and there are many books on the subject, including:
- Information Retrieval: Implementing and Evaluating Search Engines by Stefan Büttcher, Charles L. A. Clarke and Gordon V. Cormack (Jul 23, 2010)
- Search Engines: Information Retrieval in Practice by Bruce Croft, Donald Metzler and Trevor Strohman (Feb 16, 2009)
- Building Search Applications: Lucene, LingPipe, and Gate by Manu Konchady (Jun 2008)
Check Amazon for more.

Notes

How To Use Your Own List of Stopwords in MySQL

To use your own stopword list in MySQL:
1. Create your own list of stopwords, one word per line, and save it to a known location on your server, say: /usr/local/lib/IR/stopwords.txt
2. Edit my.cnf to add or update the following lines:
```
    [mysqld]  
    ft_min_word_len=1    
    ft_max_word_len=40  
    ft_stopword_file=/usr/local/lib/IR/stopwords.txt
```
  which will set the minimum and maximum length of legal words to 1 and 40, respectively, and tell mysqld where to find your custom list of stopwords.
  
  (Note: the default ft_max_word_len is 84, which I believe is pretty excessive and can cause runs of strings that are not real words to be indexed.)
3. Restart mysqld
4. Drop and recreate all full-text related indices