What is the best way to implement a substring search in SQL?

失恋的感觉 2020-12-08 11:55

We have a simple SQL problem here. In a varchar column, we want to search for a string anywhere in the field. What is the best way to implement this for performance? Obvio

4 answers
  •  [愿得一人]
    2020-12-08 12:32

    Check out my presentation, "Practical Fulltext Search in MySQL."

    I compared:

    • LIKE predicates
    • Regular expression predicates (no better than LIKE)
    • MyISAM FULLTEXT indexing
    • Sphinx Search
    • Apache Lucene
    • Inverted indexing
    • Google Custom Search Engine

    Today what I would use is Apache Solr, which puts Lucene into a service with a bunch of extra features and tools.


    Re your comment: Aha, okay, no. None of the fulltext search capabilities I mentioned are going to help, since they all assume some kind of word boundaries.

    The other way to efficiently find arbitrary substrings is the N-gram approach. Basically, create an index of all possible sequences of N letters and point to the strings where each respective sequence occurs. Typically this is done with N=3, or a trigram, because it's a point of compromise between matching longer substrings and keeping the index to a manageable size.
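
    To make the idea concrete, here is a minimal in-memory sketch in Python rather than SQL. The function names and the sample documents are hypothetical, chosen only for illustration; the sketch slides a window of N characters over each string and records which documents contain each sequence.

```python
from collections import defaultdict


def trigrams(text, n=3):
    """Return the set of all length-n substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(documents, n=3):
    """Map each n-gram to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for gram in trigrams(text, n):
            index[gram].add(doc_id)
    return index


# Hypothetical sample data: document id -> text.
docs = {1: "abcdef", 2: "xxabcx", 3: "hello"}
index = build_index(docs)
```

    Looking up `index["abc"]` then gives the ids of the documents that contain that trigram, without scanning every string.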

    I don't know of any SQL database that supports N-gram indexing transparently, but you could set it up yourself using an inverted index:

    create table trigrams (
      trigram char(3) primary key
    );
    
    create table trigram_matches (
      trigram char(3),
      document_id int,
      primary key (trigram, document_id),
      foreign key (trigram) references trigrams(trigram),
      foreign key (document_id) references mytable(document_id)
    );
    

    Now populate it the hard way (this assumes the trigrams table already contains every trigram you want to index):

    insert into trigram_matches
      select t.trigram, d.document_id
      from trigrams t join mytable d
        on d.textcolumn like concat('%', t.trigram, '%');
    

    Of course this will take quite a while! But once it's done, you can search much more quickly:

    select d.*
    from mytable d join trigram_matches t
      on t.document_id = d.document_id
    where t.trigram = 'abc';
    

    Your search pattern may well be longer than three characters, but the inverted index still helps to narrow your search a lot:

    select d.*
    from mytable d join trigram_matches t
      on t.document_id = d.document_id
    where t.trigram = 'abc'
      and d.textcolumn like '%abcdef%';
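
    The query above filters on a single trigram of the pattern. One possible refinement, sketched here in Python with an in-memory index instead of SQL, is to require every trigram of the pattern and only then run the exact substring check. All names and the sample data are hypothetical, and this is an illustration of the narrowing step under those assumptions, not a drop-in replacement for the query.

```python
from collections import defaultdict


def trigrams(text, n=3):
    """Return the set of all length-n substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def search(pattern, documents, index):
    """Return sorted ids of documents that contain pattern as a substring."""
    grams = trigrams(pattern)
    if not grams:
        # Pattern shorter than three characters: the index cannot help,
        # so fall back to checking every document.
        candidates = set(documents)
    else:
        # A matching document must contain every trigram of the pattern.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Final exact check, playing the role of LIKE '%pattern%'.
    return sorted(d for d in candidates if pattern in documents[d])


# Hypothetical sample data and its trigram index.
documents = {1: "xxabcdefxx", 2: "abcxxdef", 3: "defabc"}
index = defaultdict(set)
for doc_id, text in documents.items():
    for gram in trigrams(text):
        index[gram].add(doc_id)

matches = search("abcdef", documents, index)
```

    All three sample documents contain the trigrams 'abc' and 'def', but only one contains 'bcd' and 'cde' as well, so the exact check runs on far fewer candidates.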
    
