How to do an accent- and case-insensitive search in the MediaWiki database?


The MediaWiki TitleKey extension is basically designed for this, but it only does case-folding. However, if you don't mind hacking it a bit, and have the PHP iconv extension installed, you could edit TitleKey_body.php and replace the method:

static function normalize( $text ) {
    global $wgContLang;
    return $wgContLang->caseFold( $text );
}

with e.g.:

static function normalize( $text ) {
    // Transliterate to plain ASCII (stripping accents), then uppercase.
    return strtoupper( iconv( 'UTF-8', 'US-ASCII//TRANSLIT', $text ) );
}

and (re)run rebuildTitleKeys.php to regenerate the stored keys.

The TitleKey extension stores its normalized titles in a separate table, unsurprisingly named titlekey. It's intended to be accessed through the MediaWiki search interface, but if you want, you can certainly query it directly too, e.g. like this:

SELECT page.* FROM page
  JOIN titlekey ON tk_page = page_id
WHERE tk_namespace = 0 AND tk_key = 'SOMETHING';
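
Assuming the iconv-based normalize() above, keys are accent-stripped and uppercased before storage, so one key matches every variant of a title. For instance, a page hypothetically titled 'Ñandú' would be stored under the key 'NANDU':

SELECT page.* FROM page
  JOIN titlekey ON tk_page = page_id
WHERE tk_namespace = 0 AND tk_key = 'NANDU';
-- matches 'Ñandú', 'nandu', 'ÑANDÛ', ...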

I found the perfect solution, with no modifying or creating of tables. It might have performance implications (I didn't test), but as I stated in my question, it's a table of only ~2K rows, so it shouldn't matter much.

The root of the problem is that MediaWiki stores UTF8-encoded text in tables declared as latin1. It doesn't matter much to MediaWiki, since it's aware of this and will always query the database with the correct charset and do its thing, essentially using MySQL as a dumb bit container. It does this because apparently UTF8 support in MySQL is not adequate for its needs (see the comments in MediaWiki's DefaultSettings.php around the variable $wgDBmysql5).
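
You can see the mismatch for yourself; this is just a diagnostic, and the exact output depends on your MediaWiki version and configuration:

SHOW CREATE TABLE page;
-- The page_title column is declared as latin1 or binary, not utf8.

SELECT page_title, HEX(page_title) FROM page LIMIT 5;
-- Accented titles show multi-byte UTF-8 sequences (e.g. C3A9 for 'é').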

The problem appears when you want the database itself to be able to perform UTF8-aware operations (like I wanted to do in my question). You won't be able to do that because as far as MySQL knows, it's not storing UTF8-encoded text (although it is, as explained in the previous paragraph).

There's an obvious solution for this: convert the column you want to use to UTF8, with something like CONVERT(col_name USING utf8). The problem here is that MySQL tries to be dangerously helpful: it thinks col_name is storing latin1-encoded text, so it translates (not just reinterprets) each byte into its UTF8 equivalent, and you end up with double-encoded UTF8, which is obviously wrong.

How do you stop MySQL from being so nice and helpful? Just cast to BINARY before doing the conversion to UTF8! That way MySQL won't assume anything and will do exactly as asked: interpret this bunch of bits as UTF8. The exact syntax is CONVERT(CAST(col_name AS BINARY) USING utf8).
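
To make the difference concrete, here is a sketch with a hypothetical row whose title contains 'é' (stored as the UTF-8 bytes 0xC3 0xA9):

-- Without the cast, MySQL reads 0xC3 0xA9 as the latin1 characters 'Ã©'
-- and faithfully re-encodes them, producing double-encoded garbage:
SELECT CONVERT(page_title USING utf8) FROM page WHERE page_id = 1;

-- With the cast, MySQL reinterprets the raw bytes as UTF8, yielding 'é':
SELECT CONVERT(CAST(page_title AS BINARY) USING utf8) FROM page WHERE page_id = 1;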

So this is my final query now:

SELECT CONVERT(CAST(page_title AS BINARY) USING utf8)
FROM page
WHERE
    CONVERT(CAST(page_title AS BINARY) USING utf8)
        LIKE '%keyword_here%'
            COLLATE utf8_spanish_ci

Now if I search for something or sôMëthîNG or any other variation, I get all the results!

Please note that I used utf8_spanish_ci because I want the search to differentiate ñ from n but not á from a. Use a different collation according to your use case (here is a complete list).
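
You can also ask the server directly which collations are available, instead of hunting through the documentation:

SHOW COLLATION WHERE Charset = 'utf8';
-- lists utf8_general_ci, utf8_spanish_ci, utf8_unicode_ci, and so on.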


Case insensitive: you can simply let the database do the work for you (you already do, with the _ci collation).

Accents: in order to handle all accents, or at least all known accents, you could use two columns in your database. The first column stores the title as-is (meaning you store SomêthÏng), and you additionally create a second column, search_row, which in this case would contain the string something (without any accents). For the conversion you can write a function with replace rules.

Now you can convert the search string with the same conversion function before querying.

The last step is to add a trigger that fills/updates the search_row field every time you insert or update a title in the page table.
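
Here is a minimal sketch of that setup in MySQL. The search_row column, the fold_accents function, and its replace rules are all hypothetical (MediaWiki knows nothing about them, and altering its page table is at your own risk), so treat this as an illustration rather than a drop-in solution:

ALTER TABLE page ADD COLUMN search_row VARCHAR(255) CHARACTER SET utf8;

DELIMITER //

-- Hypothetical conversion function: lowercase, then apply replace rules.
-- Extend the rules to cover every accented character you expect in titles.
CREATE FUNCTION fold_accents(s VARCHAR(255) CHARACTER SET utf8)
RETURNS VARCHAR(255) CHARACTER SET utf8 DETERMINISTIC
BEGIN
  SET s = LOWER(s);
  SET s = REPLACE(s, 'á', 'a');
  SET s = REPLACE(s, 'é', 'e');
  SET s = REPLACE(s, 'í', 'i');
  SET s = REPLACE(s, 'ó', 'o');
  SET s = REPLACE(s, 'ú', 'u');
  RETURN s;
END//

-- Keep search_row in sync with page_title on insert and update.
CREATE TRIGGER page_search_ins BEFORE INSERT ON page FOR EACH ROW
  SET NEW.search_row = fold_accents(CONVERT(CAST(NEW.page_title AS BINARY) USING utf8))//

CREATE TRIGGER page_search_upd BEFORE UPDATE ON page FOR EACH ROW
  SET NEW.search_row = fold_accents(CONVERT(CAST(NEW.page_title AS BINARY) USING utf8))//

DELIMITER ;

A search then folds the keyword the same way:

SELECT * FROM page
WHERE search_row LIKE CONCAT('%', fold_accents('sôMëthîNG'), '%');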

This solution shouldn't have any negative impact on search performance either, since the normalization happens at write time rather than on every query!
