How to do an accent- and case-insensitive search in the MediaWiki database?


The MediaWiki TitleKey extension is basically designed for this, but it only does case-folding. However, if you don't mind hacking it a bit, and have the PHP iconv extension installed, you could edit TitleKey_body.php and replace the method:

static function normalize( $text ) {
    global $wgContLang;
    return $wgContLang->caseFold( $text );
}

with e.g.:

static function normalize( $text ) {
    // Transliterate to plain ASCII (stripping accents), then uppercase.
    return strtoupper( iconv( 'UTF-8', 'US-ASCII//TRANSLIT', $text ) );
}

and (re)run rebuildTitleKeys.php to regenerate the stored keys.

The TitleKey extension stores its normalized titles in a separate table, unsurprisingly named titlekey. It's intended to be accessed through the MediaWiki search interface, but if you want, you can certainly query it directly too, e.g. like this:

SELECT page.* FROM page
  JOIN titlekey ON tk_page = page_id
WHERE tk_namespace = 0 AND tk_key = 'SOMETHING';
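
Assuming the iconv-based normalize() above, keys are accent-stripped and uppercased before storage, so one key matches every variant of a title. For instance, a page hypothetically titled 'Ñandú' would be stored under the key 'NANDU':

SELECT page.* FROM page
  JOIN titlekey ON tk_page = page_id
WHERE tk_namespace = 0 AND tk_key = 'NANDU';
-- matches 'Ñandú', 'nandu', 'ÑANDÛ', ...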

I found the perfect solution, with no modifying or creating of tables. It might have performance implications (I didn't test), but as I stated in my question, it's a table of only ~2K rows, so it shouldn't matter much.

The root of the problem is that MediaWiki stores UTF8-encoded text in tables declared as latin1. It doesn't matter much to MediaWiki, since it's aware of this and will always query the database with the correct charset and do its thing, essentially using MySQL as a dumb bit container. It does this because apparently UTF8 support in MySQL is not adequate for its needs (see the comments in MediaWiki's DefaultSettings.php around the variable $wgDBmysql5).
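
You can see the mismatch for yourself; this is just a diagnostic, and the exact output depends on your MediaWiki version and configuration:

SHOW CREATE TABLE page;
-- The page_title column is declared as latin1 or binary, not utf8.

SELECT page_title, HEX(page_title) FROM page LIMIT 5;
-- Accented titles show multi-byte UTF-8 sequences (e.g. C3A9 for 'é').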

The problem appears when you want the database itself to be able to perform UTF8-aware operations (like I wanted to do in my question). You won't be able to do that because as far as MySQL knows, it's not storing UTF8-encoded text (although it is, as explained in the previous paragraph).

There's an obvious solution for this: convert the column you want to use to UTF8, with something like CONVERT(col_name USING utf8). The problem here is that MySQL tries to be dangerously helpful: it thinks col_name is storing latin1-encoded text, so it translates (not just reinterprets) each byte into its UTF8 equivalent, and you end up with double-encoded UTF8, which is obviously wrong.

How do you stop MySQL from being so nice and helpful? Just cast to BINARY before doing the conversion to UTF8! That way MySQL won't assume anything and will do exactly as asked: interpret this bunch of bits as UTF8. The exact syntax is CONVERT(CAST(col_name AS BINARY) USING utf8).
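
To make the difference concrete, here is a sketch with a hypothetical row whose title contains 'é' (stored as the UTF-8 bytes 0xC3 0xA9):

-- Without the cast, MySQL reads 0xC3 0xA9 as the latin1 characters 'Ã©'
-- and faithfully re-encodes them, producing double-encoded garbage:
SELECT CONVERT(page_title USING utf8) FROM page WHERE page_id = 1;

-- With the cast, MySQL reinterprets the raw bytes as UTF8, yielding 'é':
SELECT CONVERT(CAST(page_title AS BINARY) USING utf8) FROM page WHERE page_id = 1;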

So this is my final query now:

SELECT CONVERT(CAST(page_title AS BINARY) USING utf8)
FROM page
WHERE
    CONVERT(CAST(page_title AS BINARY) USING utf8)
        LIKE '%keyword_here%'
            COLLATE utf8_spanish_ci

Now if I search for something or sôMëthîNG or any other variation, I get all the results!

Please note that I used utf8_spanish_ci because I want the search to differentiate ñ from n but not á from a. Use a different collation according to your use case (here is a complete list).
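
You can also ask the server directly which collations are available, instead of hunting through the documentation:

SHOW COLLATION WHERE Charset = 'utf8';
-- lists utf8_general_ci, utf8_spanish_ci, utf8_unicode_ci, and so on.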


Case insensitive: you can simply let the database do the work for you (you already do, with the _ci collation).

Accents: in order to handle all accents, or at least all known accents, you could use two columns in your database. The first column stores the title as-is (meaning you store SomêthÏng), and you additionally create a second column, search_row, which in this case would contain the string something (without any accents). For the conversion you can write a function with replace rules.

Now you can convert the search string with the same conversion function before querying.

The last step is to add a trigger that fills/updates the search_row field every time you insert or update a title in the page table.
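
Here is a minimal sketch of that setup in MySQL. The search_row column, the fold_accents function, and its replace rules are all hypothetical (MediaWiki knows nothing about them, and altering its page table is at your own risk), so treat this as an illustration rather than a drop-in solution:

ALTER TABLE page ADD COLUMN search_row VARCHAR(255) CHARACTER SET utf8;

DELIMITER //

-- Hypothetical conversion function: lowercase, then apply replace rules.
-- Extend the rules to cover every accented character you expect in titles.
CREATE FUNCTION fold_accents(s VARCHAR(255) CHARACTER SET utf8)
RETURNS VARCHAR(255) CHARACTER SET utf8 DETERMINISTIC
BEGIN
  SET s = LOWER(s);
  SET s = REPLACE(s, 'á', 'a');
  SET s = REPLACE(s, 'é', 'e');
  SET s = REPLACE(s, 'í', 'i');
  SET s = REPLACE(s, 'ó', 'o');
  SET s = REPLACE(s, 'ú', 'u');
  RETURN s;
END//

-- Keep search_row in sync with page_title on insert and update.
CREATE TRIGGER page_search_ins BEFORE INSERT ON page FOR EACH ROW
  SET NEW.search_row = fold_accents(CONVERT(CAST(NEW.page_title AS BINARY) USING utf8))//

CREATE TRIGGER page_search_upd BEFORE UPDATE ON page FOR EACH ROW
  SET NEW.search_row = fold_accents(CONVERT(CAST(NEW.page_title AS BINARY) USING utf8))//

DELIMITER ;

A search then folds the keyword the same way:

SELECT * FROM page
WHERE search_row LIKE CONCAT('%', fold_accents('sôMëthîNG'), '%');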

This solution shouldn't have any negative impact on search performance either, since the normalization happens at write time rather than on every query!
