Let's say I have these page titles in my wiki (MediaWiki 1.19.4):
SOMETHIng
Sómethìng
SomêthÏng
SÒmetHínG
If a user searches for "something", I want all 4 pages to be returned as results.
At the moment the only thing I could think of is this query (MySQL Percona 5.5.30-30.2):
SELECT page_title
FROM page
WHERE page_title LIKE '%something%' COLLATE utf8_general_ci
This only returns SOMETHIng.
I must be on the right path, because if I search sóméthíng or SÓMÉTHÍNG, I get SOMETHIng as the result. How could I modify the query so I get the other results as expected? Performance is not critical here, since the page table contains only ~2K rows.
This is the table definition with the relevant bits:
CREATE TABLE page (
(...)
page_title VARCHAR(255) NOT NULL DEFAULT '' COLLATE latin1_bin,
(...)
UNIQUE INDEX name_title (page_namespace, page_title)
)
The table definition must not be modified: this is a stock MediaWiki installation, and AFAIK its code expects the field to be defined this way (i.e. Unicode stored as binary data).
The MediaWiki TitleKey extension is basically designed for this, but it only does case-folding. However, if you don't mind hacking it a bit, and have the PHP iconv extension installed, you could edit TitleKey_body.php and replace the method:
static function normalize( $text ) {
global $wgContLang;
return $wgContLang->caseFold( $text );
}
with e.g.:
static function normalize( $text ) {
return strtoupper( iconv( 'UTF-8', 'US-ASCII//TRANSLIT', $text ) );
}
and (re)run rebuildTitleKeys.php.
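To see what this replacement normalize() is doing, here is a rough Python sketch of the same idea; NFKD decomposition followed by dropping non-ASCII bytes is only an approximation of iconv's //TRANSLIT, but it behaves the same way for the accented titles in this question:

```python
import unicodedata

def normalize(text: str) -> str:
    """Approximate strtoupper(iconv('UTF-8', 'US-ASCII//TRANSLIT', text)):
    decompose accented characters, drop the combining marks,
    then uppercase the plain-ASCII result."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.upper()

for title in ["SOMETHIng", "Sómethìng", "SomêthÏng", "SÒmetHínG"]:
    print(normalize(title))  # every title normalizes to SOMETHING
```

All four example titles collapse to the same key, which is why a lookup on the normalized column matches every variant.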
The TitleKey extension stores its normalized titles in a separate table, surprisingly named titlekey. It's intended to be accessed through the MediaWiki search interface, but if you want, you can certainly query it directly too, e.g. like this:
SELECT page.* FROM page
JOIN titlekey ON tk_page = page_id
WHERE tk_namespace = 0 AND tk_key = 'SOMETHING';
I found the perfect solution, with no modifying or creating tables. It might have performance implications (I didn't test), but as I stated in my question, it's a ~2K-row table, so it shouldn't matter much.
The root of the problem is that MediaWiki stores UTF-8-encoded text in latin1-encoded tables. This doesn't matter much to MediaWiki itself: it's aware of the situation, always queries the database with the correct charset, and does its thing, essentially using MySQL as a dumb bit container. It does this because apparently UTF-8 support in MySQL is not adequate for its needs (see the comments on the $wgDBmysql5 variable in MediaWiki's DefaultSettings.php).
The problem appears when you want the database itself to be able to perform UTF8-aware operations (like I wanted to do in my question). You won't be able to do that because as far as MySQL knows, it's not storing UTF8-encoded text (although it is, as explained in the previous paragraph).
There's an obvious candidate solution: cast the column to UTF8, something like CONVERT(col_name USING utf8). The problem is that MySQL tries to be dangerously helpful: it thinks col_name is storing latin1-encoded text, so it translates (rather than merely reinterprets) each byte into its UTF-8 equivalent, and you end up with double-encoded UTF-8, which is obviously wrong.
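The double-encoding failure mode is easy to reproduce outside MySQL. This small Python sketch simulates both conversions: decoding the stored UTF-8 bytes as latin1 (what the plain CONVERT does) versus decoding them as the UTF-8 they really are:

```python
# The page_title column physically holds UTF-8 bytes, but MySQL
# believes they are latin1.
title = "Sómethìng"
stored_bytes = title.encode("utf-8")   # what's actually on disk

# Plain CONVERT(col USING utf8): MySQL first decodes the bytes as
# latin1, then re-encodes to UTF-8 -> classic double-encoded mojibake.
mojibake = stored_bytes.decode("latin-1")
print(mojibake)    # SÃ³methÃ¬ng

# CONVERT(CAST(col AS BINARY) USING utf8): treat the raw bytes as
# UTF-8 directly, recovering the original text.
recovered = stored_bytes.decode("utf-8")
print(recovered)   # Sómethìng
```

The BINARY cast is what strips MySQL's (wrong) belief about the column's charset, so the subsequent conversion reads the bytes as-is.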
How do you stop MySQL from being so nice and helpful? Cast to BINARY before doing the conversion to UTF8! That way MySQL won't assume anything and will do exactly as asked: interpret this bunch of bytes as UTF-8. The exact syntax is CONVERT(CAST(col_name AS BINARY) USING utf8).
So this is my final query now:
SELECT CONVERT(CAST(page_title AS BINARY) USING utf8)
FROM page
WHERE
CONVERT(CAST(page_title AS BINARY) USING utf8)
LIKE '%keyword_here%'
COLLATE utf8_spanish_ci
Now if I search something or sôMëthîNG or any variation, I get all the results!
Please note that I used utf8_spanish_ci because I want the search to differentiate ñ from n, but not á from a. Use a different collation according to your use case (here is a complete list).
Case insensitive: you can simply let the database do the work for you (you already do with _ci)
Accents: in order to cover all accents (or at least all known accents), you could use two columns in your database. The first column stores the title as it is (meaning you store SomêthÏng), and additionally you create a second column, search_row, which in this case would contain the string something (without any accents). For the conversion you can write a function with replace rules.
Now you can convert the search string using the same conversion function. The last step is to add a trigger that fills/updates the search_row field every time you insert or update a title in the page table.
This solution wouldn't have any negative impact on the performance either!
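The "replace rules" conversion function described above could be sketched like this in Python; the rule table here is illustrative, not exhaustive, and a real deployment would need rules for every accented character it expects:

```python
# Map common accented vowels to their plain equivalents.
# This table is an illustrative subset, not a complete rule set.
RULES = str.maketrans("áàâäéèêëíìîïóòôöúùûü",
                      "aaaaeeeeiiiioooouuuu")

def to_search_row(title: str) -> str:
    """Build the normalized value for the search_row column:
    lowercase the title, then apply the accent-replace rules."""
    return title.lower().translate(RULES)

print(to_search_row("SomêthÏng"))  # something
```

The same function would be applied to the user's search string before comparing it against search_row, so both sides of the comparison are normalized identically.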
Source: https://stackoverflow.com/questions/16014167/how-to-do-an-accent-and-case-insensitive-search-in-mediawiki-database