Using collation xxx_german2_ci, which treats ü and ue as identical, is it possible to have all occurrences of München be hi
In the end I decided to do it all in PHP, hence my question about which characters utf8_general_ci treats as equal.
Below is what I came up with, by example: a label is built from a text
$description, with every occurrence of the substring $term highlighted and
special characters HTML-encoded. The accent substitution table is not complete,
but it is probably sufficient for the actual use case.
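mb_internal_encoding("UTF-8");

// Replace accented letters with a plain ASCII letter. The work is done on an
// ISO-8859-1 copy so that strtr() sees exactly one byte per character; the
// result is returned in ISO-8859-1. (utf8_decode() is deprecated as of PHP 8.2,
// so mb_convert_encoding() is used instead.)
function withoutAccents($s) {
    return strtr(mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8'),
        mb_convert_encoding('àáâãäçèéêëìíîïñòóôõöùúûüýÿß', 'ISO-8859-1', 'UTF-8'),
        'aaaaaceeeeiiiinooooouuuuyys');
}

// Lowercase with mb_strtolower() so that accented capitals (Ü, Ã, ...) are
// folded too; strtolower() only handles ASCII reliably on UTF-8 input.
function simplified($s) {
    return withoutAccents(mb_strtolower($s));
}

// Cut by character position from the original UTF-8 text, then HTML-encode.
function encodedSubstr($s, $start, $length) {
    return htmlspecialchars(mb_substr($s, $start, $length));
}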
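function labelFromDescription($description, $term) {
    $simpleTerm = simplified($term);
    $simpleDescription = simplified($description);
    $lastEndPos = $pos = 0;
    // The simplified strings are single-byte, so strlen()/strpos() count
    // characters and the offsets can be reused with mb_substr() on the original.
    $termLen = strlen($simpleTerm);
    if ($termLen === 0) {
        return htmlspecialchars($description); // nothing to highlight
    }
    $label = ''; // HTML
    while (($pos = strpos($simpleDescription,
            $simpleTerm, $lastEndPos)) !== false) {
        $label .=
            encodedSubstr($description, $lastEndPos, $pos - $lastEndPos).
            '<b>'. // opening highlight tag (<b> is assumed)
            encodedSubstr($description, $pos, $termLen).
            '</b>';
        $lastEndPos = $pos + $termLen;
    }
    $label .= encodedSubstr($description, $lastEndPos,
        mb_strlen($description) - $lastEndPos);
    return $label;
}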
echo labelFromDescription('São Paulo <SAO>', 'SAO')."\n";
echo labelFromDescription('München <MUC>', 'ünc');
Output, as rendered in a browser:
São Paulo <SAO>
München <MUC>
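
The table above replaces each character with exactly one other character, so it cannot express the german2-style rule from the question, where ü and ue compare equal. One possible way around that, shown here only as a minimal sketch, is to build the folded comparison string while recording which original character each folded character came from, and then map match positions back. The names `foldGerman2` and `highlightGerman2`, the four-entry mapping, and the `<b>` tag are assumptions made for illustration; the mapping is not taken from MySQL's actual collation tables. It assumes PHP 7.4+ with mbstring and a UTF-8 source file.

```php
<?php
mb_internal_encoding('UTF-8');

// Fold a string for comparison (lowercase, ä→ae, ö→oe, ü→ue, ß→ss) and
// return [folded string, origin map], where origin[i] is the index of the
// original character that produced folded character i.
function foldGerman2(string $s): array {
    static $map = ['ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss']; // assumed table
    $folded = '';
    $origin = [];
    foreach (mb_str_split($s) as $i => $ch) {
        $lower = mb_strtolower($ch);
        $out = $map[$lower] ?? $lower;
        $folded .= $out;
        // One origin entry per folded character, all pointing at $i.
        for ($k = mb_strlen($out); $k > 0; $k--) {
            $origin[] = $i;
        }
    }
    return [$folded, $origin];
}

// Return HTML in which every match of $term in $description is wrapped in <b>.
function highlightGerman2(string $description, string $term): string {
    [$hay, $origin] = foldGerman2($description);
    [$needle] = foldGerman2($term);
    $needleLen = mb_strlen($needle);
    if ($needleLen === 0) {
        return htmlspecialchars($description); // nothing to highlight
    }
    $html = '';
    $copied = 0; // original characters already written to $html
    $from = 0;   // search offset in the folded string
    while (($pos = mb_strpos($hay, $needle, $from)) !== false) {
        // Map the folded match back to a range of original characters.
        $start = max($origin[$pos], $copied);
        $end = $origin[$pos + $needleLen - 1] + 1; // exclusive
        $from = $pos + $needleLen;
        if ($end <= $start) {
            continue; // match lies inside a character that was already emitted
        }
        $html .= htmlspecialchars(mb_substr($description, $copied, $start - $copied))
            . '<b>' . htmlspecialchars(mb_substr($description, $start, $end - $start)) . '</b>';
        $copied = $end;
    }
    return $html . htmlspecialchars(mb_substr($description, $copied));
}
```

With this sketch, `highlightGerman2('München <MUC>', 'MUENC')` gives `<b>Münc</b>hen &lt;MUC&gt;`. A match that begins or ends in the middle of an expanded character (for example searching for `e` where the text has `ü`) highlights the whole original character, which may or may not be what the collation-based query would return.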