I'm looking for a general strategy/advice on how to handle invalid UTF-8 input from users.
Even though my webapp uses UTF-8, somehow some users enter invalid chara
I recommend simply not letting garbage in. Don't rely on custom functions, which can bog your system down. Instead, design an alphabet of acceptable characters and walk the submitted data against it, byte by byte, as if the string were an array. Push acceptable characters onto a new string and omit the unacceptable ones. What you store in your database is then data triggered by the user, but not the raw user-supplied data.
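The walk described above can be sketched in a few lines. This is a minimal illustration, not the full script further down: `filter_by_alphabet` and the sample alphabet are hypothetical, chosen only for the example. The comparison is per byte, so a multi-byte UTF-8 character only survives if every one of its bytes is in the alphabet.

```php
<?php
// Hypothetical helper: keep only the bytes that appear in the accepted alphabet.
function filter_by_alphabet(string $input, string $alphabet): string
{
    // Flip the alphabet into a lookup table so each byte check is a cheap isset().
    $allowed = array_flip(str_split($alphabet));
    $clean = '';
    $length = strlen($input);
    for ($i = 0; $i < $length; $i++) {
        $byte = $input[$i];
        if (isset($allowed[$byte])) {
            $clean .= $byte; // accepted: copy to the new string
        }
        // rejected bytes are simply omitted
    }
    return $clean;
}

// Assumed alphabet for the example: lowercase letters, digits and space.
$alphabet = 'abcdefghijklmnopqrstuvwxyz0123456789 ';
echo filter_by_alphabet("hello\xA0 world\x7F 42", $alphabet), "\n"; // prints "hello world 42"
```

A lookup table is used here instead of `in_array` so the cost per byte does not grow with the size of the alphabet.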
EDIT #4: Replacing bad characters with an entity: �
EDIT #3: Updated Sept 22 2010 @ 1:32pm. Reason: the returned string is now UTF-8, and I used the test file you provided as proof.
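<?php
// NOTE: the top of this listing, where $alpha (the array of accepted characters) is built, is missing.
// Placeholder assumption so the script runs: tab, LF, CR and printable ASCII (32-126), one byte per entry.
$alpha = array_map('chr', array_merge(array(9, 10, 13), range(32, 126)));

// Debug loop: dump the byte value of every accepted character.
// foreach ($alpha as $key => $val) {
//     print ord($val);
//     print "\n";
// }
// print "\n";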
//
// // test case #1
//
// $str = 'afsjdfhasjhdgljhasdlfy42we875y342q8957y2wkjrgSAHKDJgfcv kzXnxbnSXbcv '.chr(160).chr(127).chr(126);
//
// $string = teststr($alpha, $str);
// print $string;
// print "\n";
//
// // test case #2
//
// $str = ''.'©?™???';
// $string = teststr($alpha, $str);
// print $string;
// print "\n";
//
// $str = '©';
// $string = teststr($alpha, $str);
// print $string;
// print "\n";
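// Run the filter against the UTF-8 decoder stress test file.
// Reading a URL this way requires allow_url_fopen to be enabled.
$file = 'http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt';
$testfile = file_get_contents($file); // read the raw bytes without touching line endings
if ($testfile === false) {
    die('Could not read the test file.');
}
$string = teststr($alpha, $testfile);
print $string;
print "\n";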
// Walk $str byte by byte and keep only the bytes found in $alpha.
// Rejected bytes are replaced with the replacement character.
function teststr($alpha, $str) {
    $strlen = strlen($str);
    $newstr = ''; // filtered output
    $x = 0;       // number of bytes examined

    if ($strlen >= 2) {
        for ($i = 0; $i < $strlen; $i++) {
            $x++;
            // strict comparison; assumes $alpha holds one-byte strings
            if (in_array($str[$i], $alpha, true)) {
                // passed
                $newstr .= $str[$i];
            } else {
                // failed
                print 'Found out of scope character. (ASCII: ' . ord($str[$i]) . ')';
                print "\n";
                $newstr .= '�';
            }
        }
    } elseif ($strlen === 1) {
        $x++;
        if (in_array($str, $alpha, true)) {
            // passed
            $newstr = $str;
        } else {
            // failed
            print 'Total character failed to qualify.';
            $newstr = '�';
        }
    } else {
        // empty input: failed to qualify for test
        print 'Non-existent.';
    }

    // If the result is not valid UTF-8, treat it as ISO-8859-1 and convert it.
    if (!mb_check_encoding($newstr, 'UTF-8')) {
        $newstr = mb_convert_encoding($newstr, 'UTF-8', 'ISO-8859-1');
    }

    // test encoding:
    if (mb_check_encoding($newstr, 'UTF-8')) {
        print 'UTF-8 :D' . "\n";
    } else {
        print 'ENCODED: ' . mb_detect_encoding($newstr, 'UTF-8') . "\n";
    }

    return $newstr . ' (scope: ' . $x . ', ' . $strlen . ')';
}
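The final encoding check in `teststr` can also be tried in isolation. The following is a minimal sketch, assuming the mbstring extension is loaded; `ensure_utf8` is a hypothetical name. It treats anything that is not valid UTF-8 as ISO-8859-1, which is the same assumption `utf8_encode` makes (that function is deprecated as of PHP 8.2).

```php
<?php
// Hypothetical helper: return $value unchanged if it is already valid UTF-8,
// otherwise convert it on the assumption that it is ISO-8859-1.
function ensure_utf8(string $value): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    return mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "caf\xE9";            // "café" in ISO-8859-1; not valid UTF-8
$fixed  = ensure_utf8($latin1); // é becomes the two bytes C3 A9

var_dump(mb_check_encoding($fixed, 'UTF-8')); // bool(true)
var_dump($fixed === "caf\xC3\xA9");           // bool(true)
```

If the input is not really ISO-8859-1, the result will still be valid UTF-8 but may contain the wrong characters, so this step is a last resort rather than a substitute for filtering.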