How to convert text with HTML entities and invalid characters to its UTF-8 equivalent?

I am changing the title because I was unaware of special broken windows characters that caused me problems, making the question look like a duplicate.

How to convert HTML entities, character references of type &#[0-9]+; and &#x[a-fA-F0-9]+;, invalid character references and invalid Windows characters such as chr(151) to their UTF-8 equivalents?

Basically how to clean up some very bad text of variable encoding and save it as UTF-8?

original question below

Convert &#[0-9]+; and &#x[a-fA-F0-9]+; references to UTF-8 equivalents?

for example

&#151;
&#x97;

to —

like a browser does it, but with PHP.

edit: even the non-standard ones that Windows introduced but browsers still display.

Timo Huovinen

Answering my own question with the solution that I used in the end

The problem:

I needed to replace HTML entities and decimal and hexadecimal character references that looked like &#130; and &#x82; and &mdash; with their UTF-8 equivalents, like a normal browser would, and convert the text to UTF-8.

The problem was that the references were often in the range 130-159 (x82-x9F). As thirtydot found out, these are invalid Windows (Word) characters from the Windows-1252 code page that people use in otherwise ASCII text for special characters like em dashes, and they are not supported by PHP's html_entity_decode.

You would think that these invalid characters would not work in browsers, but it looks like browsers made a silent undocumented agreement to fix these characters and display them properly anyway.

While trying to fix these references I also found that the raw characters, like <?php echo chr(151);?>, were being used as well. They were probably copied directly from Word and would cause all sorts of problems, so I needed to fix them too.

What most answers that I found regarding encodings fail to mention is that the solution to encoding related problems often largely depends on the encoding used. Here is an example:

The invalid Windows character chr(151) will work with "ISO-8859-1" encoded text, and Josh B mentions, as per Jukka Korpela's suggestion, that you should fix them like this:

$str = str_replace(chr(151),'--',$str);

This replaces the Windows character with a safe ASCII alternative, but knowing that the text would be stored in UTF-8, I did not want to lose the original characters. Replacing them directly like this was not an option either, because chr() only produces a single byte and cannot create the proper Unicode character:

$str = str_replace(chr(151),chr(8212),$str);

So what I did instead was to first replace the character with its HTML character reference (while $str was still "ISO-8859-1" encoded):

$str = str_replace(chr(151),'&#8212;',$str);

Then I change the encoding:

$str = iconv('ISO-8859-1', 'UTF-8//IGNORE', $str);//convert to UTF-8

And finally I turn all the entities and character references into pure UTF-8 with my "html_character_reference_decode" function, which is largely based on Gumbo's solution. It also fixes the bad Windows references, but only uses preg_replace_callback to go over the bad Windows characters.

// Callback: rewrite a Windows reference (130-159 or x82-x9F) as the proper Unicode reference.
function fix_char_mapping($match){
    if (strtolower($match[1][0]) === "x") {
        $codepoint = intval(substr($match[1], 1), 16);
    } else {
        $codepoint = intval($match[1], 10);
    }
    // Unicode code points for 130..159, in order; 141-144, 157 and 158 are left unchanged
    $mapping = array(8218,402,8222,8230,8224,8225,710,8240,352,8249,338,141,142,143,144,8216,8217,8220,8221,8226,8211,8212,732,8482,353,8250,339,157,158,376);
    $codepoint = $mapping[$codepoint-130];
    return '&#'.$codepoint.';';
}
function html_character_reference_decode($string, $encoding='UTF-8', $fixMappingBug=true){
    if($fixMappingBug){
        $string = preg_replace_callback('/&#(1[3-5][0-9]|x8[2-9a-f]|x9[0-9a-f]);/i','fix_char_mapping',$string);
    }
    // use the requested target encoding instead of a hard-coded 'UTF-8'
    return html_entity_decode($string, ENT_QUOTES, $encoding);
}
header('Content-Type: text/plain; charset=UTF-8');
echo html_character_reference_decode('dash &#151; and another dash &#x97; text &#x5D5; and more tests &#x5E0;&#x5D5;&#x5E3; ');
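As a possible variation on the mapping array above (this is a sketch, not part of the original answer), the replacement characters can be looked up in the Windows-1252 table instead of being listed by hand. It assumes the mbstring extension is available; fix_cp1252_references is a hypothetical name, and the result for byte values that Windows-1252 leaves undefined (such as 141 or 157) depends on mbstring.

```php
<?php
// Hypothetical helper: replace &#130;..&#159; and &#x82;..&#x9F; with the
// UTF-8 character that Windows-1252 assigns to that byte value.
function fix_cp1252_references(string $s): string {
    $result = preg_replace_callback(
        '/&#(1[3-5][0-9]|x8[2-9a-f]|x9[0-9a-f]);/i',
        function (array $m): string {
            // Parse the reference as hexadecimal or decimal.
            $n = (strtolower($m[1][0]) === 'x')
                ? (int) hexdec(substr($m[1], 1))
                : (int) $m[1];
            // Treat the number as a Windows-1252 byte and convert it to UTF-8.
            return mb_convert_encoding(chr($n), 'UTF-8', 'Windows-1252');
        },
        $s
    );
    return $result ?? $s; // fall back to the input if PCRE reports an error
}

echo bin2hex(fix_cp1252_references('&#151;')), "\n"; // e28094
```

Unlike fix_char_mapping, this writes UTF-8 bytes directly, so it only makes sense once the surrounding text is already UTF-8.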

So if your text is "ISO-8859-1" encoded, the complete solution looks like this:

<?php
header('Content-Type: text/plain; charset=utf-8');
ini_set("default_charset", 'utf-8');
error_reporting(-1);
//requires fix_char_mapping() and html_character_reference_decode() from above
$encoding = 'ISO-8859-1';//put encoding here
$str = '&#x9F; &#x9C; bad&#150;string: '.chr(151);//ASCII text plus one raw windows byte
if($encoding==='ISO-8859-1'){
    //fix bad windows characters: raw byte => numeric character reference
    $badchars = array(
        '&#130;'=>chr(130),//',' baseline single quote
        '&#131;'=>chr(131),//'NLG' florin
        '&#132;'=>chr(132),//'"' baseline double quote
        '&#133;'=>chr(133),//'...' ellipsis
        '&#134;'=>chr(134),//'**' dagger (a second footnote)
        '&#135;'=>chr(135),//'***' double dagger (a third footnote)
        '&#136;'=>chr(136),//'^' circumflex accent
        '&#137;'=>chr(137),//'o/oo' per mille
        '&#138;'=>chr(138),//'Sh' S Hacek
        '&#139;'=>chr(139),//'<' left single guillemet
        '&#140;'=>chr(140),//'OE' OE ligature
        '&#145;'=>chr(145),//"'" left single quote
        '&#146;'=>chr(146),//"'" right single quote
        '&#147;'=>chr(147),//'"' left double quote
        '&#148;'=>chr(148),//'"' right double quote
        '&#149;'=>chr(149),//'-' bullet
        '&#150;'=>chr(150),//'-' endash
        '&#151;'=>chr(151),//'--' emdash
        '&#152;'=>chr(152),//'~' tilde accent
        '&#153;'=>chr(153),//'(TM)' trademark ligature
        '&#154;'=>chr(154),//'sh' s Hacek
        '&#155;'=>chr(155),//'>' right single guillemet
        '&#156;'=>chr(156),//'oe' oe ligature
        '&#159;'=>chr(159),//'Y' Y Dieresis
    );
    $str = str_replace(array_values($badchars),array_keys($badchars),$str);
    $str = iconv('ISO-8859-1', 'UTF-8//IGNORE', $str);//convert to UTF-8
    $str = html_character_reference_decode($str);//fixes bad entities above
    echo $str;die;
}

It was tested with a wide range of situations and looks like it works.
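If the input is really Windows-1252 that was only labelled ISO-8859-1 (an assumption about the data, not something stated above), a shorter route for the raw bytes may be to name Windows-1252 as the source encoding when converting. The sketch below uses mbstring's mb_convert_encoding for that. Numeric references such as &#151; are not touched by this step and would still need html_character_reference_decode.

```php
<?php
// Assumption: the bytes are Windows-1252, even if the text is labelled ISO-8859-1.
$str = 'bad string: ' . chr(147) . 'quoted' . chr(148) . ' ' . chr(151) . ' &#151;';

// Bytes in the 0x80-0x9F range are mapped to their real Unicode characters in one step.
$utf8 = mb_convert_encoding($str, 'UTF-8', 'Windows-1252');

var_dump(mb_check_encoding($utf8, 'UTF-8')); // bool(true)
echo bin2hex(substr($utf8, -10, 3)), "\n";   // e28094, the converted chr(151)
```

The trade-off is that no intermediate reference step is needed for the raw bytes, but the result is only right when the Windows-1252 assumption holds.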

Let's look at the same situation with UTF-8 encoded text that contains bad Windows characters.

One reliable way to test for the presence of bad characters, or "badly formed UTF-8", was to use iconv. It is slow, but it was more reliable than preg_match in my tests:

$cleaned = iconv('UTF-8','UTF-8//IGNORE',$str);
if ($cleaned!==$str){
    //contains bad characters, use cleaned version where the bad characters were stripped
    $str = $cleaned;
}
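For comparison with the iconv check above, and not something the original answer uses, mbstring offers mb_check_encoding, and preg_match with the /u modifier reports an error on malformed UTF-8. A minimal sketch, assuming mbstring is installed; is_valid_utf8 is a hypothetical helper name.

```php
<?php
// Hypothetical helper: true when $s is well-formed UTF-8.
function is_valid_utf8(string $s): bool {
    return mb_check_encoding($s, 'UTF-8');
}

$good = "\xE2\x80\x94";            // em dash as valid UTF-8
$bad  = "\xE2\x80\x94" . chr(151); // valid em dash followed by a stray Windows byte

var_dump(is_valid_utf8($good)); // bool(true)
var_dump(is_valid_utf8($bad));  // bool(false)

// Under /u, preg_match returns false (an error, not 0) for a malformed subject.
var_dump(preg_match('//u', $bad)); // bool(false)
```

Both checks only detect the problem; neither says which bytes are bad or what they were meant to be.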

This was pretty much the best I could think of, as I found no reasonable way to find and replace the bad Windows characters in UTF-8 text. Let me explain why.

Let's take a string with a perfectly valid Unicode em dash followed by a bad Windows em dash: $str = "—".chr(151);

I don't know what bad windows characters might be present in the UTF-8 string, only that they might be present.

Using str_replace to try to fix the bad Windows character chr(148) (right double quote) in the above string, which does not even contain any double quotes, results in a scrambled character. At first I thought that str_replace might not be multibyte safe and tried mb_eregi_replace, but the problem was the same.

The comments on the php website and stackoverflow mention that str_replace is binary safe, and works fine with well formed UTF-8 text, because of the way that UTF-8 was designed.

Why it breaks

It turns out that the bad Windows character chr(148) is the single byte "10010100", while the em dash character (http://www.fileformat.info/info/unicode/char/2014/index.htm) is, according to the fileformat website, made up of 3 bytes: "11100010:10000000:10010100"

Notice that the bits in the last byte of the perfectly valid UTF-8 character match the bits of the bad Windows right double quote, so str_replace just replaces the last byte, breaking the UTF-8 character. This problem happens with lots of Unicode characters, and would scramble lots of characters in Russian text, for example.

This can't happen with single-byte text such as ASCII or ISO-8859-1, because each character is always made up of a single byte.
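The collision described above can be reproduced in a few lines. This is only an illustrative sketch. It assumes mbstring for the validity check and writes the em dash as explicit bytes so that the encoding of the source file does not matter.

```php
<?php
$emdash = "\xE2\x80\x94"; // U+2014 encoded as UTF-8: e2 80 94
var_dump(mb_check_encoding($emdash, 'UTF-8')); // bool(true)

// Try to "fix" a Windows right double quote, chr(148) = 0x94, that is not even present.
$broken = str_replace(chr(148), '"', $emdash);

echo bin2hex($broken), "\n";                   // e28022
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
```

The last byte of the em dash was swapped for an ASCII quote, so the remaining two bytes no longer form a complete UTF-8 sequence.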

So when you get a UTF-8 string that contains any number of multibyte characters, you can no longer safely fix the bad Windows characters with a plain byte replacement, and the only solution I found was to strip them with iconv:

$str = iconv('UTF-8', 'UTF-8//IGNORE', $str);
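Stripping loses the characters. As a possible alternative that is not part of the original answer, the string can be walked with a pattern that consumes well-formed UTF-8 sequences first, so that only bytes outside any valid sequence reach the repair step. The sketch below assumes mbstring, treats every stray byte as Windows-1252 (a guess about the data), and uses the hypothetical name fix_stray_cp1252_bytes.

```php
<?php
// Hypothetical helper: keep valid UTF-8 sequences, reinterpret stray bytes as Windows-1252.
function fix_stray_cp1252_bytes(string $s): string {
    // Each alternative is one well-formed UTF-8 sequence; the final group
    // catches any single byte that did not start a valid sequence.
    $pattern = '/
          [\x00-\x7F]
        | [\xC2-\xDF][\x80-\xBF]
        | \xE0[\xA0-\xBF][\x80-\xBF]
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
        | \xED[\x80-\x9F][\x80-\xBF]
        | \xF0[\x90-\xBF][\x80-\xBF]{2}
        | [\xF1-\xF3][\x80-\xBF]{3}
        | \xF4[\x80-\x8F][\x80-\xBF]{2}
        | ([\x80-\xFF])
    /x';
    $result = preg_replace_callback($pattern, function (array $m): string {
        if (isset($m[1]) && $m[1] !== '') {
            // Stray byte: assume it is Windows-1252 and convert it.
            return mb_convert_encoding($m[1], 'UTF-8', 'Windows-1252');
        }
        return $m[0]; // valid sequence, keep unchanged
    }, $s);
    return $result ?? $s; // fall back to the input if PCRE reports an error
}

$str = "\xE2\x80\x94" . chr(151); // valid em dash plus a bad Windows em dash
echo bin2hex(fix_stray_cp1252_bytes($str)), "\n"; // e28094e28094
```

Because a valid multibyte sequence is always matched as a whole, its continuation bytes are never mistaken for bad Windows characters.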

The only solution that I can think of

Although you can always replace the valid Unicode characters that contain a byte of the bad characters with their encoded counterparts, then replace the bad characters, and then decode the good characters, thus keeping everything :)

like this: