问题
Similar to this question I am consuming an XML product that has some illegal chars in it. I seriously doubt I can get them to fix the problem, but I will try. In the meantime I'd like a work-around.
The problem is that it contains a bullet. It renders as "•" in my source. I've tried a few encoding conversions but have not found a combination that works. (I'm not accustomed to even thinking about my encoding type, so I'm out of my element here.) So, I tried the below and it seems that str_replace does not recognize the "•". (it renders as tall block in my text editor) You can see the commented lines where I tried a few different things.
I tried str replace on "•" first, then tweaked around and this is my latest:
// deal with bullets in XML.
$bullet="•"; //this was copied and pasted from transliterated text.
//$data=iconv( "UTF-8", "windows-1252//TRANSLIT", $data ); //transliterate the text:
//$data=str_replace($bullet,'•',$data); // replace the bullet char
$data=str_replace($bullet,' - ',$data); // replace the bullet char
//$data=iconv( "windows-1252", "UTF-8", $data ); // return the text to utf-8 encoding.
Any ideas how to strip or replace this char? If there's a function to pre-clean the XML, that'd be great, and I wouldn't have to worry about it.
回答1:
XML by definition has no illegal chars. If some string contains a character that is not part of XML, then that string is not XML by definition.
The character you're concerned about is part of Unicode. As XML is based on Unicode, this is good news. So let's name what you aim for:
- Unicode Character 'BULLET' (U+2022)
So you now say it renders as •
. Because U+2022 is encoded as 0xE2 0x80 0xA2 in UTF-8, it is a more or less safe assumption to say that you take an UTF-8 encoded string (that is the default encoding used in XML btw) but command the software that renders it to treat it as some single-byte encoding hence turning the single code-point into three different characters:
- Unicode Character 'LATIN SMALL LETTER A WITH CIRCUMFLEX' (U+00E2)
- Unicode Character 'EURO SIGN' (U+20AC)
- Unicode Character 'CENT SIGN' (U+00A2)
Instead you need to command the rendering application to use the UTF-8 encoding. That should immediately solve your issue. So find the place where you introduce the wrong encoding, you will likely not need to re-encode it, just to properly hint the encoding.
If you wonder which single-byte character-encodings have these three Unicode Characters at the corresponding bytes (0xE2 0x80 0xA2), here is a list. I have highlighted the most popular one of these:
- ISO-8859-15 (Latin 9)
- OEM 858 (Multilingual Latin I + Euro)
- Windows 1252 (Latin I)
- Windows 1254 (Turkish)
- Windows 1256 (Arabic)
- Windows 1258 (Vietnam)
来源:https://stackoverflow.com/questions/16020488/bullet-in-xml