Detect encoding and make everything UTF-8

暗喜 2020-11-22 03:03

I'm reading out lots of texts from various RSS feeds and inserting them into my database.

Of course, there are several different character encodings used in the feeds.

24 Answers
  •  没有蜡笔的小新
    2020-11-22 03:06

    Your text looks like it was encoded into UTF-8 twice: from some other encoding into UTF-8, and then into UTF-8 again. It is as if you had ISO 8859-1 data, converted it from ISO 8859-1 to UTF-8, and then treated the resulting string as ISO 8859-1 for another conversion into UTF-8.

    Here's some pseudocode of what you did:

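    $inputstring = getFromUser();
    // First conversion: correct, the input really is in $current_encoding
    $utf8string = iconv($current_encoding, 'utf-8', $inputstring);
    // Second conversion: wrong, the input is already UTF-8 but is treated as $current_encoding again
    $flawedstring = iconv($current_encoding, 'utf-8', $utf8string);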
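
    To make the pseudocode above concrete, here is a minimal runnable sketch. It assumes the input is the single character "ü" stored as one ISO 8859-1 byte; the sample byte and the variable names are only for illustration, not taken from the question.

```php
<?php
// Assumed sample input: "ü" as a single ISO 8859-1 byte (0xFC).
$inputString = "\xFC";

// First conversion: correct. Yields the two-byte UTF-8 sequence C3 BC.
$utf8String = iconv('ISO-8859-1', 'UTF-8', $inputString);

// Second conversion: the bug. The UTF-8 bytes are read as ISO 8859-1 again,
// so each of the two bytes is turned into its own two-byte UTF-8 sequence.
$flawedString = iconv('ISO-8859-1', 'UTF-8', $utf8String);

echo bin2hex($inputString), "\n";   // fc
echo bin2hex($utf8String), "\n";    // c3bc
echo bin2hex($flawedString), "\n";  // c383c2bc
```

    One byte in, four bytes out: that is the pattern to look for in the stored data.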
    

    You should try:

    1. Detect the encoding using mb_detect_encoding() or whatever you like to use.
    2. If it's UTF-8, convert it from UTF-8 into ISO 8859-1 and repeat step 1.
    3. Finally, convert the result back into UTF-8.

    That presumes you used ISO 8859-1 in the "middle" conversion. If you used Windows-1252 (often loosely called latin1), convert into Windows-1252 instead. The original source encoding is not important; the one you used in the flawed second conversion is.
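
    The three steps above could be sketched roughly as follows. This is only one possible reading of them: the function name undoDoubleUtf8 and the parameters $intermediate and $maxRounds are hypothetical, and I use mb_check_encoding() as a strict "is this valid UTF-8" test in place of mb_detect_encoding(). It requires the mbstring and iconv extensions.

```php
<?php
/**
 * Peel off repeated UTF-8 encoding layers (a sketch, not a general-purpose fixer).
 * $intermediate is the encoding that was wrongly assumed in the flawed conversion.
 */
function undoDoubleUtf8(string $text, string $intermediate = 'ISO-8859-1', int $maxRounds = 5): string
{
    // Steps 1 and 2: while the bytes still parse as UTF-8, remove one layer.
    while ($maxRounds-- > 0 && mb_check_encoding($text, 'UTF-8')) {
        $decoded = @iconv('UTF-8', $intermediate, $text);
        if ($decoded === false || $decoded === $text) {
            // Either the text has characters outside $intermediate, or it is
            // plain ASCII: treat it as clean UTF-8 and leave it alone.
            return $text;
        }
        $text = $decoded;
    }
    // Step 3: whatever is left in $intermediate goes back into UTF-8.
    return mb_check_encoding($text, 'UTF-8') ? $text : iconv($intermediate, 'UTF-8', $text);
}

echo bin2hex(undoDoubleUtf8("\xC3\x83\xC2\xBC")), "\n"; // c3bc, i.e. "ü" in UTF-8
```

    Be aware that this cannot tell a double-encoded "ü" from text that was really meant to read "Ã¼", so it is best run only on data you already know went through the flawed conversion. Pass 'Windows-1252' as the second argument if that was the encoding used in the middle step.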

    This is my guess at what happened; there's very little else you could have done to get four bytes in place of one extended ASCII byte.

    German text can also show up in ISO 8859-2 (Latin-2) or Windows-1250, since both of those cover the German umlauts and ß.
