Detect encoding and make everything UTF-8

暗喜 2020-11-22 03:03

I'm reading lots of text from various RSS feeds and inserting it into my database.

Of course, there are several different character encodings used in the fee

24 answers

没有蜡笔的小新 (original poster)

2020-11-22 03:06
Your encoding looks like you encoded into UTF-8 twice; that is, from some other encoding, into UTF-8, and again into UTF-8. As if you had ISO 8859-1, converted from ISO 8859-1 to UTF-8, and treated the new string as ISO 8859-1 for another conversion into UTF-8.

Here's some pseudocode of what you did:
```
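$inputstring = getFromUser();  // placeholder: raw bytes in $current_encoding
// First conversion: correct, yields valid UTF-8.
$utf8string = iconv($current_encoding, 'utf-8', $inputstring);
// Second conversion: wrong, treats the UTF-8 bytes as $current_encoding again.
$flawedstring = iconv($current_encoding, 'utf-8', $utf8string);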
```
You should try:
1. Detect the encoding using mb_detect_encoding() or whatever you like to use.
2. If it's UTF-8, convert into ISO 8859-1 and repeat step 1.
3. Finally, convert back into UTF-8.
That presumes the "middle" conversion used ISO 8859-1. If you used Windows-1252, convert into Windows-1252 instead (it is often loosely called latin1, although Latin-1 proper is ISO 8859-1). The original source encoding is not important; the one you used in the flawed second conversion is.
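The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `fix_double_utf8` and its `$middle` parameter are hypothetical names, the mbstring extension is assumed to be available, and `mb_check_encoding()` stands in for "detect encoding" because it gives a strict yes/no answer for UTF-8.

```php
<?php
// Hypothetical helper, not from the original answer.
// $middle is the encoding assumed for the flawed second conversion.
function fix_double_utf8(string $s, string $middle = 'ISO-8859-1'): string
{
    // Steps 1 and 2: while the bytes are valid, non-ASCII UTF-8, peel one layer.
    while (preg_match('/[\x80-\xFF]/', $s) && mb_check_encoding($s, 'UTF-8')) {
        $peeled = mb_convert_encoding($s, $middle, 'UTF-8');
        // If the round trip is lossy, some character does not exist in $middle,
        // so $s cannot be the product of a second conversion; keep it as is.
        if (mb_convert_encoding($peeled, 'UTF-8', $middle) !== $s) {
            return $s;
        }
        $s = $peeled;
    }
    // Step 3: $s is now ASCII or $middle-encoded bytes; convert back to UTF-8.
    return mb_convert_encoding($s, 'UTF-8', $middle);
}

// "ü" as UTF-8 is C3 BC; encoding it a second time gives C3 83 C2 BC.
$double = "\xC3\x83\xC2\xBC";
echo bin2hex(fix_double_utf8($double)), "\n"; // prints c3bc
```

One caveat with this approach: text that legitimately contains a sequence such as "Ã¼" is indistinguishable from double-encoded text and would be "repaired" too, so it is probably best applied only to feeds known to be affected.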

This is my guess at what happened; there's very little else you could have done to get four bytes in place of one extended ASCII byte.
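To make the byte counts concrete, here is a small sketch of how one byte grows to four. It assumes the source text was ISO 8859-1 and uses `mb_convert_encoding()` from the mbstring extension; the variable names are illustrative only.

```php
<?php
// Assumption: the source text is ISO 8859-1, where "ü" is the single byte FC.
$latin1 = "\xFC";
$once   = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1'); // correct conversion
$twice  = mb_convert_encoding($once, 'UTF-8', 'ISO-8859-1');   // flawed repeat

printf("%d byte(s): %s\n", strlen($latin1), bin2hex($latin1)); // 1 byte(s): fc
printf("%d byte(s): %s\n", strlen($once), bin2hex($once));     // 2 byte(s): c3bc
printf("%d byte(s): %s\n", strlen($twice), bin2hex($twice));   // 4 byte(s): c383c2bc
```

Each of the two bytes in the correct UTF-8 form is itself above 0x7F, so the second conversion turns each one into a two-byte sequence.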

The German language also uses ISO 8859-2 and Windows-1250 (Latin-2).