I use cURL to get content from another site, but I don't know why it is automatically converted from UTF-8 to ISO 8859-1, as follows:
On the site (abc.com):
Cửa Hàng Chip Chip: Rộn ràng đón Giáng sinh với những vật phẩm trang trí Noel đầy màu sắc của CHIPCHIP GIFT SHOP
But when I use cURL to get the content from that site, I get the following:
Cửa Hàng Chip Chip: Rộn ràng đón Giáng sinh với những vật phẩm trang trí Noel đầy màu sắc của CHIPCHIP GIFT SHOP
So how do I convert it to UTF-8?
I'd recommend using iconv. Running iconv --list gives you a list of all known encodings, and you can then use iconv -f FROM_ENCODING -t TO_ENCODING to do your conversion. It can also read from stdin, so curl's output can be piped straight into it.
But regarding the comment you got on your question: it seems the page's author didn't care about using the correct encoding and decided to stick with (old-style?) &auml; and the like.
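For comparison, here is a minimal Python sketch of the same kind of conversion iconv performs: decode the bytes from the source encoding, then encode them in the target one. The helper name reencode and the sample word are made up for illustration, and this only helps when the bytes really are in the source encoding.

```python
def reencode(data: bytes, from_encoding: str, to_encoding: str) -> bytes:
    """Decode bytes from one encoding and re-encode them in another."""
    return data.decode(from_encoding).encode(to_encoding)


# Illustrative input: "café" stored as ISO-8859-1 bytes.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")

# Roughly what `iconv -f ISO-8859-1 -t UTF-8` would produce for the same input.
utf8_bytes = reencode(latin1_bytes, "iso-8859-1", "utf-8")
print(utf8_bytes)
```

If the downloaded text contains HTML entities rather than wrongly encoded bytes, re-encoding like this leaves the entities untouched.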
Put your string in a variable and use the following function:
$var = "";
echo utf8_encode($var);
Judging from the line you pasted, the problem appears to be with HTML entities, not with character encoding. The encoded characters look fine to me.
You need to translate those HTML entities into encoded characters. Which tool to use will depend on your environment or programming language. I don't think it can be done with cURL alone.
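PHP has html_entity_decode() (htmlspecialchars_decode() only handles a few special characters such as &amp; and &lt;). Python has unescape(), in the HTMLParser module on Python 2 and in the html module on Python 3.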
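As a rough illustration of the Python route, the sketch below uses html.unescape from the Python 3 standard library. The sample string is an assumption: it guesses at how the first few words of the page might look with entities, since the exact entities in the downloaded markup are not shown in the question.

```python
import html

# Assumed sample: a fragment of the page title written with HTML entities.
raw = "C&#7917;a H&agrave;ng Chip Chip"

# html.unescape turns numeric and named entities into the characters they stand for.
decoded = html.unescape(raw)
print(decoded)
```

The result is an ordinary Python string, which can then be written out as UTF-8 like any other text.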
curl does not convert anything; it downloads things as is.
What you see are character entities, which are valid HTML, and it is the browser that does the conversion to a readable form.
You can check this by opening the file saved by curl in a browser. It will look like the live page.
Your files aren’t being converted to another encoding. They’re using HTML character entities. You need to convert those entities, such as &eacute;, to UTF-8 characters, such as é. This takes one extra line of code after you convert to UTF-8, if you even need to do that.
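To make the "one extra line" concrete, here is a hedged Python sketch of both steps on a downloaded body: decode the bytes using the page's charset, then decode the entities. The function name to_readable_text and the sample bytes are hypothetical, and the charset is assumed to be UTF-8 unless you know otherwise.

```python
import html


def to_readable_text(body: bytes, charset: str = "utf-8") -> str:
    """Decode raw response bytes, then replace HTML entities with characters."""
    text = body.decode(charset)   # bytes -> str, using the page's charset
    return html.unescape(text)    # the one extra line: entities -> characters


# Hypothetical body mixing an entity (&eacute;) with a raw UTF-8 character (è).
body = b"caf&eacute; cr\xc3\xa8me"
print(to_readable_text(body))
```

Writing the returned string to a file with encoding="utf-8" then yields UTF-8 output with no entities left in it.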
Source: https://stackoverflow.com/questions/8253914/how-to-convert-iso-8859-1-characters-to-utf-8