Decode &#55357; to real character

Submitted by 人盡茶涼 on 2020-01-23 06:46:29

Question


I read data from the Twitter Streaming API and then write it to an XML file.

But some special character sequences like &#55357; cause an error (I mean that when I open the XML file in Chrome, Chrome reports an error at that character).

I want to convert that encoded sequence (&#55357;) into the real character before writing to the XML file.

How to implement this?

-------------ADDED--------------

This is the XML file content:

<?xml version="1.0" encoding="UTF-8"?>
<root>
<text>@carlyraejepsen would be a dream if you follow me, please follow me?, I love you so much you're my inspiration</text>
<text>someone please bring me a caramel apple and a mocha from black cat. i'll love you forever</text>
<text>“@G_MartinFlyKick: Marry me Juliet.I love you and that's all I really know.”&#55357;&#56834;&#55357;&#56834;&#55357;&#56834;&#55357;&#56834;&#55357;&#56834;</text>
<text>"I need to see a picture of him cuz Im trying to imagine you guys making love and all I see is u climbing on top of a big question mark"lmao</text>
<text>@District3music hi, I LOVE YOU follow me please? &amp;lt;3 xx 23</text>
<text>RT @syardley_: So appreciative of my family and people I love, wouldn't be where I am without them. #thankful</text>
<text>#DISTRICT3HALLOWEENFOLLOWSPREE #DISTRICT3HALLOWEENFOLLOWSPREE #3EEKERFROMTHENETHERLANDS love you! Please follow ? @District3music x42</text>
<text>Arguably my favorite electronic music producer @Kluteuk is coming back to Toronto on Dec 22nd. So stoked. Guy has made so many tunes I LOVE.</text>
<text>The stakes are high, the water's rough, but this love is ours.</text>
<text>@NiallOfficial Answer me, I love you very much. Venezuela loves. jhgj</text>
<text>Love this shit http://t.co/qSP79NKx</text>
</root>

And here is the error from Chrome:

This page contains the following errors:

error on line 5 at column 91: xmlParseCharRef: invalid xmlChar value 55357
Below is a rendering of the page up to the first error.

Answer 1:


The character reference &#55357; denotes a surrogate code point (U+D83D), so it would be wrong to try to convert it to a character. It is not a character, not even half a character.
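To make this concrete, the following Python sketch checks whether a decimal reference value falls in the surrogate range U+D800 to U+DFFF. The helper name `is_surrogate` is made up for illustration; the two values are the ones that appear in the XML file from the question.

```python
def is_surrogate(code_point: int) -> bool:
    """Return True if code_point lies in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


# The two decimal values used in the character references in the question.
for value in (55357, 56834):
    print(value, hex(value), is_surrogate(value))
# prints:
# 55357 0xd83d True
# 56834 0xde02 True
```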

You need to track back to the point where the reference was generated. The reason might be a character encoding confusion. In UTF-16, surrogate code units may appear, but they must be handled in pairs when the data is interpreted as characters and, for example, converted to another encoding or turned into character references.
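If the references cannot be fixed where they are generated, one possible repair, sketched here in Python, is to merge each high/low surrogate pair of decimal references into a single reference to the real code point. This is an illustration, not something either answer prescribes. The function name `combine_surrogate_refs` is hypothetical, it only handles decimal references (not hexadecimal `&#x...;`), and replacing an unpaired surrogate with U+FFFD is an assumption about the desired behavior.

```python
import re

RUN = re.compile(r'(?:&#\d+;)+')   # one or more adjacent decimal references
REF = re.compile(r'&#(\d+);')      # a single decimal reference


def combine_surrogate_refs(xml_text: str) -> str:
    """Merge surrogate-pair references into one reference per code point."""

    def fix_run(match):
        codes = [int(n) for n in REF.findall(match.group(0))]
        out = []
        i = 0
        while i < len(codes):
            hi = codes[i]
            lo = codes[i + 1] if i + 1 < len(codes) else None
            if 0xD800 <= hi <= 0xDBFF and lo is not None and 0xDC00 <= lo <= 0xDFFF:
                # Standard UTF-16 decoding of a high/low surrogate pair.
                cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
                out.append(f'&#{cp};')
                i += 2
            elif 0xD800 <= hi <= 0xDFFF:
                # Unpaired surrogate: substitute U+FFFD (assumed policy).
                out.append('&#65533;')
                i += 1
            else:
                # Ordinary reference: keep it.
                out.append(f'&#{hi};')
                i += 1
        return ''.join(out)

    return RUN.sub(fix_run, xml_text)


print(combine_surrogate_refs('know.&#55357;&#56834;&#55357;&#56834;'))
# prints: know.&#128514;&#128514;
```

Applied to the third `<text>` element in the question, each `&#55357;&#56834;` pair becomes `&#128514;` (U+1F602), which is a valid XML character reference.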




Answer 2:


You can use a regular expression to strip these references from the server response. A simple example in Python:

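import re

# SERVER_RESPONSE is a placeholder for the raw text received from the server.
# Match each whole decimal reference (e.g. &#55357;), not just "&#"; this also drops valid ones.
pattern = re.compile(r'&#\d+;')
new_content = pattern.sub(' ', SERVER_RESPONSE)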


Source: https://stackoverflow.com/questions/13165408/decode-55357-to-real-character
