I am trying to clean all of the HTML out of a string so the final output is a text file. I have done some research on the various 'converters' and am starting to lean tow…
You can convert it to unicode in this way:
print u'Hello, \xa0World' # print Hello, World
Look at the codecs standard library, specifically the encode and decode methods provided in the Codec base class.
There's also a good article here that puts it all together.
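As a rough illustration of that decode/encode round trip (a minimal sketch in Python 2; it assumes the input is a Latin-1 byte string, and the output file name is made up):

import codecs

raw = 'Hello, \xa0World'              # byte string; 0xa0 is Latin-1 for a non-breaking space
text = raw.decode('latin-1')          # bytes -> unicode: u'Hello, \xa0World'
utf8 = text.encode('utf-8')           # unicode -> UTF-8 bytes: 'Hello, \xc2\xa0World'

out = codecs.open('output.txt', 'w', encoding='utf-8')   # codecs encodes on write
out.write(text.replace(u'\xa0', u' '))
out.close()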
Instead of this, it's better to use standard python features.
For example:
string = unicode('Hello, \xa0World', 'utf-8', 'replace')
or
string = unicode('Hello, \xa0World', 'utf-8', 'ignore')
where 'replace' substitutes any byte that cannot be decoded (the \xa0 here) with the official Unicode replacement character U+FFFD. But if \xa0 is really not meaningful for you and you just want it removed, then use 'ignore'.
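To see the difference (a quick check in a Python 2 shell; the lone 0xa0 byte is not valid UTF-8, so each handler treats it as an error):

>>> unicode('Hello, \xa0World', 'utf-8', 'replace')
u'Hello, \ufffdWorld'
>>> unicode('Hello, \xa0World', 'utf-8', 'ignore')
u'Hello, World'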
Just a note regarding HTML cleaning: it is very, very hard, since

<
body
>

is a valid way to write HTML (the tag can be spread over several lines). Just an fyi.
Maybe you should be doing:

s = unicodestring.replace(u'\xa0', u'')
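For example (a quick sketch with a made-up sample string), on a unicode string this works directly:

>>> u'Hello,\xa0World'.replace(u'\xa0', u' ')
u'Hello, World'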
s = unicodestring.replace('\xa0', '')

..is writing \xa0 as a plain byte string, and \xa0 is not a valid character in an ASCII string (the default string type in Python until version 3.x), so combining it with a unicode string triggers an implicit ASCII decode that fails.
The reason r'\xa0' did not error is that in a raw string, escape sequences have no effect. Rather than turning \xa0 into the corresponding character, it saw the string as a literal backslash, a literal x, and so on.
The following are the same:
>>> r'\xa0'
'\\xa0'
>>> '\\xa0'
'\\xa0'
This is resolved in Python 3, where the default string type is unicode, so you can just do:
>>> '\xa0'
'\xa0'
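That type mismatch is also the underlying problem in Python 2; as a quick illustration (made-up sample string), replacing with the byte-string pattern on a unicode value triggers an implicit ASCII decode:

>>> u'Hello,\xa0World'.replace('\xa0', u'')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)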
I am trying to clean all of the HTML out of a string so the final output is a text file
I would strongly recommend BeautifulSoup for this. Writing an HTML cleaning tool is difficult (given how horrible most HTML is), and BeautifulSoup does a great job at both parsing HTML and dealing with Unicode:
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<html><body><h1>Hi</h1></body></html>")
>>> print soup.prettify()
<html>
 <body>
  <h1>
   Hi
  </h1>
 </body>
</html>
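And since the goal here is a plain text file rather than pretty-printed HTML, the usual BeautifulSoup 3 idiom is to collect just the text nodes (a small sketch continuing the session above):

>>> u''.join(soup.findAll(text=True))
u'Hi'

From there you can replace(u'\xa0', u' ') on the result and write it out with codecs.open, as sketched earlier.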