How to convert one particular text column in data-frame to 'utf-8' using python3

社会主义新天地 submitted on 2020-06-29 03:42:25

Question


I have a dataframe with multiple columns, and one column contains text scraped from various links. I tried to convert that column to UTF-8, but it didn't work.

Here is my approach:

df = pd.read_excel('data.xlsx',encoding=sys.getfilesystemencoding())
df['text'] = df['text'].apply(lambda x: x.encode('utf-8').strip())
print(df['text'])

I get text that still contains escaped byte sequences:

b"b'#Thank you, it\xe2\x80\x99s good to be ...

df = pd.read_excel('data.xlsx',encoding=sys.getfilesystemencoding())
df['text'] = df['text']
print(df['text'])

I get the text:

b'#Thank you, it\xe2\x80\x99s good to be here....

df['text'] = df['text'].apply(lambda x: x.decode('utf-8').strip())

AttributeError: 'str' object has no attribute 'decode'

I tried two or three approaches, but none of them worked. Is there an alternative?

Using Python 3.6 and jupyter notebook.


Answer 1:


I'm assuming that the output you showed for the example where the second line is df['text'] = df['text'] ends with a closing '. In other words, each value looks like b'#Thank you, it\xe2\x80\x99s good to be here....':

For some reason you have the printed form of a bytes object stored as a string, which is why you see AttributeError: 'str' object has no attribute 'decode' when you try to decode it. (Ideally, it would be best not to have gotten into this situation; see here for some advice that looks pertinent. Alas, going with what you have ... )
I think at this point you can remove the b' at the start of the string and the ' at the far end, and convert what is left back to bytes. Note that this leaves the backslashes escaped, so that needs to be dealt with, in addition to decoding the bytes to a string in the proper way. Using an approach based on here, you can unescape and decode the bytes.

Putting this together (sort of like how @rolf82 illustrated in the comments) with what you show as df['text'] after df['text'] = df['text'], and given that each value is a string to begin with, the conversion would look like this:
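
As a minimal sketch of how values like this can arise (an assumption on my part; the question does not say how the scraped text was saved), calling str() or repr() on a bytes object instead of decoding it produces exactly this kind of string. The sample text is taken from the question; the variable names are only illustrative.

```python
# Real bytes, as a scraper might receive them.
raw = "#Thank you, it’s good to be here".encode("utf-8")

# repr() (or str()) gives the printed form of the bytes object as a str,
# with the b'...' wrapper and literal backslash escapes baked in.
as_text = repr(raw)

print(type(raw).__name__, type(as_text).__name__)
print(as_text)
print(hasattr(as_text, "decode"))  # str has no .decode in Python 3

# Decoding the bytes directly is what avoids the problem in the first place.
print(raw.decode("utf-8"))
```

If the scraping step is under your control, decoding the bytes there is likely simpler than repairing the column afterwards.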

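import codecs

# The stored value: the printed form of a bytes object, held as a str.
# The r prefix keeps the backslashes literal.
a = r"b'#Thank you, it\xe2\x80\x99s good to be here'"
# But we only want the part between the quotes.
s = bytes(a[2:-1], "utf-8")
# Unescape the \x.. sequences into real bytes, then decode those bytes as UTF-8.
print(codecs.escape_decode(s)[0].decode("utf-8"))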

That gives:

#Thank you, it’s good to be here

Which is what we want.

Now, integrating that with pandas is going to require something extra, because we cannot mark a value that is already stored in a column as a raw string by adding r in front. Based on here and here, it seems the r prefix can be replaced with .encode('unicode-escape').decode(), like:

"#Thank you, it\xe2\x80\x99s good to be here".encode('unicode-escape').decode()

So pulling it all together I'd replace your second line with this:

import codecs
df['text'] = df['text'].apply(lambda x: codecs.escape_decode(bytes(x[2:-1].encode('unicode-escape').decode(), "utf-8"))[0].decode('utf-8').strip())
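
Whether this line has any effect seems to depend on what the cells actually hold. The sketch below wraps the same expression in a function (the name answer_conversion and the two sample cells are hypothetical) and runs it on both possibilities: a cell containing the real characters U+00E2, U+0080, U+0099, and a cell containing literal backslash sequences. Note that codecs.escape_decode is an undocumented CPython helper, so treat this as an illustration rather than a guaranteed API.

```python
import codecs

def answer_conversion(x):
    # Same expression as the lambda above, wrapped in a named function.
    return codecs.escape_decode(
        bytes(x[2:-1].encode("unicode-escape").decode(), "utf-8")
    )[0].decode("utf-8").strip()

# Case 1: the cell holds the real characters U+00E2, U+0080, U+0099.
cell_chars = "b'#Thank you, it\xe2\x80\x99s good to be here'"
# Case 2: the cell holds literal backslash sequences (raw string literal).
cell_literal = r"b'#Thank you, it\xe2\x80\x99s good to be here'"

fixed_chars = answer_conversion(cell_chars)
fixed_literal = answer_conversion(cell_literal)

print(fixed_chars)    # the three characters are turned back into one apostrophe
print(fixed_literal)  # backslashes were doubled, then restored: text is unchanged
```

In the second case 'unicode-escape' doubles each existing backslash and escape_decode simply undoes that, so the \xe2\x80\x99 sequence survives as text.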

If that doesn't work, also try leaving off the .decode() after .encode('unicode-escape'), which is:

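```python
import codecs
# .encode('unicode-escape') already returns bytes, so the bytes(..., "utf-8") wrapper
# is dropped as well: bytes(b'...', "utf-8") raises a TypeError.
df['text'] = df['text'].apply(lambda x: codecs.escape_decode(x[2:-1].encode('unicode-escape'))[0].decode('utf-8').strip())
```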
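
If the cells hold literal backslash sequences, as the printed output in the question suggests, another option is to let Python parse the b'...' text back into a bytes object with ast.literal_eval and then decode it. This is a swapped-in technique, not the approach above; the helper name bytes_repr_to_text and the sample values are hypothetical.

```python
import ast

def bytes_repr_to_text(value):
    """Turn a str such as "b'it\\xe2\\x80\\x99s'" into the text it encodes."""
    if isinstance(value, str) and value[:2] in ("b'", 'b"'):
        try:
            parsed = ast.literal_eval(value)  # safely parses the bytes literal
        except (ValueError, SyntaxError):
            return value  # not a well-formed literal: leave it untouched
        if isinstance(parsed, bytes):
            return parsed.decode("utf-8", errors="replace").strip()
    return value  # ordinary text or non-string values pass through

samples = [
    r"b'#Thank you, it\xe2\x80\x99s good to be here'",
    "already plain text",
    "b'unterminated",
]
converted = [bytes_repr_to_text(s) for s in samples]
print(converted)
```

With a dataframe, the same helper could be applied as df['text'] = df['text'].apply(bytes_repr_to_text). Values that do not look like a bytes literal are returned unchanged, so mixed columns should be safe to pass through it.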


Source: https://stackoverflow.com/questions/60640682/how-to-convert-one-particular-text-column-in-data-frame-to-utf-8-using-python3
