Filtering text encoded with UTF-8 to only contain Latin alphabet characters

Question

I'm trying to filter text data so that it only contains Latin characters, for further text analysis. The original text source most likely contained the Korean alphabet. It shows up like this in the text file:

\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION

What would be the fastest/easiest/most complete way to remove these? I tried writing a script that removes all \xXX combinations, but it turns out there are too many exceptions for this to be reliable.

Is there a way to remove all non-Latin characters from UTF-8 encoded text?

Thanks in advance.

SOLUTION:

import string

textin = b'\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION'.decode('UTF-8')
outtext = ''

# Keep only printable ASCII characters (letters, digits, punctuation, whitespace)
for char in textin:
    if char in string.printable:
        outtext += char

print(outtext)

My data was stored as bytes for some reason, don't ask me why. :D
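
As a possible shortcut, assuming the input is still a bytes object and only ASCII needs to survive, the filtering can happen in the decode step itself. This is a minimal sketch, not the method used above: every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so decoding as ASCII with errors='ignore' drops those bytes. Unlike the string.printable loop, it also keeps ASCII control characters. The variable names are illustrative only.

```python
raw = b'\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION'

# Bytes >= 0x80 are not valid ASCII, so errors='ignore' silently discards them.
ascii_only = raw.decode('ascii', errors='ignore')

print(ascii_only)
```

Note that this also strips accented Latin letters such as é, because they are encoded as multi-byte sequences in UTF-8.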

Answer 1:

What about this:

import string

intext = b'<your funny characters>'
outtext = ''

for char in intext.decode('utf-8'):
    if char in string.ascii_letters:
        outtext += char

I'm not sure this is what you want however. For the given intext, outtext is empty. If you append string.digits to string.ascii_letters, outtext is '11'.

(edited to fix a mistake in the code, pointed out by OP)
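
If "Latin" is meant to include accented letters and not just the 52 letters in string.ascii_letters, one option is to look at each character's Unicode name. The sketch below is an assumption about what the asker wants, not part of the original answer: it keeps characters whose Unicode name starts with "LATIN", plus ASCII digits, ASCII punctuation, and the space character. The helper names is_latin and keep_latin are hypothetical.

```python
import string
import unicodedata

# Non-letter characters that are kept unchanged (an arbitrary choice for this sketch).
KEEP_AS_IS = frozenset(string.digits + string.punctuation + " ")


def is_latin(ch):
    """Return True if the character's Unicode name starts with 'LATIN'."""
    # The second argument is returned for characters that have no name.
    return unicodedata.name(ch, "").startswith("LATIN")


def keep_latin(text):
    """Drop every character that is neither Latin nor explicitly allowed."""
    return "".join(ch for ch in text if ch in KEEP_AS_IS or is_latin(ch))


sample = b'\xe7\xac\xac8\xe4\xbd\x8d Caf\xc3\xa9 ONE PIECE'.decode('utf-8')
print(keep_latin(sample))
```

Building the result with str.join avoids repeated string concatenation in a loop, which matters for large inputs.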

Answer 2:

When reading the CSV file, try specifying the encoding:

import pandas as pd
df = pd.read_csv('D:/sample.csv', encoding="utf-8-sig")
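
For context, the 'utf-8-sig' codec differs from 'utf-8' only in that it strips a leading byte order mark (BOM) when decoding. The standard-library sketch below uses made-up sample bytes to show the difference. It does not remove non-Latin characters, so on its own it may not be enough for the filtering asked about in the question.

```python
# Hypothetical file content: a UTF-8 BOM followed by a CSV header line.
data = b'\xef\xbb\xbfname,value\n'

with_bom = data.decode('utf-8')         # BOM survives as U+FEFF at position 0
without_bom = data.decode('utf-8-sig')  # BOM is stripped during decoding

print(repr(with_bom))
print(repr(without_bom))
```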

Source: https://stackoverflow.com/questions/46059104/filtering-text-encoded-with-utf-8-to-only-contain-latin-alphabet-characters

Tags

python

encoding

utf-8
