Filtering UTF-8 encoded text to contain only Latin alphabet characters

随声附和 submitted on 2020-01-05 05:11:08

Question


I'm trying to filter text data so that it contains only Latin characters, for further text analysis. The original text source most likely contained the Korean alphabet, which shows up like this in the text file:

\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION

What would be the fastest/easiest/most complete way to remove these? I tried writing a script that would remove all \xXX combinations, but it turns out that there are too many exceptions for this to be reliable.

Is there a way to remove all non-Latin characters from UTF-8 encoded text?

Thanks in advance.

SOLUTION:

import string

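# The raw data is a bytes object, so decode it to str first
textin = b'\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION'.decode('UTF-8')
outtext = ''

# Keep only printable ASCII characters (letters, digits, punctuation, whitespace)
for char in textin:
    if char in string.printable:
        outtext += char

print(outtext)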

My data had been stored as bytes for some reason, don't ask me why. :D
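
For comparison, here is a minimal sketch of a more compact variant of the same idea, assuming Python 3 and that dropping everything outside ASCII is acceptable. The variable names `raw`, `text` and `ascii_only` are only illustrative. Unlike the `string.printable` check, this keeps ASCII control characters too, so it is not an exact equivalent.

```python
raw = b'\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION'

# Decode the bytes to str first: each run of \xXX escapes is one
# multi-byte UTF-8 character, not several independent ones.
text = raw.decode('utf-8')

# Re-encode as ASCII and silently drop every character that has no
# ASCII representation, then decode back to str.
ascii_only = text.encode('ascii', errors='ignore').decode('ascii')

print(ascii_only)
```

This also hints at why stripping \xXX patterns from the text by hand is fragile: the escapes are only how the bytes are displayed, and a single character can span two, three or four of them.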


Answer 1:


What about this:

import string

intext = b'<your funny characters>'
outtext = ''

for char in intext.decode('utf-8'):
    if char in string.ascii_letters:
        outtext += char

I'm not sure this is what you want, however. For the given intext, outtext is empty. If you add string.digits to string.ascii_letters, outtext is '11'.

(edited to fix a mistake in the code, pointed out by OP)
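
If "Latin" should also cover accented letters such as é, which `string.ascii_letters` drops, one possible approach is to check each character's Unicode name. This is only a heuristic sketch: the helper `keep_latin`, its `extra` parameter and the sample string are made up for illustration, and the name check is an assumption about what counts as Latin here.

```python
import unicodedata


def keep_latin(text, extra=' '):
    """Keep alphabetic characters whose Unicode name starts with 'LATIN',
    plus any characters listed in `extra` (a space by default)."""
    return ''.join(
        ch for ch in text
        if ch in extra
        or (ch.isalpha() and unicodedata.name(ch, '').startswith('LATIN'))
    )


# Illustrative sample: two CJK characters, a digit, and an accented word.
sample = b'\xe7\xac\xac8\xe4\xbd\x8d Caf\xc3\xa9 ONE PIECE'.decode('utf-8')

print(keep_latin(sample))
```

Digits and punctuation are removed as well unless they are passed in `extra`, for example `keep_latin(sample, extra=' 0123456789')`.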




Answer 2:


When reading the CSV file, try specifying the encoding:

import pandas as pd
df = pd.read_csv('D:/sample.csv', encoding="utf-8-sig")
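
To see what the "utf-8-sig" codec actually changes, here is a small standard-library sketch. The byte string stands in for hypothetical file contents. Note that this codec only strips a leading byte order mark; it does not filter out non-Latin characters on its own.

```python
import codecs

# Hypothetical file contents: a UTF-8 BOM followed by a CSV header line.
raw = codecs.BOM_UTF8 + b'title,rank\n'

plain = raw.decode('utf-8')         # the BOM survives as U+FEFF at the start
stripped = raw.decode('utf-8-sig')  # the BOM is removed during decoding

print(repr(plain))
print(repr(stripped))
```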


Source: https://stackoverflow.com/questions/46059104/filtering-text-encoded-with-utf-8-to-only-contain-latin-alphabet-characters
