Decode Characters Pandas

落爺英雄遲暮 submitted on 2021-01-27 13:33:10

Question


Below is a sample of my DataFrame:

ROLE                        NAME
GESELLSCHAFTER              DUPONT DUPONT
GESCHÃ¤FTSFÃ¼HRER           DUPONT DUPONT
KOMPLEMENTÃ¤R               DUPONT DUPONT
GESELLSCHAFTER              DUPONT DUPONT
KOMPLEMENTÃ¤R               DUPONT DUPONT

The aim is to fix the special characters.
For example, 'KOMPLEMENTÃ¤R' should become 'KOMPLEMENTAR' (with or without the accent; it doesn't really matter).

So I built a dictionary and replaced each garbled sequence in the column using the mapping below.

dic = {'A¤':'A', 'A–':'A', 'A¶':'A', 'A€':'A', 'Aƒ':'A', 'A„':'A', 'A\…':'A', 'A¡':'A', 
'A¢':'A', 'A£':'A', 'A¥':'A', 'A¦':'A', 
'A©':'E', 'Aˆ':'E', 'A‰':'E', 'AŠ':'E', 'A‹':'E', 
'AŒ':'I', 'AŽ':'I', 'A¬':'I', 'A­':'I', 'A®':'I', 'A¯':'I',
'A“':'O', 'A”':'O', 'A•':'O', 'A–':'O', 'A°':'O', 'A²':'O', 'A³':'O', 'A´':'O', 'Aµ':'O', 'A¶':'O',
 'A¼':'U', 'A™':'U', 'Aš':'U', 'Aœ':'U', 'A¹':'U', 'Aº':'U', 'A»':'U', 'ÿ':'U'}

# regex=False: treat each key as literal text, not as a pattern
for key, value in dic.items():
    df['ROLE'] = df['ROLE'].str.replace(key, value, regex=False)

However, I was wondering whether there is a better way to deal with this, perhaps with a regular expression?

Below is a solution that works when printing a single string:

nfd_example = 'KOMPLEMENTÃ¤R'
print(nfd_example.encode('cp1252').decode('utf-8-sig'))

Output:
KOMPLEMENTäR

However, when I try the same code on a pandas column, I get the following error:

df['ROLE_decode'] = df['ROLE'].str.encode('cp1252').str.decode('utf-8-sig')
'utf-8' codec can't decode byte 0xc4 in position 6: invalid continuation byte
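
A likely reading of this error: byte 0xc4 is 'Ä' in cp1252, so some values in the column already contain a correct 'Ä', and the round trip only makes sense for the garbled ones. A minimal sketch of a per-value repair that leaves such values alone is shown below. The helper name `fix_mojibake` and the sample list are illustrative assumptions, not part of the original post, and the garbled text is written with escapes so the example does not depend on how this page is encoded.

```python
def fix_mojibake(text):
    """Undo 'UTF-8 bytes read as cp1252'; return the text unchanged if that does not apply."""
    try:
        return text.encode('cp1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The value was not garbled in this particular way (for example it
        # already holds a real 'Ä'), so keep it as it is.
        return text

roles = [
    'KOMPLEMENT\u00c3\u00a4R',  # garbled: 'Ã' + '¤' stands in for 'ä'
    'AKTION\u00c4R',            # already correct: a real 'Ä'
    'AUFSICHTSRAT',             # plain ASCII
]
fixed = [fix_mojibake(r) for r in roles]
print(fixed)
```

On a DataFrame this could be applied with something like df['ROLE'].map(fix_mojibake), assuming the column holds only strings; missing values would need separate handling.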

EDIT

Below is the list of unique values from the ROLE column:

AKTIONÃ¤R                                 133
AKTIONÄR                                   11
AUFSICHTSRAT                              450
AUSÃ¼BENDE PERSON                         688
AUSÜBENDE PERSON                          131
DEFAULT KEY                                62
GESCHÃ¤FTSFÃ¼HRENDER DIREKTOR               2
GESCHÃ¤FTSFÃ¼HRER                        9555

When using the code below:

from unidecode import unidecode

df['ROLE_decode'] = df['ROLE'].str.encode('cp1252').str.decode('utf-8-sig', 'ignore').apply(unidecode)

it gives me the following unique values:

AKTIONR                                   11
AKTIONaR                                 133
AUFSICHTSRAT                             450
AUSBENDE PERSON                          131
AUSuBENDE PERSON                         688
DEFAULT KEY                               62
GESCHFTSFHRER                            797
GESCHaFTSFuHRENDER DIREKTOR                2
GESCHaFTSFuHRER                         9555

So, if anyone has an idea, thanks for your help!


Answer 1:


You can pass regex=True to replace:

# the included dic seems to have `A` instead of 'Ã'
dic = {'Ã¼':'U', 'Ã¤':'A'}

df['ROLE'] = df['ROLE'].replace(dic, regex=True)

Output:

              ROLE           NAME
0   GESELLSCHAFTER  DUPONT DUPONT
1  GESCHAFTSFUHRER  DUPONT DUPONT
2     KOMPLEMENTAR  DUPONT DUPONT
3   GESELLSCHAFTER  DUPONT DUPONT
4     KOMPLEMENTAR  DUPONT DUPONT
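
For a longer mapping, the same idea can be sketched with the standard re module: build one alternation from the keys and replace everything in a single pass. The names `mapping`, `pattern` and `replace_all` are illustrative assumptions. The keys are escaped so that characters such as a backslash are matched literally, and the garbled sequences are written with escapes so the example does not depend on how this page is encoded.

```python
import re

# Garbled sequence -> replacement ('Ã¼' -> 'U', 'Ã¤' -> 'A').
mapping = {'\u00c3\u00bc': 'U', '\u00c3\u00a4': 'A'}

# One alternation over all keys; longer keys come first so that a short key
# can never shadow a longer one that starts with the same characters.
pattern = re.compile(
    '|'.join(re.escape(key) for key in sorted(mapping, key=len, reverse=True))
)

def replace_all(text):
    """Replace every mapped sequence in text in a single pass."""
    return pattern.sub(lambda match: mapping[match.group(0)], text)

result = replace_all('GESCH\u00c3\u00a4FTSF\u00c3\u00bcHRER')
print(result)
```

pandas accepts a compiled pattern and a callable replacement in Series.str.replace, so the same pattern should also work as df['ROLE'].str.replace(pattern, lambda m: mapping[m.group(0)], regex=True).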



Answer 2:


This solution is quite long and might not perform well on a large dataset. First decompose the characters with unicodedata, then encode to ASCII to drop the accents, and finally decode the bytes back to a string.

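from unicodedata import normalize

# NFD splits each accented letter into a base letter plus a combining mark;
# encoding to ASCII with 'ignore' then drops everything that is not ASCII.
df.ROLE.apply(lambda x: normalize('NFD', x).encode(
    'ascii', 'ignore').decode('ascii'))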

# 0                       AKTIONAR
# 1                       AKTIONAR
# 2                   AUFSICHTSRAT
# 3               AUSABENDE PERSON
# 4               AUSUBENDE PERSON
# 5                    DEFAULT KEY
# 6    GESCHAFTSFAHRENDER DIREKTOR
# 7                GESCHAFTSFAHRER
# Name: ROLE, dtype: object
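
Note that in the output above the garbled 'ü' values come out as 'A' (AUSABENDE, GESCHAFTSFAHRER), because only the 'Ã' half of each garbled pair survives the ASCII step. A possible refinement, sketched here under the assumption that the column mixes garbled and correct values, is to repair the encoding first and strip accents afterwards. The helper names `repair`, `strip_accents` and `clean_role` are hypothetical, and the sample values use escapes so the example does not depend on how this page is encoded.

```python
import unicodedata

def repair(text):
    """Undo 'UTF-8 bytes read as cp1252' where it applies, else return text unchanged."""
    try:
        return text.encode('cp1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

def strip_accents(text):
    """Decompose to NFD and drop the combining marks, keeping the base letters."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

def clean_role(text):
    """Repair the encoding, remove accents, and uppercase the result."""
    return strip_accents(repair(text)).upper()

samples = [
    'AKTION\u00c3\u00a4R',          # garbled 'ä'
    'AKTION\u00c4R',                # correct 'Ä'
    'AUS\u00c3\u00bcBENDE PERSON',  # garbled 'ü'
    'AUS\u00dcBENDE PERSON',        # correct 'Ü'
]
cleaned = [clean_role(s) for s in samples]
print(cleaned)
```

With this ordering both spellings of each role collapse to the same value, so df['ROLE'].map(clean_role) would give one row per role in value_counts(), again assuming the column contains only strings.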


Source: https://stackoverflow.com/questions/61680599/decode-characters-pandas
