Question
Below is a sample of my DataFrame:
ROLE NAME
GESELLSCHAFTER DUPONT DUPONT
GESCHäFTSFüHRER DUPONT DUPONT
KOMPLEMENTäR DUPONT DUPONT
GESELLSCHAFTER DUPONT DUPONT
KOMPLEMENTäR DUPONT DUPONT
The aim is to fix the special characters.
For example, 'KOMPLEMENTäR' should become 'KOMPLEMENTAR' (with or without the accent doesn't really matter).
So I tried to build a dictionary of replacements, named dic, and apply it to the column:
{'A¤':'A', 'A–':'A', 'A¶':'A', 'A€':'A', 'Aƒ':'A', 'A„':'A', 'A\…':'A', 'A¡':'A',
'A¢':'A', 'A£':'A', 'A¥':'A', 'A¦':'A',
'A©':'E', 'Aˆ':'E', 'A‰':'E', 'AŠ':'E', 'A‹':'E',
'AŒ':'I', 'AŽ':'I', 'A¬':'I', 'A':'I', 'A®':'I', 'A¯':'I',
'A“':'O', 'A”':'O', 'A•':'O', 'A–':'O', 'A°':'O', 'A²':'O', 'A³':'O', 'A´':'O', 'Aµ':'O', 'A¶':'O',
'A¼':'U', 'A™':'U', 'Aš':'U', 'Aœ':'U', 'A¹':'U', 'Aº':'U', 'A»':'U', 'ÿ':'U'}
for key, value in dic.items():
    df['ROLE'] = df['ROLE'].str.replace(key, value)
However, is there a better way of dealing with this issue, perhaps using a regular expression?
Below is a solution I found that works when printing a single string:
nfd_example = 'KOMPLEMENTäR'
print(nfd_example.encode('cp1252').decode('utf-8-sig'))
Output:
KOMPLEMENTäR
However, when I try the same code on a pandas column, I get the following error:
df['ROLE_decode'] = df['ROLE'].str.encode('cp1252').str.decode('utf-8-sig')
'utf-8' codec can't decode byte 0xc4 in position 6: invalid continuation byte
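One possible reading of this error, sketched below: the column mixes mis-decoded values with values that are already correct. Byte 0xc4 is 'Ä' in cp1252, so a value that already contains a real 'Ä' at position 6 cannot survive the cp1252 to UTF-8 round trip. The helper name `try_round_trip` and both sample strings are assumptions for illustration; the mis-decoded form is assumed to be the UTF-8 bytes of 'ä' read as cp1252.

```python
def try_round_trip(text):
    """Return the cp1252 -> UTF-8 round trip of text, or the error message if it fails."""
    try:
        return text.encode('cp1252').decode('utf-8')
    except UnicodeError as exc:
        return f'failed: {exc}'

# Assumed sample 1: already correct. 'Ä' (U+00C4) encodes to the single byte 0xC4,
# which is a UTF-8 lead byte but is followed by 'R', not a continuation byte.
already_ok = 'AKTION\u00c4R'

# Assumed sample 2: mis-decoded. The UTF-8 bytes of 'ä' (0xC3 0xA4) read as cp1252.
mojibake = 'AKTION\u00c3\u00a4R'

print(try_round_trip(already_ok))
# failed: 'utf-8' codec can't decode byte 0xc4 in position 6: invalid continuation byte

print(try_round_trip(mojibake) == 'AKTION\u00e4R')  # True
```

If this reading is right, the round trip has to be attempted value by value rather than on the whole column at once.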
EDIT
Below is the list of unique values in the ROLE column, with their counts:
AKTIONäR 133
AKTIONÄR 11
AUFSICHTSRAT 450
AUSüBENDE PERSON 688
AUSÜBENDE PERSON 131
DEFAULT KEY 62
GESCHäFTSFüHRENDER DIREKTOR 2
GESCHäFTSFüHRER 9555
When using the code below:
df['ROLE_decode'] = df['ROLE'].str.encode('cp1252').str.decode('utf-8-sig', 'ignore').apply(unidecode)
I get the following unique values:
AKTIONR 11
AKTIONaR 133
AUFSICHTSRAT 450
AUSBENDE PERSON 131
AUSuBENDE PERSON 688
DEFAULT KEY 62
GESCHFTSFHRER 797
GESCHaFTSFuHRENDER DIREKTOR 2
GESCHaFTSFuHRER 9555
So, if anyone has an idea, thanks for your help!
Answer 1:
You can pass regex=True to replace:
# the included dic seems to have `A` instead of 'Ã'
dic ={'ü':'U', 'ä':'A'}
df['ROLE'] = df['ROLE'].replace(dic, regex=True)
Output:
ROLE NAME
0 GESELLSCHAFTER DUPONT DUPONT
1 GESCHAFTSFUHRER DUPONT DUPONT
2 KOMPLEMENTAR DUPONT DUPONT
3 GESELLSCHAFTER DUPONT DUPONT
4 KOMPLEMENTAR DUPONT DUPONT
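For a longer mapping, the same idea can be sketched in plain Python with one compiled pattern and a single pass per string. This is only an illustration, not the code above: the names `replacements`, `pattern`, and `replace_sequences` are hypothetical, and the keys assume the mis-decoded sequences are 'Ã' followed by '¼' or '¤'.

```python
import re

# Hypothetical mapping: mis-decoded sequence -> plain replacement.
# '\u00c3\u00bc' is 'Ã' + '¼' and '\u00c3\u00a4' is 'Ã' + '¤' (assumed forms).
replacements = {'\u00c3\u00bc': 'U', '\u00c3\u00a4': 'A'}

# One alternation over all keys; re.escape protects keys that happen to
# contain regex metacharacters.
pattern = re.compile('|'.join(re.escape(key) for key in replacements))

def replace_sequences(text):
    """Replace every mapped sequence in text in a single pass."""
    return pattern.sub(lambda match: replacements[match.group(0)], text)

print(replace_sequences('GESCH\u00c3\u00a4FTSF\u00c3\u00bcHRER'))  # GESCHAFTSFUHRER
```

On a DataFrame this could be applied with something like df['ROLE'].map(replace_sequences), assuming the column contains only strings.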
Answer 2:
This solution may be slow on a large dataset. First decompose the characters with unicodedata (NFD), then encode to ASCII while ignoring errors, which drops the accents, and finally decode the bytes back to a string.
from unicodedata import normalize

df.ROLE.apply(lambda x: normalize('NFD', x).encode(
    'ascii', 'ignore').decode('utf-8-sig'))
# 0 AKTIONAR
# 1 AKTIONAR
# 2 AUFSICHTSRAT
# 3 AUSABENDE PERSON
# 4 AUSUBENDE PERSON
# 5 DEFAULT KEY
# 6 GESCHAFTSFAHRENDER DIREKTOR
# 7 GESCHAFTSFAHRER
# Name: ROLE, dtype: object
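A variant that first tries to undo the encoding mix-up and only then strips accents might look like the sketch below. It is not the code above: `repair_and_strip` is a hypothetical helper, the sample strings are assumed forms of the data, and the final upper() assumes every role is meant to be upper case.

```python
import unicodedata

def repair_and_strip(text):
    """Undo a cp1252/UTF-8 mix-up when possible, then drop accents."""
    try:
        # Succeeds only for values whose UTF-8 bytes were mis-read as cp1252.
        text = text.encode('cp1252').decode('utf-8')
    except UnicodeError:
        # Values that are already correct (e.g. a real 'Ä') are left unchanged.
        pass
    # NFD splits 'ä' into 'a' plus a combining diaeresis; drop the combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    stripped = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: roles are all upper case, so a repaired 'a' should become 'A'.
    return stripped.upper()

roles = [
    'AKTION\u00c3\u00a4R',  # assumed mis-decoded form
    'AKTION\u00c4R',        # already correct, contains a real 'Ä'
    'AUFSICHTSRAT',         # plain ASCII
]
print([repair_and_strip(role) for role in roles])
# ['AKTIONAR', 'AKTIONAR', 'AUFSICHTSRAT']
```

Because the round trip is attempted per value, correct and mis-decoded rows end up with the same spelling. On the DataFrame, df['ROLE'].map(repair_and_strip) would be one way to apply it, assuming the column holds only strings.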
Source: https://stackoverflow.com/questions/61680599/decode-characters-pandas