Transform Pandas string column containing unicodes to ascii to load urls

Submitted by 孤街浪徒 on 2021-02-05 12:12:31

Question


I have a pandas DataFrame with a column of Wikipedia URLs that I want to load. However, some of the strings won't load because they contain percent-encoded Unicode characters. For example, strings such as 'Kruskal %E2%80%93Wallis_one-way_analysis_of_variance' raise an error like the following:

PageError: Page id "Cauchy%E2%80%93Schwarz_inequality" does not match any pages. Try another id!

Is there a way to turn all of these encoded Unicode characters back into plain text? In this case, I need a function that creates a new column:

old column                            new column
Cauchy%E2%80%93Schwarz_inequality     Cauchy–Schwarz_inequality
Markov%27s_inequality                 Markov's_inequality

Answer 1:


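urllib.parse.unquote should do the trick: it decodes percent-encoded sequences such as %27 and %E2%80%93 back into the characters they represent. Hope this helps.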

In [1]: import urllib.parse
   ...: 
   ...: import pandas as pd
   ...: 
   ...: 
   ...: df = pd.DataFrame({'url': ['Markov%27s_inequality', 'Cauchy%E2%80%93Schwarz_inequality']})
   ...: df['clean_url'] = df['url'].apply(urllib.parse.unquote)
   ...: 

In [2]: df
Out[2]: 
                                 url                  clean_url
0              Markov%27s_inequality        Markov's_inequality
1  Cauchy%E2%80%93Schwarz_inequality  Cauchy–Schwarz_inequality
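
If the column may also hold missing values, calling unquote directly would fail on them. The following is a minimal sketch, not part of the original answer, of a small wrapper that skips non-string values; the helper name decode_title and the sample list are hypothetical, chosen only for illustration. It also shows that the decoded en dash is a Unicode character rather than ASCII, so the result is "decoded text", not strictly "ascii".

```python
from urllib.parse import unquote


def decode_title(value):
    """Percent-decode a string; return missing or non-string values unchanged."""
    # decode_title is a hypothetical helper name used for this sketch.
    return unquote(value) if isinstance(value, str) else value


# Sample data assumed for illustration; None stands in for a missing URL.
titles = ["Markov%27s_inequality", "Cauchy%E2%80%93Schwarz_inequality", None]
decoded = [decode_title(t) for t in titles]

# %27 is an ASCII apostrophe, while %E2%80%93 is the UTF-8 encoding of the
# en dash (U+2013), so the second decoded title is Unicode, not ASCII.
print(decoded[0], decoded[0].isascii())
print(decoded[1], decoded[1].isascii())
```

With a DataFrame like the one above, the same helper could presumably be applied as df['clean_url'] = df['url'].map(decode_title), so that rows without a URL pass through instead of raising a TypeError.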


Source: https://stackoverflow.com/questions/50837619/transform-pandas-string-column-containing-unicodes-to-ascii-to-load-urls
