Splitting a column in a DataFrame based on multiple possible delimiters

Posted by 让人想犯罪 __ on 2021-01-28 18:41:53

Question


I have an address column in a pandas DataFrame that holds three pieces of information: street, colony and city.

The three values are separated by one of two possible delimiters, either a ',' or a whitespace, e.g. it can be either Street1,Colony1,City1 or Street1 Colony1 City1.

I need to split this column into three with respective labels 'Street','Colony' and 'City' with the values from this Address column split accordingly.

What is the most efficient way to do this? The pandas split function only accepts a single delimiter or a regular expression (maybe a regex is the answer here, but I'm not very good with regex).


Answer 1:


If you are certain the delimiter is either a comma or a single space, you could use:

df[['Street','Colony','City']] = df.address.str.split('[ ,]', expand=True)

Explanation: str.split accepts a pat (pattern) parameter: a string or regular expression to split on. If it is not specified, the split is on whitespace. Because we can pass a regular expression, this becomes an easy task: [ ,] in regex means either a space or a comma.

An alternative would be to use ' |,' or, if there can be several whitespace characters between values, r'\s+|,'
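The difference between these patterns only shows up when a row has more than one whitespace character between values. Here is a minimal sketch on made-up data (the sample addresses and variable names are assumptions, not from the question). The last pattern, r'[\s,]+', is an extra variant not mentioned in the answer; it may help if a comma can be followed by a space.

```python
import pandas as pd

# Hypothetical sample: the second row has two spaces after the street.
addresses = pd.Series(['Street1,Colony1,City1', 'Street2  Colony2 City2'])

# '[ ,]' splits at every single space, so the double space yields an empty field.
per_char = addresses.str.split('[ ,]', expand=True)

# r'\s+|,' treats a whole run of whitespace as one delimiter.
per_run = addresses.str.split(r'\s+|,', expand=True)

print(per_char.shape)  # one column more than expected
print(per_run.shape)   # exactly three columns

# r'[\s,]+' (an assumption, not from the answer) also absorbs ", ".
mixed = pd.Series(['Street3, Colony3, City3']).str.split(r'[\s,]+', expand=True)
print(mixed.iloc[0].tolist())
```

With '[ ,]' the extra empty field produces a fourth column, so assigning the result to three named columns would no longer line up.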


Full example:

import pandas as pd

df = pd.DataFrame({
    'address': ['a,b,c','a b c']
})

df[['Street','Colony','City']] = df.address.str.split('[ ,]', expand=True)

print(df)

Returns:

  address Street Colony City
0   a,b,c      a      b    c
1   a b c      a      b    c
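The assignment to three columns relies on every row splitting into at most three parts. If some rows might contain an extra delimiter, one possible safeguard is the n parameter of str.split, which caps the number of splits. This is a sketch on hypothetical data, not part of the original answer:

```python
import pandas as pd

# Hypothetical data: the second row contains one delimiter too many.
df = pd.DataFrame({'address': ['a,b,c', 'a b c d']})

# n=2 allows at most two splits, so there are never more than three parts;
# anything after the second delimiter stays in the last part.
parts = df['address'].str.split('[ ,]', n=2, expand=True)
df[['Street', 'Colony', 'City']] = parts

print(df)
```

Whether the leftover text belongs in the last column depends on the data, so this only keeps the column count stable; it does not decide which delimiter was the "real" one.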



Answer 2:


One way to accomplish this would be to use re.sub to consolidate your delimiters, then use str.split on that single delimiter to create your new columns.

import pandas as pd 
import re

df = pd.DataFrame({'address':['Street1,Colony1,City1',  'Street2 Colony2 City2']})

location_df = (df.address
                 .apply(lambda x: pd.Series(re.sub(pattern=' |,', 
                                                   repl=',', 
                                                   string=x).split(','), 
                                            index=['street','colony','city']))
                )
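The same "consolidate, then split" idea can also be sketched with pandas' own string methods instead of apply and re.sub. This is an alternative formulation, not the answerer's code, and it assumes the only delimiters are a single space or a comma:

```python
import pandas as pd

df = pd.DataFrame({'address': ['Street1,Colony1,City1', 'Street2 Colony2 City2']})

# Step 1: turn every space into a comma (literal replacement, no regex).
unified = df['address'].str.replace(' ', ',', regex=False)

# Step 2: split on the single remaining delimiter and name the columns.
location_df = unified.str.split(',', expand=True)
location_df.columns = ['street', 'colony', 'city']

print(location_df)
```

Avoiding apply with a per-row pd.Series is usually faster on large frames, though for a handful of rows the difference is unlikely to matter.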



Answer 3:


Try this (it needs import re):

df[['Street','Colony','City']] = df.address.apply(lambda x: pd.Series(re.split(r'\W', x)))

\W matches any character that is not a word character, so it splits on commas and spaces alike. See the Python re module documentation.
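Because \W matches every non-word character, it is broader than a pattern that lists only comma and space. A small sketch with plain re.split shows what that means in practice (the sample strings are made up for illustration):

```python
import re

# Commas and spaces are non-word characters, so both act as delimiters.
by_comma = re.split(r'\W', 'Street1,Colony1,City1')
by_space = re.split(r'\W', 'Street1 Colony1 City1')

# An underscore counts as a word character, so it does not split.
with_underscore = re.split(r'\W', 'Main_St Colony1 City1')

# A hyphen is a non-word character, so it splits as well.
with_hyphen = re.split(r'\W', 'Street-1 Colony1 City1')

print(by_comma)
print(by_space)
print(with_underscore)
print(with_hyphen)
```

The hyphen case yields four parts instead of three, so this approach fits best when the values themselves contain only letters, digits and underscores.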



Source: https://stackoverflow.com/questions/52794287/splitting-a-column-in-a-dataframe-based-on-multiple-possible-delimiters
