How to extract apartment from address in Pandas

本小妞迷上赌 提交于 2021-01-28 11:46:59

问题


I have a rather messy data set that has a lot of inconsistencies and errors due to manually input data.

I'm working on the address column of this dataset in pandas.

What I would like to do is break the address column into 3 separate entities:

1) a column for the address

2) a column for the street number

3) a column for the apartment or unit number

The data looks like the following:

address
----------------------
123 smith street #5234
5000 john ct
34 wood st apt# 23
523 fire road apt #87
charles way apt. 434
0987 misty lane unit B 

I have already removed the street numbers into their own column. For this I used "np.where" using an easy logical condition that if the string started with digits extract them into the new street column.

I am now stuck on how to do this with apartment numbers.

I am assuming that because of the inconsistencies, I have to do something like:

df['apt/unit'] = np.where(str contains "apt", extract string starting at "apt" until end, else np.NaN)
df['apt/unit'] = np.where(str contains "unit", extract string starting at "unit" until end, else np.NaN)

Will I have to use regex to do this? If so, what is the way to go about that?

Are there any alternatives to this line of thinking?


回答1:


Since you have multiple conditions for your apt/unit column, you can use np.select here like the following:

# Define our conditions
conditions = [
    df.address.str.contains('apt'),
    df.address.str.contains('unit'),
    df.address.str.contains('#')
]

# Define our choices based on our conditions
choices = [
    df.address.apply(lambda x: x[x.find('apt'):]),
    df.address.apply(lambda x: x[x.find('unit'):]),
    df.address.apply(lambda x: x[x.find('#'):])
]

# Apply this logic by creating the new column and cleaning up address column
df['apt/unit'] = np.select(conditions, choices, default = '')

# Clean up our address column
choices2 = [
    df.address.apply(lambda x: x[:x.find('apt')]),
    df.address.apply(lambda x: x[:x.find('unit')]),
    df.address.apply(lambda x: x[:x.find('#')])
]
df['address'] = np.select(conditions, choices2, default = df.address)

Output

print(df)

             address  apt/unit
0  123 smith street      #5234
1       5000 john ct          
2        34 wood st    apt# 23
3     523 fire road    apt #87
4       charles way   apt. 434
5   0987 misty lane     unit B



回答2:


Using Series.str.extract, something like:

pat = r'^(?P<number>\d+)?(?P<street>.+(?=\bapt|\bunit)|.+(?=#)|.+)(?P<apt_unit>(?:\bapt|\bunit|#).+)?'
df.address.str.extract(pat)

[out]

  number          street  apt_unit
0    123   smith street      #5234
1   5000         john ct       NaN
2     34        wood st    apt# 23
3    523      fire road    apt #87
4    NaN    charles way   apt. 434
5   0987     misty lane     unit B



回答3:


If you want to use a regex, here's a way to do it that matches number, street, and apartment in separate named groups and then uses df.apply to populate new columns of the dataframe. Street name types are listed out brute force. Anything left over is the apartment number.

regex = '(?P<num>\d+)? ?(?P<street>\w+ (street|way|st|road|ct|lane)) ?(?P<apt>.+)?'

def get_num(address):
    match = re.match(regex, address)
    return match.group('num')


def get_street(address):
    match = re.match(regex, address)
    return match.group('street')


def get_apt(address):
    match = re.match(address)
    return match.group('apt')


df['num'] = df['address'].apply(get_num)
df['street'] = df['address'].apply(get_street)
df['apt'] = df['address'].apply(get_apt)



回答4:


If you already have removed all street numbers the only numbers left should be apartment numbers. Therefore you can use a simple regex to extract all numbers that are left in this column:

df['apt_number'] = df.address.str.extract(r'([\d]+)')

edit: With the above answer alone the numbers will remain in the addresses column, to remove we simply replace them with an empty string:

df['address'] = df.address.str.replace(r'([\d]+)', '')


来源:https://stackoverflow.com/questions/55105280/how-to-extract-apartment-from-address-in-pandas

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!