Question
How can I create a new column in pandas that is the result of the difference of two other columns consisting of strings?
I have one column titled "Good_Address" with entries like "123 Fake Street Apt 101" and another column titled "Bad_Address" with entries like "123 Fake Street". I want the output in column "Address_Difference" to be " Apt 101".
I've tried doing:
import pandas as pd
data = pd.read_csv("AddressFile.csv")
data['Address Difference'] = data['GOOD_ADR1'].replace(data['BAD_ADR1'],'')
data['Address Difference']
but this does not work. The result is just equal to "123 Fake Street Apt 101" (the good address in the example above).
I've also tried:
data['Address Difference'] = data['GOOD_ADR1'].str.replace(data['BAD_ADR1'],'')
but this yields an error saying 'Series' objects are mutable, thus they cannot be hashed.
Any help would be appreciated.
Thanks
Answer 1:
Use replace with regex:
data['Address Difference']=data['GOOD_ADR1'].replace(regex=r'(?i)'+ data['BAD_ADR1'],value="")
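One caveat when a pattern is built directly from address text is that characters such as "." or "(" are read as regex syntax rather than literally. The sketch below shows one way to get a case-insensitive removal that treats the bad address as literal text, assuming plain Python strings. The names remove_ignoring_case, good_addresses, and bad_addresses are hypothetical, and the sample rows are made up for illustration.

```python
import re

def remove_ignoring_case(good, bad):
    """Remove every case-insensitive occurrence of `bad` from `good`."""
    # re.escape makes characters such as '.' or '(' match literally.
    pattern = re.compile(re.escape(bad), flags=re.IGNORECASE)
    return pattern.sub("", good)

# Hypothetical sample data standing in for the two address columns.
good_addresses = ["123 FAKE STREET Apt 101", "5 Main St. (Rear) Unit 2"]
bad_addresses = ["123 Fake Street", "5 main st. (rear)"]

# Pair each good address with the bad address from the same row.
differences = [
    remove_ignoring_case(good, bad)
    for good, bad in zip(good_addresses, bad_addresses)
]
print(differences)
```

With a DataFrame, the same list comprehension can zip data['GOOD_ADR1'] with data['BAD_ADR1'], and the resulting list can be assigned to data['Address Difference'].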
Answer 2:
I'd use a function that we can map across inputs. This should be fast.
The function uses str.find to see if the other string is a substring. If the result of str.find is -1, the substring could not be found. Otherwise, cut the substring out using the position where it was found and the length of the substring.
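def rm(x, y):
    # Position of y inside x, or -1 if y does not occur.
    i = x.find(y)
    if i > -1:
        j = len(y)
        # Keep everything before and after the matched substring.
        return x[:i] + x[i+j:]
    else:
        return x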
df['Address Difference'] = [*map(rm, df.GOOD_ADR1, df.BAD_ADR1)]
df
          BAD_ADR1                GOOD_ADR1 Address Difference
0  123 Fake Street  123 Fake Street Apt 101            Apt 101
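To make the find-and-slice behaviour concrete, here is a small sketch that runs the same logic on three inputs, assuming plain Python strings. The name remove_first is hypothetical and stands in for rm above; the sample addresses are made up.

```python
def remove_first(text, part):
    """Return `text` with the first occurrence of `part` cut out."""
    i = text.find(part)
    if i == -1:
        # `part` does not occur, so the text is returned unchanged.
        return text
    return text[:i] + text[i + len(part):]

# Match at the start: the separating space is left behind.
at_start = remove_first("123 Fake Street Apt 101", "123 Fake Street")
# Match at the end: the text before it is kept as is.
at_end = remove_first("Apt 101, 123 Fake Street", "123 Fake Street")
# No match: nothing is removed.
no_match = remove_first("9 Elm Road", "123 Fake Street")

print(repr(at_start), repr(at_end), repr(no_match))
```

Note that the leftover whitespace is not trimmed, so calling .strip() on the result may be worthwhile if a clean "Apt 101" is wanted.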
Answer 3:
You can remove the bad-address part from the good address:
df['Address_Difference'] = df['Good_Address'].replace(df['Bad_Address'], '', regex = True).str.strip()
       Bad_Address             Good_Address Address_Difference
0  123 Fake Street  123 Fake Street Apt 101            Apt 101
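The per-row pairing can also be written out explicitly, which may be easier to reason about than passing a whole Series as the pattern. This is a minimal sketch, assuming both columns contain only strings (no missing values) and that a literal, case-sensitive match is wanted; the sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question text.
df = pd.DataFrame({
    "Good_Address": ["123 Fake Street Apt 101", "9 Elm Road"],
    "Bad_Address": ["123 Fake Street", "77 Oak Avenue"],
})

# For each row, remove all literal occurrences of the bad address from
# the good address, then trim the leftover surrounding whitespace.
df["Address_Difference"] = [
    good.replace(bad, "").strip()
    for good, bad in zip(df["Good_Address"], df["Bad_Address"])
]

print(df["Address_Difference"].tolist())
```

Rows where the bad address does not occur in the good address keep the full good address, as in the second sample row.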
Source: https://stackoverflow.com/questions/53288887/how-do-i-create-a-new-column-in-pandas-from-the-difference-of-two-string-columns