Regex syntax for replacing multiple strings: where have I gone wrong?

廉价感情. Submitted on 2021-01-29 05:57:53

Question


I have a dataframe with the column 'purpose' that has a lot of string values that I want to standardize by finding a string and replacing it.

For instance, some very similar values are car purchase, buying a second-hand car, buying my own car, cars, second-hand car purchase, car, to own a car, purchase of a car, to buy a car

I used the following to make this change:

#replace anything to do with buying a car with "Vehicle"

credit_data['purpose'] = credit_data.purpose.str.replace(r'(^.*car.*$)','Vehicle')

and it worked great, all of those values were replaced with 'Vehicle'

I have a number of other similar strings in this column for other types, like education - supplementary education, education, getting an education, to get a supplementary education, university education, etc.

so, I looked up regex syntax and came up with the following:

#replace anything to do with education with "Education"

credit_data['purpose'] = credit_data.purpose.str.replace(r'(^.*education|university|educated.*$)','Education')

the results for this are similar to above - everything says education now - yay!

which brings me to my question - I've gone wrong somewhere in applying this to some of my other strings - for instance, I used a similar method for real estate:

#replace anything to do with real estate with real estate

credit_data['purpose'] = credit_data.purpose.str.replace(r'(^.*real estate|housing|house|property.*$)','Real Estate')

and my results here are different. I started with values like purchase my own house, building a house, purchase of a property, etc., and all the method seems to have done is replace just the keyword I identified, rather than replacing the entire string with the replacement string.

so instead of having a bunch of entries that say "Real Estate" I have a bunch of entries that say purchase my own Real Estate, building a Real Estate, purchase of a Real Estate, etc.

I'm not sure where I've gone wrong?

Thanks in advance.

edited to add requested series from the dataframe:

Df = [purchase of the house, car purchase, supplementary education, to have a wedding, housing, transactions, education, having a wedding, purchase of the house for my family, buy real estate, buy commercial real estate, buy residential real estate, construction of own property, property, building a property, buying a second-hand car, buying my own car, transactions with commercial real estate, building a real estate, housing, transactions with my real estate, cars, to become educated, second-hand car purchase, getting an education, car, wedding ceremony, to get a supplementary education, purchase of my own house, real estate transactions, getting higher education, to own a car, purchase of a car, profile education, university education, buying property for renting out, to buy a car, housing renovation, going to university]


Answer 1:


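The alternation in your pattern is not grouped. | has the lowest precedence in a regular expression, so ^.* applies only to the first alternative (real estate) and .*$ only to the last (property), while the alternatives in between match just the keyword itself. Put the alternatives in parentheses so that whatever surrounds the group applies to all of them. You can also use \b to match a word boundary and IGNORECASE to cover case issues. For example, this replaces only the keyword: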

import re  # needed for re.IGNORECASE
credit_data['purpose'] = credit_data.purpose.str.replace(r'\b(real estate|housing|house|property)\b',
    'Real Estate', regex=True, flags=re.IGNORECASE)
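
To see what the word-boundary pattern does to individual values, here is a minimal sketch using Python's built-in re module on a few of the sample strings from the question. With regex=True, Series.str.replace applies the same kind of substitution to each element. The helper name replace_keyword and the mixed-case sample are made up for illustration.

```python
import re

# Same pattern as above: keyword alternatives grouped, with word boundaries.
KEYWORDS = re.compile(r'\b(real estate|housing|house|property)\b', re.IGNORECASE)

def replace_keyword(text):
    """Replace only the matched keyword, leaving the rest of the string alone."""
    return KEYWORDS.sub('Real Estate', text)

samples = ['purchase of the house', 'housing renovation', 'Building a Property', 'car purchase']
results = [replace_keyword(s) for s in samples]
for before, after in zip(samples, results):
    print(f'{before!r} -> {after!r}')
```

Only the keyword changes here, so 'purchase of the house' becomes 'purchase of the Real Estate'. That is useful when the surrounding words should be kept, but it is not what the question asks for.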

If you want to replace the entire string, put .* on both sides of the group so that the match also covers everything before and after the keyword.

credit_data['purpose'] = credit_data.purpose.str.replace(r'.*(real estate|housing|house|property).*',
    'Real Estate', regex=True, flags=re.IGNORECASE)
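
The difference from the pattern in the question comes down to how | binds. Without parentheses around the keywords, the whole expression is split at each |, so the anchors and .* attach only to the first and last alternatives. The sketch below uses only the re module and sample strings from the question to compare the two patterns; the variable names are illustrative.

```python
import re

# Pattern from the question: '|' splits the WHOLE expression into four alternatives:
#   '^.*real estate'  |  'housing'  |  'house'  |  'property.*$'
ungrouped = re.compile(r'(^.*real estate|housing|house|property.*$)')

# Grouping the keywords makes '.*' apply on both sides of every alternative.
grouped = re.compile(r'.*(real estate|housing|house|property).*', re.IGNORECASE)

samples = ['purchase of my own house', 'buy real estate', 'building a property', 'to buy a car']

ungrouped_results = [ungrouped.sub('Real Estate', s) for s in samples]
grouped_results = [grouped.sub('Real Estate', s) for s in samples]

for s, u, g in zip(samples, ungrouped_results, grouped_results):
    print(f'{s!r}: ungrouped -> {u!r}, grouped -> {g!r}')
```

With the ungrouped pattern, 'purchase of my own house' only has 'house' replaced, while 'buy real estate' is replaced in full because ^.* belongs to that first alternative. This is also why the education pattern appeared to work: most of those values end in 'education', so the first alternative, ^.*education, matched the whole string.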


Source: https://stackoverflow.com/questions/64830587/regex-syntax-for-replacing-multiple-strings-where-have-i-gone-wrong
