Question
I've tried searching around and can't figure out an easy way to do this, so I'm hoping your expertise can help.
I have a pandas DataFrame with two columns:
import numpy as np
import pandas as pd
pd.options.display.width = 1000
testing = pd.DataFrame({
    'NAME': ['FIRST', np.nan, 'NAME2', 'NAME3', 'NAME4', 'NAME5', 'NAME6'],
    'FULL_NAME': ['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3',
                  'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME']})
which gives me
          FULL_NAME   NAME
0        FIRST LAST  FIRST
1               NaN    NaN
2        FIRST LAST  NAME2
3       FIRST NAME3  NAME3
4  FIRST NAME4 LAST  NAME4
5      ANOTHER NAME  NAME5
6         LAST NAME  NAME6
What I'd like to do is take the value from the 'NAME' column and remove it from the 'FULL_NAME' column if it's there. So the function would then return
          FULL_NAME   NAME           NEW
0        FIRST LAST  FIRST          LAST
1               NaN    NaN           NaN
2        FIRST LAST  NAME2    FIRST LAST
3       FIRST NAME3  NAME3         FIRST
4  FIRST NAME4 LAST  NAME4    FIRST LAST
5      ANOTHER NAME  NAME5  ANOTHER NAME
6         LAST NAME  NAME6     LAST NAME
So far, I've defined the function below and am applying it with the apply method. This runs rather slowly on my large data set, though, and I'm hoping there's a more efficient way to do it. Thanks!
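import re

def address_remove(x):
    # Remove the NAME value from FULL_NAME, then trim surrounding whitespace.
    # The column is referenced by name here rather than by position (x[-1]).
    try:
        newADDR1 = re.sub(x['NAME'], '', x['FULL_NAME'])
        return newADDR1.strip()
    except (TypeError, re.error):
        # NaN values (floats) or an invalid pattern: return the value unchanged.
        return x['FULL_NAME']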
Answer 1:
Here is one solution that is quite a bit faster than your current one, although I'm not convinced there isn't something faster still.
In [13]: import numpy as np
import pandas as pd
n = 1000
testing = pd.DataFrame({'NAME':[
'FIRST', np.nan, 'NAME2', 'NAME3',
'NAME4', 'NAME5', 'NAME6']*n, 'FULL_NAME':['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3', 'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME']*n})
This is kind of a long one-liner, but it should do what you need.
The fastest solution I can come up with uses the string replace method, as mentioned in another answer:
In [37]: %timeit testing['NEW2'] = [e.replace(k, '') for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
100 loops, best of 3: 4.67 ms per loop
Original answer:
In [14]: %timeit testing['NEW'] = [''.join(str(e).split(k)) for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
100 loops, best of 3: 7.24 ms per loop
compared to your current solution:
In [16]: %timeit testing['NEW1'] = testing.apply(address_remove, axis=1)
10 loops, best of 3: 166 ms per loop
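These give you the same answer as your current solution, with two caveats: they do not strip the whitespace left behind (row 0 becomes ' LAST'), and, in the pandas version used here, astype('str') turns NaN into the string 'nan', so the NaN row comes out as an empty string.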
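If the leftover whitespace or the NaN handling matters, the list comprehension above can be adapted. This is only a sketch under a few assumptions: remove_name is a hypothetical helper name, missing values are passed through untouched instead of going through astype('str'), and runs of whitespace are collapsed to a single space (which goes slightly beyond the strip in the question). I have not timed it against the versions above.

```python
import numpy as np
import pandas as pd

testing = pd.DataFrame({
    'NAME': ['FIRST', np.nan, 'NAME2', 'NAME3', 'NAME4', 'NAME5', 'NAME6'],
    'FULL_NAME': ['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3',
                  'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME'],
})

def remove_name(full, name):
    # Hypothetical helper (not from the original answers).
    # Leave the value untouched when either side is missing.
    if pd.isna(full) or pd.isna(name):
        return full
    # Plain substring removal, then collapse leftover whitespace.
    return ' '.join(full.replace(name, '').split())

testing['NEW'] = [remove_name(full, name)
                  for full, name in zip(testing['FULL_NAME'], testing['NAME'])]
print(testing)
```

The NaN row stays NaN here, and 'FIRST NAME4 LAST' becomes 'FIRST LAST' with a single space.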
Answer 2:
You could do it with the replace method and its regex argument, and then use str.strip:
In [605]: testing.FULL_NAME.replace(testing.NAME[testing.NAME.notnull()], '', regex = True).str.strip()
Out[605]:
0            LAST
1             NaN
2      FIRST LAST
3           FIRST
4      FIRST LAST
5    ANOTHER NAME
6       LAST NAME
Name: FULL_NAME, dtype: object
Note: you need to filter testing.NAME with notnull, because without it NaN values will also be replaced with an empty string.
In benchmarks it is slower than @johnchase's fastest solution, but I think it is more readable, and it uses only pandas DataFrame and Series methods:
In [607]: %timeit testing['NEW'] = testing.FULL_NAME.replace(testing.NAME[testing.NAME.notnull()], '', regex = True).str.strip()
100 loops, best of 3: 4.56 ms per loop
In [661]: %timeit testing['NEW'] = [e.replace(k, '') for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
1000 loops, best of 3: 450 µs per loop
Answer 3:
I think you want to use the replace() method that strings have. It's more than an order of magnitude faster than using regular expressions (I just checked quickly in IPython):
%timeit mystr.replace("ello", "")
The slowest run took 7.64 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 250 ns per loop
%timeit re.sub("ello","", "e")
The slowest run took 21.03 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 4.7 µs per loop
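To rerun this comparison outside IPython with both calls working on the same input, something like the following should do. It is a rough sketch: the sample string "hello world" is an assumption (the value of mystr is not shown above), and the numbers will vary by machine, so none are quoted here.

```python
import re
import timeit

mystr = "hello world"  # assumed sample input; the original value is not shown

# Both approaches should produce the same result for a literal substring.
replaced = mystr.replace("ello", "")
substituted = re.sub("ello", "", mystr)

# Time 100,000 calls of each approach on the same string.
t_replace = timeit.timeit(lambda: mystr.replace("ello", ""), number=100_000)
t_sub = timeit.timeit(lambda: re.sub("ello", "", mystr), number=100_000)

print(f"str.replace: {t_replace:.4f} s for 100,000 calls")
print(f"re.sub:      {t_sub:.4f} s for 100,000 calls")
```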
If you need further speed improvements after that, you could look into numpy's vectorize function, although the NumPy documentation describes it as a convenience that is essentially a for loop, so I'd expect most of the speed-up to come from using replace instead of regular expressions.
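For what it's worth, a minimal sketch of the np.vectorize idea might look like this. The helper name remove_name and the sample arrays are assumptions for illustration, and since np.vectorize still calls the Python function once per element, I would time it on real data before relying on it.

```python
import numpy as np

def remove_name(full, name):
    # Hypothetical helper: pass non-string values (such as NaN) through unchanged.
    if not isinstance(full, str) or not isinstance(name, str):
        return full
    return full.replace(name, '').strip()

# otypes=[object] keeps the results as Python objects, so strings and NaN can mix.
vectorized_remove = np.vectorize(remove_name, otypes=[object])

full_names = np.array(['FIRST LAST', np.nan, 'FIRST NAME3'], dtype=object)
names = np.array(['FIRST', np.nan, 'NAME3'], dtype=object)

result = vectorized_remove(full_names, names)
print(result)
```

With a DataFrame, the same call could take testing.FULL_NAME.values and testing.NAME.values as its two arguments.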
Source: https://stackoverflow.com/questions/34773317/python-pandas-removing-substring-using-another-column