Python Pandas removing substring using another column

本秂侑毒 提交于 2019-11-29 06:48:14

Here is one solution that is quite a bit faster than your current solution, I'm not convinced that there wouldn't be something faster though

In [13]: import numpy as np
         import pandas as pd
         n = 1000
         testing  = pd.DataFrame({'NAME':[
         'FIRST', np.nan, 'NAME2', 'NAME3', 
         'NAME4', 'NAME5', 'NAME6']*n, 'FULL_NAME':['FIRST LAST', np.nan, 'FIRST  LAST', 'FIRST NAME3', 'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME']*n})

This is kind of a long one liner but it should do what you need

Fasted solution I can come up with is using replace as mentioned in another answer:

In [37]: %timeit testing ['NEW2'] = [e.replace(k, '') for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
100 loops, best of 3: 4.67 ms per loop

Original answer:

In [14]: %timeit testing ['NEW'] = [''.join(str(e).split(k)) for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
100 loops, best of 3: 7.24 ms per loop

compared to your current solution:

In [16]: %timeit testing['NEW1'] = testing.apply(address_remove, axis=1)
10 loops, best of 3: 166 ms per loop

These get you the same answer as your current solution

You could do it with replace method and regex argument and then use str.strip:

In [605]: testing.FULL_NAME.replace(testing.NAME[testing.NAME.notnull()], '', regex = True).str.strip()
Out[605]: 
0            LAST
1             NaN
2      FIRST LAST
3           FIRST
4     FIRST  LAST
5    ANOTHER NAME
6       LAST NAME
Name: FULL_NAME, dtype: object

Note You need to pass notnull to testing.NAME because without it NaN values also will be replaced to empty string

Benchmarking is slower then fastest @johnchase solution but I think it's more readable and use all pandas methods of DataFrames and Series:

In [607]: %timeit testing['NEW'] = testing.FULL_NAME.replace(testing.NAME[testing.NAME.notnull()], '', regex = True).str.strip()
100 loops, best of 3: 4.56 ms per loop

In [661]: %timeit testing ['NEW'] = [e.replace(k, '') for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
1000 loops, best of 3: 450 µs per loop

I think you want to use the replace() method that strings have, it's orders of magnitude faster than using regular expressions (I just checked quickly in IPython):

%timeit mystr.replace("ello", "")
The slowest run took 7.64 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 250 ns per loop

%timeit re.sub("ello","", "e")
The slowest run took 21.03 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 4.7 µs per loop

If you need further speed improvements after that, you should look into numpy's vectorize function (but I think the speed up from using replace instead of regular expressions should be pretty substantial).

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!