Question:
I have a function find_country_from_connection_ip which takes an IP and, after some processing, returns a country, like below:
def find_country_from_connection_ip(ip):
    # Do some processing
    return country
I am using the function inside the apply method, like below:
df['Country'] = df.apply(lambda x: find_country_from_ip(x['IP']), axis=1)
It is pretty straightforward: I want to compute a new column from an existing column in a DataFrame which has >400,000 rows.
It runs, but it is terribly slow and emits a warning like the one below:
...........: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
I understand the problem, but can't quite figure out how to use loc with apply and lambda.
N.B. If you know a more efficient alternative approach that achieves the same end result, please suggest it.
EDIT:
The function is mainly a lookup against an mmdb database, like below:
import re
import subprocess

def find_country_from_ip(ip):
    result = subprocess.Popen(
        "mmdblookup --file GeoIP2-Country.mmdb --ip {} country names en".format(ip).split(" "),
        stdout=subprocess.PIPE).stdout.read()
    if result:
        return re.search(r'\"(.+?)\"', result).group(1)
    else:
        final_output = subprocess.Popen(
            "mmdblookup --file GeoIP2-Country.mmdb --ip {} registered_country names en".format(ip).split(" "),
            stdout=subprocess.PIPE).stdout.read()
        return re.search(r'\"(.+?)\"', final_output).group(1)
This is nevertheless a costly operation, and with a DataFrame of >400,000 rows it is bound to take time. But how much? That is the question. It currently takes about 2 hours, which I think is far too long.
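One rough mitigation sketch that keeps mmdblookup (it assumes the column is really named 'IP' and reuses the find_country_from_ip above): since many rows usually share the same address, each unique IP can be looked up once and the results mapped back, so the number of subprocess calls drops from one per row to one per unique IP.

# Sketch: resolve each distinct IP once, then broadcast the result to all rows
# (assumes df has an 'IP' column and find_country_from_ip is defined as above).
unique_ips = df['IP'].unique()
ip_to_country = {ip: find_country_from_ip(ip) for ip in unique_ips}
df['Country'] = df['IP'].map(ip_to_country)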
Answer 1:
I would use the maxminddb-geolite2 (GeoLite) module for that.
First, install the maxminddb-geolite2 module:
pip install maxminddb-geolite2
Python Code:
import pandas as pd
from geolite2 import geolite2

def get_country(ip):
    try:
        x = geo.get(ip)
    except ValueError:
        return pd.np.nan
    try:
        return x['country']['names']['en'] if x else pd.np.nan
    except KeyError:
        return pd.np.nan

geo = geolite2.reader()

# it took me quite some time to find a free and large enough list of IPs ;)
# IPs for testing: http://upd.emule-security.org/ipfilter.zip
x = pd.read_csv(r'D:\download\ipfilter.zip',
                usecols=[0], sep='\s*\-\s*',
                header=None, names=['ip'])

# get unique IPs
unique_ips = x['ip'].unique()
# make a Series out of it
unique_ips = pd.Series(unique_ips, index=unique_ips)

# map IP --> country
x['country'] = x['ip'].map(unique_ips.apply(get_country))

geolite2.close()
Output:
In [90]: x
Out[90]:
                 ip     country
0   000.000.000.000         NaN
1   001.002.004.000         NaN
2   001.002.008.000         NaN
3   001.009.096.105         NaN
4   001.009.102.251         NaN
5   001.009.106.186         NaN
6   001.016.000.000         NaN
7   001.055.241.140         NaN
8   001.093.021.147         NaN
9   001.179.136.040         NaN
10  001.179.138.224    Thailand
11  001.179.140.200    Thailand
12  001.179.146.052         NaN
13  001.179.147.002    Thailand
14  001.179.153.216    Thailand
15  001.179.164.124    Thailand
16  001.179.167.188    Thailand
17  001.186.188.000         NaN
18  001.202.096.052         NaN
19  001.204.179.141       China
20  002.051.000.165         NaN
21  002.056.000.000         NaN
22  002.095.041.202         NaN
23  002.135.237.106  Kazakhstan
24  002.135.237.250  Kazakhstan
..              ...         ...
Timing: for 171,884 unique IPs:
In [85]: %timeit unique_ips.apply(get_country)
1 loop, best of 3: 14.8 s per loop

In [86]: unique_ips.shape
Out[86]: (171884,)
Conclusion: it would take approx. 35 seconds for your DF with 400K unique IPs on my hardware:
In [93]: 400000/171884*15
Out[93]: 34.90726303786274
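Applied back to the question's DataFrame, the same unique-then-map idea might look like the rough sketch below (it assumes the asker's column is named 'IP' and reuses the get_country above; on newer pandas, where pd.np has been removed, get_country should return numpy's np.nan instead):

import pandas as pd
from geolite2 import geolite2

geo = geolite2.reader()

# look each distinct IP up only once, then map the result onto every row
# (assumes df has an 'IP' column and get_country is defined as above)
unique_ips = pd.Series(df['IP'].unique(), index=df['IP'].unique())
df['Country'] = df['IP'].map(unique_ips.apply(get_country))

geolite2.close()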
Answer 2:
IIUC you can use your custom function with Series.apply this way:
df['Country'] = df['IP'].apply(find_country_from_ip)
Sample:
df = pd.DataFrame({'IP':[1,2,3], 'B':[4,5,6]})

def find_country_from_ip(ip):
    # Do some processing
    # some testing formula
    country = ip + 5
    return country

df['Country'] = df['IP'].apply(find_country_from_ip)
print (df)

   B  IP  Country
0  4   1        6
1  5   2        7
2  6   3        8
Answer 3:
Your issue isn't with how to use apply or loc. The issue is that your df is flagged as a copy of another dataframe.
Let's explore this a bit
df = pd.DataFrame(dict(IP=[1, 2, 3], A=list('xyz')))
df

   A  IP
0  x   1
1  y   2
2  z   3
def find_country_from_connection_ip(ip):
    return {1: 'A', 2: 'B', 3: 'C'}[ip]

df['Country'] = df.IP.apply(find_country_from_connection_ip)
df

   A  IP Country
0  x   1       A
1  y   2       B
2  z   3       C
No Problems
Let's make some problems
# This should make a copy
print(bool(df.is_copy))

df = df[['A', 'IP']]
print(df)
print(bool(df.is_copy))

False
   A  IP
0  x   1
1  y   2
2  z   3
True
Perfect, now we have a copy. Let's perform the same assignment with apply:
df['Country'] = df.IP.apply(find_country_from_connection_ip)
df
//anaconda/envs/3.5/lib/python3.5/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':

   A  IP Country
0  x   1       A
1  y   2       B
2  z   3       C
How do you fix it?
Wherever you created df, use df.loc. In my example above, the slice df = df[['A', 'IP']] is what triggered the copy. If I had used loc instead, I'd have avoided this mess.
print(bool(df.is_copy))

df = df.loc[:]
print(df)
print(bool(df.is_copy))

False
   A  IP
0  x   1
1  y   2
2  z   3
False
You need to either find where df is created and use loc or iloc instead when you slice the source dataframe. Or, you can simply do this...
df.is_copy = None
The full demonstration
df = pd.DataFrame(dict(IP=[1, 2, 3], A=list('xyz')))

def find_country_from_connection_ip(ip):
    return {1: 'A', 2: 'B', 3: 'C'}[ip]

df = df[:]
df.is_copy = None
df['Country'] = df.IP.apply(find_country_from_connection_ip)
df

   A  IP Country
0  x   1       A
1  y   2       B
2  z   3       C
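A closing note, hedged for current pandas: the is_copy attribute has since been deprecated and removed, so on a modern install the idiomatic fix is an explicit copy of the slice, for example:

# On recent pandas there is no df.is_copy; taking an explicit .copy()
# detaches the slice, so the later assignment raises no warning.
df = df[['A', 'IP']].copy()
df['Country'] = df.IP.apply(find_country_from_connection_ip)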