Pandas: fastest way to resolve IP to country

Submitted anonymously (unverified) on 2019-12-03 08:57:35

Question:

I have a function find_country_from_connection_ip which takes an IP and, after some processing, returns a country, like below:

def find_country_from_connection_ip(ip):
    # Do some processing
    return country

I am using the function inside the apply method, like below:

df['Country'] = df.apply(lambda x: find_country_from_ip(x['IP']), axis=1) 

As it is pretty straightforward, what I want is to evaluate a new column from an existing column in the DataFrame which has >400000 rows.

It runs, but it is terribly slow and throws a warning like below:

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':

I understand the problem, but can't quite figure out how to use loc with apply and lambda.

N.B. Please also suggest a more efficient alternative solution, if you have one, that achieves the same end result.

**** EDIT ****

The function is mainly a lookup on mmdb database like below:

def find_country_from_ip(ip):
    result = subprocess.Popen(
        "mmdblookup --file GeoIP2-Country.mmdb --ip {} country names en".format(ip).split(" "),
        stdout=subprocess.PIPE).stdout.read()
    if result:
        return re.search(r'\"(.+?)\"', result).group(1)
    else:
        final_output = subprocess.Popen(
            "mmdblookup --file GeoIP2-Country.mmdb --ip {} registered_country names en".format(ip).split(" "),
            stdout=subprocess.PIPE).stdout.read()
        return re.search(r'\"(.+?)\"', final_output).group(1)

This is nevertheless a costly operation, and with a DataFrame of >400000 rows it is bound to take time. But how much? That is the question. Currently it takes about 2 hours, which I think is far too long.
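One cheap optimization worth noting before switching libraries: spawning one mmdblookup subprocess per row repeats the same expensive work for every duplicate IP. Memoizing the lookup (e.g. with functools.lru_cache) cuts the number of subprocess calls down to the number of unique IPs. The sketch below illustrates the pattern with a hypothetical in-memory stand-in for the real mmdblookup call, so it runs without the GeoIP2 database:

```python
import functools
import pandas as pd

CALLS = 0  # counts how many "expensive" lookups actually run

@functools.lru_cache(maxsize=None)
def find_country_from_ip(ip):
    # Hypothetical stand-in for the real subprocess-based mmdblookup call;
    # assume each uncached invocation is expensive.
    global CALLS
    CALLS += 1
    return {'1.1.1.1': 'Australia', '8.8.8.8': 'United States'}.get(ip, 'Unknown')

df = pd.DataFrame({'IP': ['1.1.1.1', '8.8.8.8', '1.1.1.1', '8.8.8.8', '1.1.1.1']})
df['Country'] = df['IP'].map(find_country_from_ip)

print(CALLS)  # one lookup per unique IP, not per row
```

With many duplicate IPs in 400K rows, this alone can shrink the number of subprocess launches dramatically, though the per-call overhead for unique IPs remains.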

Answer 1:

I would use the maxminddb-geolite2 (GeoLite) module for that.

First, install the maxminddb-geolite2 module:

pip install maxminddb-geolite2 

Python Code:

import pandas as pd
from geolite2 import geolite2

def get_country(ip):
    try:
        x = geo.get(ip)
    except ValueError:
        return pd.np.nan
    try:
        return x['country']['names']['en'] if x else pd.np.nan
    except KeyError:
        return pd.np.nan

geo = geolite2.reader()

# it took me quite some time to find a free and large enough list of IPs ;)
# IPs for testing: http://upd.emule-security.org/ipfilter.zip
x = pd.read_csv(r'D:\download\ipfilter.zip',
                usecols=[0], sep='\s*\-\s*',
                header=None, names=['ip'])

# get unique IPs
unique_ips = x['ip'].unique()
# make a Series out of it
unique_ips = pd.Series(unique_ips, index=unique_ips)
# map IP --> country
x['country'] = x['ip'].map(unique_ips.apply(get_country))

geolite2.close()

Output:

In [90]: x
Out[90]:
                     ip     country
0       000.000.000.000         NaN
1       001.002.004.000         NaN
2       001.002.008.000         NaN
3       001.009.096.105         NaN
4       001.009.102.251         NaN
5       001.009.106.186         NaN
6       001.016.000.000         NaN
7       001.055.241.140         NaN
8       001.093.021.147         NaN
9       001.179.136.040         NaN
10      001.179.138.224    Thailand
11      001.179.140.200    Thailand
12      001.179.146.052         NaN
13      001.179.147.002    Thailand
14      001.179.153.216    Thailand
15      001.179.164.124    Thailand
16      001.179.167.188    Thailand
17      001.186.188.000         NaN
18      001.202.096.052         NaN
19      001.204.179.141       China
20      002.051.000.165         NaN
21      002.056.000.000         NaN
22      002.095.041.202         NaN
23      002.135.237.106  Kazakhstan
24      002.135.237.250  Kazakhstan
...                 ...         ...

Timing: for 171,884 unique IPs:

In [85]: %timeit unique_ips.apply(get_country)
1 loop, best of 3: 14.8 s per loop

In [86]: unique_ips.shape
Out[86]: (171884,)

Conclusion: it would take approx. 35 seconds for your DF with 400K unique IPs on my hardware:

In [93]: 400000/171884*15
Out[93]: 34.90726303786274
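The key trick in the answer above, isolated: apply the expensive function only to the unique IPs, build a Series indexed by IP from the results, then map it back onto the full column. A minimal sketch with a dummy get_country (no geolite2 required, lookup values are made up for illustration):

```python
import pandas as pd

def get_country(ip):
    # dummy lookup standing in for geolite2's reader.get()
    return {'2.135.237.106': 'Kazakhstan', '1.204.179.141': 'China'}.get(ip)

x = pd.DataFrame({'ip': ['2.135.237.106', '1.204.179.141',
                         '2.135.237.106', '2.135.237.106']})

unique_ips = x['ip'].unique()                        # 2 unique values, 4 rows
lookup = pd.Series(unique_ips, index=unique_ips).apply(get_country)
x['country'] = x['ip'].map(lookup)                   # map via the Series index

print(x['country'].tolist())
```

Because .map accepts a Series and joins on its index, each unique IP is looked up exactly once no matter how many times it repeats in the column.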


Answer 2:

IIUC you can use your custom function with Series.apply this way:

df['Country'] = df['IP'].apply(find_country_from_ip) 

Sample:

df = pd.DataFrame({'IP':[1,2,3],
                   'B':[4,5,6]})

def find_country_from_ip(ip):
    # Do some processing
    # some testing formula
    country = ip + 5
    return country

df['Country'] = df['IP'].apply(find_country_from_ip)

print (df)
   B  IP  Country
0  4   1        6
1  5   2        7
2  6   3        8


Answer 3:

Your issue isn't with how to use apply or loc. The issue is that your df is flagged as a copy of another dataframe.

Let's explore this a bit:

df = pd.DataFrame(dict(IP=[1, 2, 3], A=list('xyz')))
df

def find_country_from_connection_ip(ip):
    return {1: 'A', 2: 'B', 3: 'C'}[ip]

df['Country'] = df.IP.apply(find_country_from_connection_ip)
df

No Problems
Let's make some problems

# This should make a copy
print(bool(df.is_copy))
df = df[['A', 'IP']]
print(df)
print(bool(df.is_copy))

False
   A  IP
0  x   1
1  y   2
2  z   3
True

Perfect, now we have a copy. Let's perform the same assignment with apply.

df['Country'] = df.IP.apply(find_country_from_connection_ip)
df

//anaconda/envs/3.5/lib/python3.5/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


How do you fix it?
Wherever you created df, you can use df.loc. In my example above, the slice df = df[['A', 'IP']] triggered the copy. If I had used loc instead, I'd have avoided this mess.

print(bool(df.is_copy))
df = df.loc[:]
print(df)
print(bool(df.is_copy))

False
   A  IP
0  x   1
1  y   2
2  z   3
False

You need to either find where df is created and use loc or iloc instead when you slice the source dataframe. Or, you can simply do this...

df.is_copy = None 
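Note that is_copy was deprecated in later pandas versions, so on newer installs the more durable fix is an explicit .copy() on the slice: pandas then knows the new frame is independent and never raises the warning. A minimal sketch (same hypothetical data as above):

```python
import pandas as pd

df = pd.DataFrame(dict(IP=[1, 2, 3], A=list('xyz')))

# .copy() makes the slice an independent DataFrame,
# so the later assignment raises no SettingWithCopyWarning
df = df[['A', 'IP']].copy()

df['Country'] = df.IP.apply(lambda ip: {1: 'A', 2: 'B', 3: 'C'}[ip])
print(df['Country'].tolist())
```

This trades a small amount of memory for unambiguous ownership of the data, which is usually the right call on a frame you intend to keep modifying.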

The full demonstration:

df = pd.DataFrame(dict(IP=[1, 2, 3], A=list('xyz')))

def find_country_from_connection_ip(ip):
    return {1: 'A', 2: 'B', 3: 'C'}[ip]

df = df[:]

df.is_copy = None

df['Country'] = df.IP.apply(find_country_from_connection_ip)
df


