The sample CSV looks like this:
user_id lat lon
1 19.111841 72.910729
1 19.111342 72.908387
2 19.111542 72.907387
2 19.1
Try this approach:
import pandas as pd
import numpy as np
# parse CSV to DataFrame. You may want to specify the separator (`sep='...'`)
df = pd.read_csv('/path/to/file.csv')
# vectorized haversine function
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    """
    slightly modified version of http://stackoverflow.com/a/29546836/2901002
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees or in radians)
    All (lat, lon) coordinates must have numeric dtypes and be of equal length.
    """
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    return earth_radius * 2 * np.arcsin(np.sqrt(a))
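A quick sanity check of the function, as a minimal sketch: two points on the same meridian that differ by 0.01 degrees of latitude should be `earth_radius * radians(0.01)` apart, roughly 1.11 km. The function is repeated here only so the snippet runs on its own; the names `d` and `expected` are illustrative.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    """Great-circle distance in km (same formula as the function above)."""
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1) / 2.0) ** 2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2
    return earth_radius * 2 * np.arcsin(np.sqrt(a))

# Same longitude, latitudes 0.01 degrees apart
d = haversine(19.119677, 72.905081, 19.129677, 72.905081)

# Arc length along a meridian: radius * angle in radians
expected = 6371 * np.radians(0.01)

print(round(d, 6), round(expected, 6))
```

Both values should agree to within floating-point error, which is a cheap way to confirm that the degree/radian handling is right before applying the function to a whole DataFrame.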
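Now we can calculate the distance between each coordinate and the previous one belonging to the same id (group). The grouping column is called `id` here; use `user_id` if that is the column name in your CSV. The rows are assumed to be sorted by that column already, because the per-group results are concatenated in group order: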
df['dist'] = \
    np.concatenate(df.groupby('id')
                     .apply(lambda x: haversine(x['lat'], x['lon'],
                                                x['lat'].shift(), x['lon'].shift())).values)
Result:
In [105]: df
Out[105]:
    id        lat        lon       dist
0    1  19.111841  72.910729        NaN
1    1  19.111342  72.908387   0.252243
2    2  19.111542  72.907387        NaN
3    2  19.137815  72.914085   3.004976
4    2  19.119677  72.905081   2.227658
5    2  19.129677  72.905081   1.111949
6    3  19.319677  72.905081        NaN
7    3  19.120217  72.907121  22.179974
8    4  19.420217  72.807121        NaN
9    4  19.520217  73.307121  53.584504
10   5  19.319677  72.905081        NaN
11   5  19.419677  72.805081  15.286775
12   5  19.629677  72.705081  25.594890
13   5  19.111860  72.911347  61.509917
14   5  19.111860  72.931346   2.101215
15   5  19.219677  72.605081  36.304756
16   6  19.319677  72.805082        NaN
17   6  19.419677  72.905086  15.287063
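As an alternative to the `apply` + `np.concatenate` step, here is a minimal sketch that uses `groupby(...).shift()` to fetch the previous point of each group and then calls the haversine function once on whole columns. It is not the approach used above, just a variant that keeps the result aligned on the DataFrame index, so it should not depend on the rows being sorted by `id`. The small DataFrame is built from the first rows of the result table, and the `prev` name is illustrative.

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2, earth_radius=6371):
    """Great-circle distance in km; inputs in decimal degrees."""
    # np.radians applied per argument keeps pandas Series (and their index) intact
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return earth_radius * 2 * np.arcsin(np.sqrt(a))

# First five rows of the result table above
df = pd.DataFrame({
    'id':  [1, 1, 2, 2, 2],
    'lat': [19.111841, 19.111342, 19.111542, 19.137815, 19.119677],
    'lon': [72.910729, 72.908387, 72.907387, 72.914085, 72.905081],
})

# Previous (lat, lon) within each id; the first row of each group gets NaN
prev = df.groupby('id')[['lat', 'lon']].shift()

# One vectorized call over the whole frame
df['dist'] = haversine(df['lat'], df['lon'], prev['lat'], prev['lon'])
print(df)
```

The first row of each group has no predecessor, so its `dist` is NaN, the same as in the result above.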