segmenting or grouping a df based on parameters or differences within columns going down the dataframe rows?

本秂侑毒 提交于 2020-01-15 05:27:04

问题


I was trying to figure out if there was a way in which where I had a dataframe with multiple fields and I wanted to segment or group the dataframe into a new dataframe based on if the values of specific columns were within x amount of each other?

   I.D  |      Created_Time            | Home_Longitude | Home_Latitude | Work_Longitude | Home_Latitude
  Faa1      2019-02-23 20:01:13.362           -77.0364            38.8951    -72.0364      38.8951

Above is how the original df looks with multiple rows. I want to create a new dataframe where all rows or I.Ds contain created times that are within x amount of minutes of each other, and using haversine within x miles of one another homes, and x miles within one another work.

So Basically trying to filter this dataframe into a df that only contains rows that are within x minutes of order created time, x miles within one another homes and , x miles within each work column value.


回答1:


I did this by

  1. calculating the distances (in miles) and time relative to the first row
    • My logic
      • if n rows are within x minutes/miles of the first row, then those n rows are within x minutes/miles of each other
  2. filter the data using the required distance and time filter conditions

Generate some dummy data

  • random co-ordinates
# Generate random Lat-Long points
def newpoint():
   return uniform(-180,180), uniform(-90, 90)
home_points = (newpoint() for x in range(289))
work_points = (newpoint() for x in range(289))

df = pd.DataFrame(home_points, columns=['Home_Longitude', 'Home_Latitude'])
df[['Work_Longitude', 'Work_Latitude']] = pd.DataFrame(work_points)

# Insert `ID` column as sequence of integers
df.insert(0, 'ID', range(289))

# Generate random datetimes, separated by 5 minute intervals
# (you can choose your own interval)
times = pd.date_range('2012-10-01', periods=289, freq='5min')
df.insert(1, 'Created_Time', times)
print(df.head())

   ID        Created_Time  Home_Longitude  Home_Latitude  Work_Longitude  Work_Latitude
0   0 2012-10-01 00:00:00      -48.885981     -39.412351      -68.756244      24.739860
1   1 2012-10-01 00:05:00       58.584893      59.851739     -119.978429     -87.687858
2   2 2012-10-01 00:10:00      -18.623484      85.435248      -14.204142      -3.693993
3   3 2012-10-01 00:15:00      -29.721788      71.671103      -69.833253     -12.446204
4   4 2012-10-01 00:20:00      168.257968     -13.247833       60.979050     -18.393925

Create Python helper function with haversine distance formula (vectorized haversine distance formula, in km)

def haversine(lat1, lon1, lat2, lon2, to_radians=False, earth_radius=6371):
    """
    slightly modified version: of http://stackoverflow.com/a/29546836/2901002

    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees or in radians)

    All (lat, lon) coordinates must have numeric dtypes and be of equal length.

    """
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])

    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2

    return earth_radius * 2 * np.arcsin(np.sqrt(a))

Calculate distances (relative to first row) in km, using haversine formula. Then, convert km to miles

df['Home_dist_miles'] = \
    haversine(df.Home_Longitude, df.Home_Latitude,
                 df.loc[0, 'Home_Longitude'], df.loc[0, 'Home_Latitude'])*0.621371
df['Work_dist_miles'] = \
    haversine(df.Work_Longitude, df.Work_Latitude,
                 df.loc[0, 'Work_Longitude'], df.loc[0, 'Work_Latitude'])*0.621371

Calculate time differences, in minutes (relative to first row)

  • for the dummy data here, the time differences will be in multiples of 5 minutes (but in real data, they could be anything)
df['time'] = df['Created_Time'] - df.loc[0, 'Created_Time']
df['time_min'] = (df['time'].dt.days * 24 * 60 * 60 + df['time'].dt.seconds)/60

Apply filters (method 1) and then select any 2 rows that satisfy the conditions stated in the OP

home_filter = df['Home_dist_miles']<=12000 # within 12,000 miles
work_filter = df['Work_dist_miles']<=8000 # within 8,000 miles
time_filter = df['time_min']<=25 # within 25 minutes
df_filtered = df.loc[(home_filter) & (work_filter) & (time_filter)]

# Select any 2 rows that satisfy required conditions
df_any2rows = df_filtered.sample(n=2)
print(df_any2rows)

   ID        Created_Time  Home_Longitude  Home_Latitude  Work_Longitude  Work_Latitude  Home_dist_miles  Work_dist_miles     time  time_min
0   0 2012-10-01 00:00:00     -168.956448     -42.970705       -6.340945     -12.749469         0.000000         0.000000 00:00:00       0.0
4   4 2012-10-01 00:20:00      -73.120352      13.748187      -36.953587      23.528789      6259.078588      5939.425019 00:20:00      20.0

Apply filters (method 2) and then select any 2 rows that satisfy the conditions stated in the OP

multi_query = """Home_dist_miles<=12000 & \
                Work_dist_miles<=8000 & \
                time_min<=25"""
df_filtered = df.query(multi_query)

# Select any 2 rows that satisfy required conditions
df_any2rows = df_filtered.sample(n=2)
print(df_any2rows)

   ID        Created_Time  Home_Longitude  Home_Latitude  Work_Longitude  Work_Latitude  Home_dist_miles  Work_dist_miles     time  time_min
0   0 2012-10-01 00:00:00     -168.956448     -42.970705       -6.340945     -12.749469         0.000000         0.000000 00:00:00       0.0
4   4 2012-10-01 00:20:00      -73.120352      13.748187      -36.953587      23.528789      6259.078588      5939.425019 00:20:00      20.0


来源:https://stackoverflow.com/questions/55034820/segmenting-or-grouping-a-df-based-on-parameters-or-differences-within-columns-go

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!