How to filter out positional data based on distance from a known reference trajectory?

Question

I have an 87288-point dataset that I need to filter. The fields to filter on are the X and Y positions, given as longitude and latitude. Plotted, the data looks like this:

The problem is that I only need data along a certain path, which is known in advance. Something like this:

I already know how to filter data in a pandas DataFrame, but since the path is not linear, I need an effective strategy to clear out all the noisy data with a certain degree of precision (the dataset is so large that manually picking the points is not an option).

Here is some sample data. The only important columns are Latitud and Longitud (Y and X, respectively).

Sesion,Tiempo,Latitud,Longitud,PM2.5,Modo,Hora,DiaSemana
M-O-AM-07OCT19-DMR,2019-10-01 09:48:17.625,3.3659550000000005,-76.5288288,13.0,OUTDOOR,AM,1
M-O-AM-07OCT19-DMR,2019-10-07 08:18:03.555,3.3661757000000003,-76.5289441,12.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:04.596,3.3661757000000003,-76.5289441,11.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:05.572,3.3661767,-76.5289375,11.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:06.614,3.3661790999999996,-76.5289188,11.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:07.581,3.3661814,-76.5289024,11.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:08.588,3.3661847999999996,-76.52889820000001,11.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:09.570,3.3661922,-76.52890450000001,11.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:10.579,3.3661922,-76.52890450000001,11.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:11.577,3.3662135,-76.52893370000001,12.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:12.611,3.3662227999999996,-76.5289516,12.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:13.561,3.3662227999999996,-76.5289516,11.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:14.631,3.3662346,-76.5289927,11.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:15.554,3.3662421,-76.52901440000001,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:16.623,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:17.593,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:18.617,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:19.608,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:20.605,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:21.594,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:22.608,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:23.620,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:24.611,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:25.622,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:26.590,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:27.619,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:28.595,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:29.628,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:30.621,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0

I have tried handpicking a few points along the route and filtering out the rest using a fixed minimum distance, something like this:

import pandas as pd
import random
import matplotlib.pyplot as plt
import seaborn as sns
from cycler import cycler
import numpy as np
from salem import get_demo_file, DataLevels, GoogleVisibleMap, Map
import geopy.distance

def get_dist(coords_1 , coords_2):
    return geopy.distance.distance(coords_1, coords_2).meters

dists=[
    (-76.5297163,3.3665631),
    (-76.5307019,3.3656924),
    (-76.5314718,3.3646900),
    (-76.5319956,3.3638394),
    (-76.5316622,3.3621781),
    (-76.5311999,3.3611796),
    (-76.5308636,3.3599338),
    (-76.5306335,3.3585191),
    (-76.5304758,3.3577502),
    (-76.5303957,3.3561101),
    (-76.5302998,3.3543178),
    (-76.5302220,3.3531897),
    (-76.5302369,3.3515283),
    (-76.5303363,3.3502667),
    (-76.5305351,3.3485951),
    (-76.5306779,3.3475220),
    (-76.5308545,3.3456382),
    (-76.5307738,3.3446934),
    (-76.530618,3.3430422)
]
df = pd.read_csv('movil.csv')


for index, row in df.iterrows():
    if index%1000 ==0:
        print(index)
    mind=None
    for i in dists:
        if mind is not None:
            d=get_dist((row['Latitud'],row['Longitud']),(i[1],i[0]))
            if d<mind:
                mind=d
        else:
            mind=get_dist((row['Latitud'],row['Longitud']),(i[1],i[0]))
    if mind>125:
        df.drop(index, inplace=True)

print(df)

Using this approach I managed to get some cleaning, but I feel a lot of useful data is getting filtered out.

Answer 1:

Let's start with some sample data. Note that latitude and longitude are recorded in degrees for generation and plotting, but in radians for computation.

import numpy
import pandas

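def add_radians(df):
    # Add a radians copy of every "*_deg" column, e.g. "lat_deg" -> "lat".
    # DataFrame.items() is used because iteritems() was removed in pandas 2.0.
    return df.assign(**{colname[:-len("_deg")]: numpy.radians(col)
                        for colname, col in df.items()
                        if colname.endswith("_deg")})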

n_ref = 26
ref_traj = pandas.DataFrame({"lat_deg": -76 + numpy.linspace(-1, 1, n_ref),
                             "lon_deg":   3 + numpy.linspace(-1, 1, n_ref)**2,
                            }).pipe(add_radians)

n = 500
traj = pandas.DataFrame({"lat_deg": -76 + numpy.cumsum(numpy.random.choice([-1, 1], size=n) * 0.05),
                         "lon_deg":   3 + numpy.cumsum(numpy.random.choice([-1, 1], size=n) * 0.05),
                        }).pipe(add_radians)

ax = traj.plot.scatter(x="lat_deg", y="lon_deg")
ax = ref_traj.plot.scatter(x="lat_deg", y="lon_deg", color="red", ax=ax)

Next, we can define a vectorized function returning the distance between two points. This should work for 1- or 2-dimensional arrays.

def distance(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in kilometers; inputs in radians."""
    # Inputs may be scalars or arrays with mutually broadcastable shapes.
    R = 6371  # Radius of Earth in kilometers

    def hav(theta):
        # Haversine function: hav(theta) = sin^2(theta / 2)
        return numpy.sin(theta / 2)**2

    d_lat = lat2 - lat1
    d_lon = lon2 - lon1
    a = hav(d_lat) + numpy.cos(lat1) * numpy.cos(lat2) * hav(d_lon)
    # The central angle is 2 * arcsin(sqrt(a)); multiply by R for arc length.
    return 2 * R * numpy.arcsin(numpy.sqrt(a))
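
A quick way to check a distance function like this is to compare it against arcs whose length is known in closed form. The sketch below is only an illustration: it uses the standard haversine formula (including the arcsin step) under the hypothetical name haversine_km, with the same Earth radius of 6371 km, and checks one degree of latitude along a meridian and a quarter of the equator.

```python
import numpy

R_KM = 6371.0  # mean Earth radius in kilometers (same value as in the answer)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers; all inputs in radians."""
    a = (numpy.sin((lat2 - lat1) / 2) ** 2
         + numpy.cos(lat1) * numpy.cos(lat2) * numpy.sin((lon2 - lon1) / 2) ** 2)
    return 2 * R_KM * numpy.arcsin(numpy.sqrt(a))


# One degree of latitude along a meridian is an arc of radians(1) on the sphere.
one_degree = haversine_km(numpy.radians(3.0), numpy.radians(-76.5),
                          numpy.radians(4.0), numpy.radians(-76.5))
expected_one_degree = R_KM * numpy.radians(1.0)

# A quarter of the equator: 90 degrees of longitude at latitude 0.
quarter_equator = haversine_km(0.0, 0.0, 0.0, numpy.pi / 2)
expected_quarter = R_KM * numpy.pi / 2

print(one_degree, expected_one_degree)
print(quarter_equator, expected_quarter)
```

For short distances the arcsin makes almost no difference, but for long arcs such as the quarter equator a version without it would noticeably underestimate the distance.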

Then, we can find the minimum distance from each trajectory point to any reference trajectory point. This is computationally expensive, at O(N*M) for N trajectory points and M reference points, but we can vectorize it by broadcasting the reference points and trajectory points into 2-D arrays.

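def min_distance(ref_lat, ref_lon, lat, lon):
    # Convert to plain NumPy arrays first: indexing a pandas Series with
    # numpy.newaxis is not supported in recent pandas versions.
    ref_lat, ref_lon = numpy.asarray(ref_lat), numpy.asarray(ref_lon)
    lat, lon = numpy.asarray(lat), numpy.asarray(lon)
    shape = (lat.shape[0], ref_lat.shape[0])  # (N trajectory, M reference)

    def broadcasted(a):
        return numpy.broadcast_to(a, shape=shape)

    d = distance(lat1=broadcasted(ref_lat),
                 lon1=broadcasted(ref_lon),
                 lat2=broadcasted(lat[:, numpy.newaxis]),
                 lon2=broadcasted(lon[:, numpy.newaxis]))
    # Minimum over the reference axis: one value per trajectory point.
    return numpy.amin(d, axis=-1)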
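
With 87288 points, the full N-by-M distance matrix is manageable for a few dozen reference points, but it grows quickly if the reference path is sampled more densely. One way to bound memory, sketched here as an assumption rather than as part of the method above, is to process the trajectory in chunks so that only chunk_size-by-M distances exist at a time. The names haversine_km, min_distance_chunked and chunk_size are illustrative.

```python
import numpy

R_KM = 6371.0  # mean Earth radius in kilometers


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers; all inputs in radians."""
    a = (numpy.sin((lat2 - lat1) / 2) ** 2
         + numpy.cos(lat1) * numpy.cos(lat2) * numpy.sin((lon2 - lon1) / 2) ** 2)
    return 2 * R_KM * numpy.arcsin(numpy.sqrt(a))


def min_distance_chunked(ref_lat, ref_lon, lat, lon, chunk_size=10_000):
    """Minimum distance (km) from each point to any reference point.

    All coordinates are in radians. Only chunk_size x M distances are
    held in memory at any one time.
    """
    ref_lat = numpy.asarray(ref_lat, dtype=float)[numpy.newaxis, :]
    ref_lon = numpy.asarray(ref_lon, dtype=float)[numpy.newaxis, :]
    lat = numpy.asarray(lat, dtype=float)
    lon = numpy.asarray(lon, dtype=float)

    out = numpy.empty(lat.shape[0])
    for start in range(0, lat.shape[0], chunk_size):
        stop = start + chunk_size
        # Shape (rows in chunk, M): distances from this chunk to every reference point.
        d = haversine_km(ref_lat, ref_lon,
                         lat[start:stop, numpy.newaxis],
                         lon[start:stop, numpy.newaxis])
        out[start:stop] = d.min(axis=1)
    return out


# Small synthetic demo (made-up coordinates, in radians).
rng = numpy.random.default_rng(0)
ref_lat = numpy.radians(3 + rng.uniform(-1, 1, size=26))
ref_lon = numpy.radians(-76 + rng.uniform(-1, 1, size=26))
lat = numpy.radians(3 + rng.uniform(-1, 1, size=500))
lon = numpy.radians(-76 + rng.uniform(-1, 1, size=500))

d = min_distance_chunked(ref_lat, ref_lon, lat, lon, chunk_size=64)
print(d.shape, d.min(), d.max())
```

The result should match the unchunked computation; the chunk size only trades memory for the number of loop iterations.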

Next, we can choose a tolerance and flag the points whose minimum distance is less than the tolerance.

d = min_distance(ref_traj['lat'], ref_traj['lon'], traj['lat'], traj['lon'])
tolerance = 10  # in kilometers
near_ref = d < tolerance

Finally, we can use the boolean near_ref mask to filter the traj dataframe:

ax = ref_traj.plot.scatter(x="lat_deg", y="lon_deg", color="red")
traj[near_ref].plot.scatter(x="lat_deg", y="lon_deg", color="blue", ax=ax)
traj[~near_ref].plot.scatter(x="lat_deg", y="lon_deg", color="gray", ax=ax)
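
One possible reason the question's attempt discards useful data is that the hand-picked waypoints are spaced farther apart than the tolerance is comfortable with: a point lying on the path midway between two waypoints can be tens of meters from both, which narrows the accepted corridor there. The following sketch is an extension, not part of the answer above. It linearly interpolates extra points between consecutive waypoints before taking the nearest-point distance. The helper names (haversine_m, densify, min_distance_m) and the two test points are hypothetical; the waypoints are the first three from the question, and 125 m is the question's threshold.

```python
import numpy

R_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters; all inputs in radians."""
    a = (numpy.sin((lat2 - lat1) / 2) ** 2
         + numpy.cos(lat1) * numpy.cos(lat2) * numpy.sin((lon2 - lon1) / 2) ** 2)
    return 2 * R_M * numpy.arcsin(numpy.sqrt(a))


def densify(lon_deg, lat_deg, points_per_segment=20):
    """Linearly interpolate extra points between consecutive waypoints."""
    lon_deg = numpy.asarray(lon_deg, dtype=float)
    lat_deg = numpy.asarray(lat_deg, dtype=float)
    n = len(lon_deg)
    t = numpy.arange(n)  # waypoint i sits at parameter t = i
    t_fine = numpy.linspace(0, n - 1, (n - 1) * points_per_segment + 1)
    return numpy.interp(t_fine, t, lon_deg), numpy.interp(t_fine, t, lat_deg)


def min_distance_m(path_lon_deg, path_lat_deg, lon_deg, lat_deg):
    """Minimum distance (m) from each point to any path point; inputs in degrees."""
    p_lon = numpy.radians(numpy.asarray(path_lon_deg, dtype=float))[numpy.newaxis, :]
    p_lat = numpy.radians(numpy.asarray(path_lat_deg, dtype=float))[numpy.newaxis, :]
    lon = numpy.radians(numpy.asarray(lon_deg, dtype=float))[:, numpy.newaxis]
    lat = numpy.radians(numpy.asarray(lat_deg, dtype=float))[:, numpy.newaxis]
    return haversine_m(p_lat, p_lon, lat, lon).min(axis=1)


# First three hand-picked waypoints from the question, as (lon, lat) in degrees.
waypoints = numpy.array([(-76.5297163, 3.3665631),
                         (-76.5307019, 3.3656924),
                         (-76.5314718, 3.3646900)])
way_lon, way_lat = waypoints[:, 0], waypoints[:, 1]
dense_lon, dense_lat = densify(way_lon, way_lat)

# Two illustrative points: the midpoint of the first segment, and a point
# 0.01 degrees of latitude (roughly 1 km) north of the first waypoint.
pts_lon = numpy.array([way_lon[:2].mean(), way_lon[0]])
pts_lat = numpy.array([way_lat[:2].mean(), way_lat[0] + 0.01])

d_raw = min_distance_m(way_lon, way_lat, pts_lon, pts_lat)
d_dense = min_distance_m(dense_lon, dense_lat, pts_lon, pts_lat)
keep = d_dense < 125  # same 125 m threshold as in the question
print(d_raw.round(1), d_dense.round(1), keep)
```

On the real data, the same mask could presumably be built with min_distance_m(dense_lon, dense_lat, df['Longitud'], df['Latitud']) < 125 and applied as df[keep], which also avoids dropping rows one at a time inside a loop. Linear interpolation in degrees is only a reasonable approximation over short segments like these.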

Source: https://stackoverflow.com/questions/59009981/how-to-filter-out-positional-data-based-on-distance-from-a-known-reference-traje

Tags

python

pandas

gps

data-science

data-cleaning