Question
What’s a performant way to do fuzzy joins in PySpark?
I am looking for the community's views on a scalable approach to joining large Spark DataFrames on a nearest key condition. Allow me to illustrate this problem by means of a representative example. Suppose we have the following Spark DataFrame containing events occurring at some point in time:
ddf_event = spark.createDataFrame(
    data=[
        [1, 'A'],
        [5, 'A'],
        [10, 'B'],
        [15, 'A'],
        [20, 'B'],
        [25, 'B'],
        [30, 'A']
    ],
    schema=['ts_event', 'event']
)
and the following Spark DataFrame containing GPS data measured at some point in time:
ddf_gps = spark.createDataFrame(
    data=[
        [2, '(-46.84635, 173.13674)'],
        [4, '(2.50362, 104.34136)'],
        [8, '(-24.20741, 51.80755)'],
        [15, '(-59.07798, -20.49141)'],
        [18, '(-44.34468, -167.90401)'],
        [24, '(-18.84175, 16.68628)'],
        [27, '(20.48501,58.42423)']
    ],
    schema=['ts_gps', 'gps_coordinates']
)
which we would like to join to produce the following resulting DataFrame:
+--------+-----+------+-----------------------+
|ts_event|event|ts_gps|gps_coordinates |
+--------+-----+------+-----------------------+
|1 |A |2 |(-46.84635, 173.13674) |
|5 |A |4 |(2.50362, 104.34136) |
|10 |B |8 |(-24.20741, 51.80755) |
|15 |A |15 |(-59.07798, -20.49141) |
|20 |B |18 |(-44.34468, -167.90401)|
|25 |B |24 |(-18.84175, 16.68628) |
|30 |A |27 |(20.48501,58.42423) |
+--------+-----+------+-----------------------+
effectively matching each event with the GPS data point whose timestamp is nearest to the event timestamp.
We thus run into the problem of joining on a nearest-key condition, 'nearest' in this case being defined as the smallest absolute difference between the timestamps.
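To make the matching rule concrete, the nearest-key condition can be illustrated with a tiny plain-Python sketch (not Spark) on the sample timestamps above:
# Plain-Python illustration of the matching rule: for each event timestamp,
# pick the GPS timestamp with the smallest absolute difference.
gps_timestamps = [2, 4, 8, 15, 18, 24, 27]

def nearest(ts_event, candidates=gps_timestamps):
    return min(candidates, key=lambda ts_gps: abs(ts_gps - ts_event))

assert nearest(1) == 2    # first row of the expected output above
assert nearest(20) == 18
assert nearest(30) == 27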
I've explored two approaches to achieve this: one based on a filtered binned join (FBJ) and one based on a filtered sorted union (FSU). Both approaches are described below in more detail.
The FBJ approach depends on the parameter bin_size, which limits the time window in which a matching GPS timestamp may be found. Increasing bin_size increases the computational load; decreasing it degrades the quality of the result.
Neither approach appears to scale linearly with the size of the input DataFrames.
In practice I have to deal with input data consisting of tens of millions of rows, so I am at a loss for a viable solution to this problem.
FBJ approach
The FBJ approach consists of the following steps:
- Create a ts_bin column by binning the timestamp columns, implemented by:
from pyspark.sql import functions as F

bin_size = 10

# Bin both timestamp columns so that only rows in the same bin are joined.
ddf_event = ddf_event.withColumn(
    'ts_bin',
    F.round(F.col('ts_event') / bin_size)
)
ddf_gps = ddf_gps.withColumn(
    'ts_bin',
    F.round(F.col('ts_gps') / bin_size)
)
- Join the DataFrames on the ts_bin column, implemented by:
ddf = ddf_event.join(ddf_gps, 'ts_bin', 'left_outer')
- Determine the minimum timestamp difference, implemented by:
from pyspark.sql.window import Window

# For every event, compute the smallest absolute timestamp difference
# among its joined GPS candidates.
window = Window.partitionBy('ts_event')
ddf = ddf.withColumn(
    'ts_diff',
    F.abs(F.col('ts_gps') - F.col('ts_event'))
)
ddf = ddf.withColumn(
    'min_ts_diff',
    F.min(F.col('ts_diff')).over(window)
)
- Filter and select the relevant rows and columns, implemented by:
# Keep, per event, the candidate(s) at the minimum time difference; the
# null check keeps events whose bin contained no GPS row at all.
ddf = (
    ddf
    .where(
        (F.col('ts_diff') == F.col('min_ts_diff')) |
        (F.col('ts_diff').isNull())
    )
    .select(
        'ts_event',
        'event',
        'ts_gps',
        'gps_coordinates'
    )
)
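A side note on this filter (my own observation, not part of the original code): if two GPS points are equally close to an event, both rows survive the filter. A sketch of breaking such ties afterwards, keeping the earlier GPS timestamp:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# After the filter, every remaining row of an event is already at the
# minimum time difference, so ordering by ts_gps alone breaks the tie.
tie_window = Window.partitionBy('ts_event').orderBy(F.col('ts_gps').asc())
ddf = (
    ddf
    .withColumn('rn', F.row_number().over(tie_window))
    .where(F.col('rn') == 1)
    .drop('rn')
)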
Limiting cases for bin_size:
- bin_size >> 1 effectively results in a full cross-join
- bin_size = 1 effectively results in a left join on ts_event == ts_gps
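A related caveat (my own observation, not from the original post): with a pure equi-join on ts_bin, the truly nearest GPS point can sit just across a bin boundary and never become a join candidate. One possible refinement is to let every event also join against its two neighbouring bins, at roughly three times the candidate volume; a sketch, assuming the ts_bin columns from the binning step above:
from pyspark.sql import functions as F

# Expand each event to its own bin plus the two adjacent bins, so a nearest
# GPS point just across a bin boundary is still considered.
ddf_event_expanded = ddf_event.withColumn(
    'ts_bin',
    F.explode(F.array(
        F.col('ts_bin') - 1,
        F.col('ts_bin'),
        F.col('ts_bin') + 1
    ))
)
ddf = ddf_event_expanded.join(ddf_gps, 'ts_bin', 'left_outer')

# Note: with this expansion an event can produce both matched and unmatched
# rows, so the later null handling should keep a null row only when the
# event has no candidate at all, e.g.:
#   .where((F.col('ts_diff') == F.col('min_ts_diff')) | F.col('min_ts_diff').isNull())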
FSU approach
The FSU approach consists of the following steps:
- Union the DataFrames (see also the unionByName note after these steps), implemented by:
def union(df1, df2):
    # Align the two schemas by adding any missing columns as nulls, then union.
    cols = list(set(df1.columns).union(set(df2.columns)))
    for col in cols:
        if col not in df1.columns:
            df1 = df1.withColumn(col, F.lit(None))
        if col not in df2.columns:
            df2 = df2.withColumn(col, F.lit(None))
    return df1.select(cols).union(df2.select(cols))

ddf_event = ddf_event.withColumn('timestamp', F.col('ts_event'))
ddf_gps = ddf_gps.withColumn('timestamp', F.col('ts_gps'))
ddf = union(ddf_event, ddf_gps)
- Sort the resulting DataFrame and get the adjacent GPS timestamps (see also the note on window bounds after these steps), implemented by:
from sys import maxsize

# Running windows over the union, ordered by timestamp: last_window looks
# backwards (most recent GPS row at or before the current row), first_window
# looks forwards (next GPS row at or after the current row).
last_window = Window.orderBy(
    F.col('timestamp').asc()
).rowsBetween(-maxsize, 0)
first_window = Window.orderBy(
    F.col('timestamp').asc()
).rowsBetween(0, maxsize)

ddf = (
    ddf.withColumn(
        'prev_time',
        F.last(F.col('ts_gps'), ignorenulls=True).over(last_window)
    ).withColumn(
        'prev_coordinates',
        F.last(F.col('gps_coordinates'), ignorenulls=True).over(last_window)
    ).withColumn(
        'next_time',
        F.first(F.col('ts_gps'), ignorenulls=True).over(first_window)
    ).withColumn(
        'next_coordinates',
        F.first(F.col('gps_coordinates'), ignorenulls=True).over(first_window)
    )
)
- Filter and select the relevant rows and columns, implemented by:
# Pick whichever neighbouring GPS row is closer in time; if there is no
# next GPS row take the previous one, and if there is no previous row the
# null comparison falls through to the next one.
condition = (
    F.col('timestamp') - F.col('prev_time')
    < F.col('next_time') - F.col('timestamp')
)
ddf = (
    ddf
    .where(F.col('event').isNotNull())
    .withColumn(
        'ts_gps',
        F.when(condition | F.col('next_time').isNull(), F.col('prev_time'))
        .otherwise(F.col('next_time'))
    ).withColumn(
        'gps_coordinates',
        F.when(condition | F.col('next_time').isNull(), F.col('prev_coordinates'))
        .otherwise(F.col('next_coordinates'))
    ).select(
        'ts_event',
        'event',
        'ts_gps',
        'gps_coordinates'
    )
)
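Two side notes on the steps above (my own observations, not part of the original approach). First, on Spark 3.1 and later the schema-aligning union helper in the union step can be replaced by the built-in unionByName with its allowMissingColumns flag; a sketch:
from pyspark.sql import functions as F

# Spark >= 3.1: unionByName fills in columns missing on either side with nulls.
ddf_event = ddf_event.withColumn('timestamp', F.col('ts_event'))
ddf_gps = ddf_gps.withColumn('timestamp', F.col('ts_gps'))
ddf = ddf_event.unionByName(ddf_gps, allowMissingColumns=True)
Second, the windows in the sort step order the whole DataFrame without a partitionBy, which forces Spark to move all rows into a single partition (Spark logs a warning to that effect); this is a likely reason the FSU approach does not scale. The same bounds can also be written with Spark's own window markers instead of sys.maxsize:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Equivalent window definitions using the built-in unbounded markers.
last_window = (
    Window.orderBy(F.col('timestamp').asc())
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
first_window = (
    Window.orderBy(F.col('timestamp').asc())
    .rowsBetween(Window.currentRow, Window.unboundedFollowing)
)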
Answer 1:
What you are looking for is a temporal join. Check out the time-series Spark library Flint (formerly HuoHua, which is Chinese for "spark"): https://github.com/twosigma/flint
Using this library, for two given TimeSeries DataFrames (the documentation explains these objects), you can perform the following in PySpark (or Scala Spark):
ddf_event = ...
ddf_gps = ...
result = ddf_event.leftJoin(ddf_gps, tolerance = "1day")
Your timestamps were not clear, so set tolerance according to your needs. You can also do 'future joins' if needed.
Check out their Spark Summit presentation for more explanation and examples: https://youtu.be/g8o5-2lLcvQ
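For readers who cannot add Flint as a dependency, the tolerance-bounded nearest-join semantics described above can also be sketched in plain PySpark as a non-equi (range) join followed by picking the closest candidate per event. This is only a sketch of the semantics on the question's sample DataFrames, not Flint's implementation, and tolerance is an illustrative value:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

tolerance = 5  # maximum allowed |ts_event - ts_gps|; illustrative only

# Range join: every GPS row within the tolerance becomes a candidate.
candidates = ddf_event.join(
    ddf_gps,
    on=F.abs(F.col('ts_gps') - F.col('ts_event')) <= tolerance,
    how='left'
)

# Keep the closest candidate per event.
closest = Window.partitionBy('ts_event').orderBy(
    F.abs(F.col('ts_gps') - F.col('ts_event')).asc()
)
result = (
    candidates
    .withColumn('rn', F.row_number().over(closest))
    .where(F.col('rn') == 1)
    .drop('rn')
)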
Source: https://stackoverflow.com/questions/58077181/joining-spark-dataframes-on-a-nearest-key-condition