Joining Spark DataFrames on a nearest key condition
What’s a performant way to do fuzzy joins in PySpark? I am looking for the community's views on a scalable approach to joining large Spark DataFrames on a nearest key condition. Allow me to illustrate the problem with a representative example. Suppose we have the following Spark DataFrame containing events occurring at some point in time:

ddf_event = spark.createDataFrame(
    data=[
        [1, 'A'],
        [5, 'A'],
        [10, 'B'],
        [15, 'A'],
        [20, 'B'],
        [25, 'B'],
        [30, 'A']
    ],
    schema=['ts_event', 'event']
)
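
To make the join condition concrete, here is a minimal sketch of the naive approach I would like to avoid: a cross join followed by a window rank on the absolute timestamp difference. The second DataFrame ddf_state and its ts_state column are hypothetical stand-ins introduced only for illustration, not part of my actual data, and the cross join obviously does not scale to large DataFrames.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Hypothetical second DataFrame holding timestamped states to match against.
ddf_state = spark.createDataFrame(
    data=[[3, 'X'], [12, 'Y'], [22, 'Z']],
    schema=['ts_state', 'state']
)

# For each event, rank candidate states by how close their timestamp is.
w = Window.partitionBy('ts_event').orderBy(
    F.abs(F.col('ts_event') - F.col('ts_state'))
)

# Cross-join every event to every state, keep only the nearest state per event.
ddf_nearest = (
    ddf_event.crossJoin(ddf_state)
    .withColumn('rn', F.row_number().over(w))
    .filter(F.col('rn') == 1)
    .drop('rn')
)

This expresses the intended "nearest key" semantics, but the cross join materializes the full Cartesian product, which is what I am hoping a more scalable pattern can avoid.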