Question
I have a dataset that is updated periodically, which I receive as a series of CSV files describing the changes. I'd like a DataFrame that contains only the latest version of each row. Is there a way to load the whole dataset in Spark/pyspark that allows for parallelism?
Example:
- File 1 (Key, Value)
1,ABC
2,DEF
3,GHI
- File 2 (Key, Value)
2,XYZ
4,UVW
- File 3 (Key, Value)
3,JKL
4,MNO
Should result in:
1,ABC
2,XYZ
3,JKL
4,MNO
I know I could do this by loading each file sequentially and then using an anti join (to kick out the old values being replaced) followed by a union, but that forces the files to be processed one after another rather than in parallel (a sketch of this approach is below).
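For reference, a minimal sketch of that sequential approach, assuming spark is an active SparkSession, the file paths are listed oldest-to-newest, and the columns are named Key and Value (the file names here are hypothetical):

from functools import reduce

def apply_changes(current, changes):
    # keep rows whose key is not overridden by the newer file, then append the new rows
    kept = current.join(changes, on="Key", how="left_anti")
    return kept.unionByName(changes)

files = ["file1.csv", "file2.csv", "file3.csv"]   # oldest first
dfs = [spark.read.csv(f).toDF("Key", "Value") for f in files]
latest = reduce(apply_changes, dfs)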
Answer 1:
You could
from pyspark.sql.functions import input_file_name
alls = spark.read.csv("files/*").withColumn('filename', input_file_name())
This loads every file in the directory and adds a filename column you can work with. Assuming the file name contains a timestamp or some other key you can order by, you can then use a window with the row_number function to rank the versions of each row and keep only the newest one.
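For example, if the file names embedded a timestamp such as changes_20230105.csv (a hypothetical naming scheme), you could pull it out into an explicit ordering column:

from pyspark.sql.functions import input_file_name, regexp_extract

alls = (spark.read.csv("files/*")
             .withColumn("filename", input_file_name())
             # extract the 8-digit date from the path so it can be used to order the files
             .withColumn("file_date", regexp_extract("filename", r"(\d{8})", 1)))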
Answer 2:
Amplifying on @pandaromeo's answer, this seems to work...
from pyspark.sql import Window
from pyspark.sql.functions import row_number, desc, input_file_name
# load files, marking each row with the name of the file it came from
# (`files` is a path or glob for the CSVs; `primaryKey` below is the name of the key column)
df = spark.read.csv(files).withColumn("_ifn", input_file_name())
# use a window function to order the rows for each ID by file name (most recent first)
w = Window.partitionBy(primaryKey).orderBy(desc('_ifn'))
df = df.withColumn("_rn", row_number().over(w))
# grab only the rows that were first (most recent) in each window
# clean up working columns
df = df.where(df._rn == 1).drop("_rn").drop("_ifn")
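Reusing the imports above, this could be wired up for the example in the question roughly as follows (assuming the file names sort lexically in update order; the glob and column names are illustrative, and the columns are renamed explicitly since headerless CSVs come in as _c0/_c1):

files = "changes/*.csv"        # hypothetical glob; file1.csv < file2.csv < file3.csv lexically
primaryKey = "Key"             # the column to deduplicate on

df = (spark.read.csv(files)
          .toDF("Key", "Value")
          .withColumn("_ifn", input_file_name()))

w = Window.partitionBy(primaryKey).orderBy(desc("_ifn"))
latest = (df.withColumn("_rn", row_number().over(w))
            .where("_rn = 1")
            .drop("_rn", "_ifn"))

latest.show()   # 1,ABC  2,XYZ  3,JKL  4,MNO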
Source: https://stackoverflow.com/questions/44809071/pyspark-read-delta-upsert-dataset-from-csv-files