PySpark: read delta/upsert dataset from CSV files


You could start with:

from pyspark.sql.functions import input_file_name
alls = spark.read.csv("files/*").withColumn("filename", input_file_name())

This will load all of the files in the directory and add a filename column you can operate on.

I assume that each filename contains some sort of timestamp or key on which you can differentiate and order the rows, using a window and the row_number function.
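For instance, if each filename embeds a date (a hypothetical naming scheme such as files/upsert_2019-12-01.csv), you could pull that date into its own column with regexp_extract and order on it:

from pyspark.sql.functions import regexp_extract

# hypothetical filenames like "files/upsert_2019-12-01.csv";
# extract the embedded date so rows can be ranked by recency
alls = alls.withColumn("file_date", regexp_extract("filename", r"(\d{4}-\d{2}-\d{2})", 1))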

Amplifying on @pandaromeo's answer, this seems to work...

from pyspark.sql import Window
from pyspark.sql.functions import row_number, desc, input_file_name


# load files, marking each with input file name
df = spark.read.csv(files).withColumn("_ifn", input_file_name())

# use a window function to order the rows for each ID by file name (most recent first)
w = Window.partitionBy(primaryKey).orderBy(desc("_ifn"))
df = df.withColumn("_rn", row_number().over(w))

# grab only the rows that were first (most recent) in each window
# clean up working columns
df = df.where(df._rn == 1).drop("_rn").drop("_ifn")
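
As a usage sketch, the two names the snippet leaves open could be bound like this (both values are hypothetical):

# hypothetical inputs: daily dumps whose names sort chronologically,
# e.g. files/extract_2019-12-01.csv, files/extract_2019-12-02.csv
files = "files/extract_*.csv"
primaryKey = "id"  # the upsert key column shared by all the CSVs

Note that ordering by input_file_name() only keeps the latest version of each row if the file names sort lexically in time order; if they don't, derive a proper timestamp column first, as in the regexp_extract example above.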