Spark find max of date partitioned column

Question

I have a Parquet dataset partitioned in the following way:

data
/batch_date=2020-01-20
/batch_date=2020-01-21
/batch_date=2020-01-22
/batch_date=2020-01-23
/batch_date=2020-01-24

Here batch_date, the partition column, is of date type.

I want to read only the data from the latest date partition, but as a consumer I don't know what the latest value is.

I could use a simple group by, something like:

df.groupby().agg(max(col('batch_date'))).first()

While this would work, it is inefficient because it involves a groupby.

I want to know if we can query the latest partition in a more efficient way.

Thanks.

Answer 1:

The max function can be used without groupBy:

df.select(max("batch_date"))

Answer 2:

The method suggested by @pasha701 would involve loading the entire Spark DataFrame with all the batch_date partitions and then finding the max of that. I think the author is asking for a way to find the max partition date directly and load only that partition. One way is to use hdfs or s3fs: list the contents of the S3 path, find the max partition from that listing, and then load only that partition. That would be more efficient.

Assuming you are using AWS S3, something like this:

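import s3fs

inpath = "s3://bucket_path/data/"
fs = s3fs.S3FileSystem(anon=False)

# List the partition directories and collect their batch_date values.
datelist = []
for path in fs.ls(inpath):
    # Skip entries such as _SUCCESS that are not partition directories.
    if "batch_date=" in path:
        datelist.append(path.rstrip("/").split("batch_date=")[1])

# ISO dates (YYYY-MM-DD) sort correctly as plain strings.
maxpart = max(datelist)

df = spark.read.parquet(inpath + "batch_date=" + maxpart)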

This does all the work on the directory listing, so no data is loaded until the partition you want has been identified.
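The selection step can be made a little more defensive. Below is a minimal sketch, assuming the listing is a plain list of path strings such as a call like fs.ls would return; the helper name latest_partition and the sample listing are hypothetical and only for illustration. It parses the values as real dates instead of comparing strings, and it skips entries that are not batch_date directories.

```python
from datetime import date
from typing import Iterable, Optional, Tuple


def latest_partition(
    paths: Iterable[str], key: str = "batch_date"
) -> Optional[Tuple[date, str]]:
    """Return (date, path) for the newest `key=YYYY-MM-DD` directory, or None."""
    prefix = key + "="
    best = None
    for path in paths:
        # Look only at the last path component, e.g. "batch_date=2020-01-24".
        name = path.rstrip("/").rsplit("/", 1)[-1]
        if not name.startswith(prefix):
            continue  # skip _SUCCESS and other non-partition entries
        try:
            parsed = date.fromisoformat(name[len(prefix):])
        except ValueError:
            continue  # skip values that are not ISO dates
        if best is None or parsed > best[0]:
            best = (parsed, path)
    return best


# Hypothetical listing, shaped like the directory layout in the question.
listing = [
    "bucket_path/data/_SUCCESS",
    "bucket_path/data/batch_date=2020-01-20",
    "bucket_path/data/batch_date=2020-01-24",
    "bucket_path/data/batch_date=2020-01-22/",
]
print(latest_partition(listing))
```

Note that when a single partition directory is read directly, the batch_date column is not part of the resulting DataFrame, because Spark infers partition columns from the directories below the path it is given. Setting the basePath read option to the dataset root is the usual way to keep it.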

Answer 3:

Use SHOW PARTITIONS to get all the partitions of a table:

show partitions TABLENAME

The output will look like this:

pt=2012.07.28.08/is_complete=1
pt=2012.07.28.09/is_complete=1

We can then get data from a specific partition using a query like this:

select * from TABLENAME where pt='2012.07.28.10' and is_complete='1' limit 1;

Additional filters or a GROUP BY can also be applied to it.
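To pick the latest partition from that output programmatically, the partition strings have to be parsed first. Here is a minimal sketch, assuming each partition is a string of key=value pairs separated by "/" as in the output above, and that the values are zero-padded so that string order matches chronological order. The helper names parse_partition_spec and max_partition_value are hypothetical.

```python
from typing import Dict, Iterable, Optional


def parse_partition_spec(spec: str) -> Dict[str, str]:
    """Turn 'pt=2012.07.28.08/is_complete=1' into a dict of column -> value."""
    return dict(part.split("=", 1) for part in spec.split("/"))


def max_partition_value(specs: Iterable[str], key: str) -> Optional[str]:
    """Return the largest value of `key` across the specs, compared as strings."""
    values = []
    for spec in specs:
        parsed = parse_partition_spec(spec)
        if key in parsed:
            values.append(parsed[key])
    return max(values, default=None)


# Sample rows copied from the output above. With a live session they could
# instead be collected from the query, for example:
#   specs = [row[0] for row in spark.sql("SHOW PARTITIONS TABLENAME").collect()]
specs = [
    "pt=2012.07.28.08/is_complete=1",
    "pt=2012.07.28.09/is_complete=1",
]
print(max_partition_value(specs, "pt"))
```

The returned value can then be used in the WHERE clause of the query, so only the chosen partition is scanned.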

Source: https://stackoverflow.com/questions/61818650/spark-find-max-of-date-partitioned-column

Tags

apache-spark

pyspark