Spark find max of date partitioned column


Question


I have a Parquet dataset partitioned in the following way:

data
/batch_date=2020-01-20
/batch_date=2020-01-21
/batch_date=2020-01-22
/batch_date=2020-01-23
/batch_date=2020-01-24

Here batch_date, the partition column, is of date type.

I want to read only the data from the latest date partition, but as a consumer I don't know what the latest value is.

I could use a simple group by something like

from pyspark.sql.functions import col, max

df.groupby().agg(max(col('batch_date'))).first()

While this would work, it is very inefficient, since the groupBy has to scan every partition just to find the latest one.

I want to know if we can query the latest partition in a more efficient way.

Thanks.


Answer 1:


Function "max" can be used without "groupBy":

from pyspark.sql.functions import max

df.select(max("batch_date"))
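For completeness (this is not part of the original answer), a minimal sketch of pulling out the scalar and filtering on it, assuming spark and df from the question; filtering on the partition column lets Spark prune all other partitions on the follow-up read:

from pyspark.sql.functions import col, max as max_

# Still scans the partition column to find the max value...
latest = df.select(max_("batch_date").alias("latest")).first()["latest"]

# ...but this read is pruned down to the single latest partition.
latest_df = df.where(col("batch_date") == latest)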



Answer 2:


The method suggested by @pasha701 would involve loading the entire Spark DataFrame, with all the batch_date partitions, and then finding the max of that. I think the author is asking for a way to directly find the max partition date and load only that. One way is to use hdfs or s3fs to list the contents of the path, find the max partition from the directory names, and then load only that partition. That would be more efficient.

Assuming you are on AWS S3, something like this:

import s3fs

datelist = []
inpath = "s3://bucket_path/data/"
fs = s3fs.S3FileSystem(anon=False)

# Each entry looks like 'bucket_path/data/batch_date=2020-01-20'.
dirs = fs.ls(inpath)
for path in dirs:
    date = path.split('=')[1]
    datelist.append(date)
maxpart = max(datelist)

df = spark.read.parquet("s3://bucket_path/data/batch_date=" + maxpart)

This does all the work with plain Python lists over the directory listing, without reading any Parquet data until it knows which partition to load.
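Since the answer mentions hdfs as well, here is a minimal sketch of the same listing trick for an HDFS (or any Hadoop-compatible) path, assuming an active SparkSession named spark and a hypothetical path hdfs:///data; it goes through Spark's JVM gateway (sparkContext._jvm), which is widely used but not a public API:

# List partition directories via the Hadoop FileSystem API (py4j).
jvm = spark.sparkContext._jvm
conf = spark.sparkContext._jsc.hadoopConfiguration()
path = jvm.org.apache.hadoop.fs.Path("hdfs:///data")  # hypothetical path
fs = path.getFileSystem(conf)

# Keep only 'batch_date=...' directories and extract the date part.
dates = [status.getPath().getName().split("=")[1]
         for status in fs.listStatus(path)
         if status.isDirectory() and "=" in status.getPath().getName()]
maxpart = max(dates)

df = spark.read.parquet("hdfs:///data/batch_date=" + maxpart)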




Answer 3:


Use SHOW PARTITIONS to get all partitions of the table:

show partitions TABLENAME

The output will look like:

pt=2012.07.28.08/is_complete=1
pt=2012.07.28.09/is_complete=1

We can get data from a specific partition using the query below:

select * from TABLENAME where pt='2012.07.28.10' and is_complete='1' limit 1;

Additional filters or a group by can also be applied on top of it.
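To tie this back to the question, a minimal PySpark sketch under the assumption that the data is registered as a metastore table, hypothetically named my_table and partitioned by batch_date; SHOW PARTITIONS reads only metastore metadata, so no data files are scanned to find the max:

# Each row has a single column named "partition", e.g. 'batch_date=2020-01-24'.
rows = spark.sql("SHOW PARTITIONS my_table").collect()
dates = [row.partition.split("=")[1] for row in rows]
latest = max(dates)

# Load only the latest partition.
df = spark.sql("SELECT * FROM my_table WHERE batch_date = '{}'".format(latest))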



Source: https://stackoverflow.com/questions/61818650/spark-find-max-of-date-partitioned-column
