Multiple spark jobs appending parquet data to same base path with partitioning

粉色の甜心 2020-12-08 01:07

I have multiple jobs that I want to execute in parallel, each appending daily data into the same base path using partitioning.

e.g.

dataFrame.write()
    .partitionBy("eventDate", "channel")
    .mode(SaveMode.Append)
    .parquet("s3://bucket/save/path");

4 Answers
  •  执笔经年 2020-12-08 01:26

    Instead of using partitionBy

    dataFrame.write()
        .partitionBy("eventDate", "channel")
        .mode(SaveMode.Append)
        .parquet("s3://bucket/save/path");

    Alternatively, you can write the files directly into their partition directories.

    In job-1, specify the parquet file path as:

    dataFrame.write()
        .mode(SaveMode.Append)
        .parquet("s3://bucket/save/path/eventDate=20160101/channel=billing_events");

    and in job-2, specify the parquet file path as:

    dataFrame.write()
        .mode(SaveMode.Append)
        .parquet("s3://bucket/save/path/eventDate=20160101/channel=click_events");
    
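    Putting the two snippets together, the sketch below shows what one such job could look like end to end. It is only a minimal illustration: the class name, the way eventDate/channel/source path arrive as arguments, and the SparkSession setup are assumptions, while the write call mirrors the job-1/job-2 snippets above.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class AppendDailyEvents {
        public static void main(String[] args) {
            // Illustrative arguments: job-1 would pass channel=billing_events,
            // job-2 would pass channel=click_events.
            String eventDate = args[0];   // e.g. "20160101"
            String channel = args[1];     // e.g. "billing_events"
            String sourcePath = args[2];  // wherever the day's data comes from (assumption)

            SparkSession spark = SparkSession.builder()
                    .appName("append-daily-events-" + channel)  // hypothetical app name
                    .getOrCreate();

            Dataset<Row> dataFrame = spark.read().parquet(sourcePath);

            // Write straight into the partition directory instead of using partitionBy,
            // so each concurrent job gets its own _temporary directory under its own folder.
            dataFrame.write()
                    .mode(SaveMode.Append)
                    .parquet("s3://bucket/save/path/eventDate=" + eventDate + "/channel=" + channel);

            spark.stop();
        }
    }
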
    1. Both jobs will create separate _temporary directories under their respective folders, so the concurrency issue is solved.
    2. Partition discovery will still pick up eventDate=20160101 and the channel column when reading back (see the read-back sketch after this list).
    3. Disadvantage: even if channel=click_events does not exist in the data, a parquet file for channel=click_events will still be created.
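
    For point 2, a minimal read-back sketch is shown below. The basePath option tells Spark where the partitioned layout starts, so partition discovery turns the eventDate=... and channel=... directories into columns; the SparkSession setup and class name are assumptions, and the paths simply reuse the bucket/path from the examples above.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ReadPartitionedEvents {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("read-partitioned-events")  // hypothetical app name
                    .getOrCreate();

            // Pointing the reader at a sub-directory while declaring the base path
            // keeps eventDate and channel available as regular columns.
            Dataset<Row> events = spark.read()
                    .option("basePath", "s3://bucket/save/path")
                    .parquet("s3://bucket/save/path/eventDate=20160101");

            // Both channels written by the parallel jobs can now be filtered on.
            events.filter("channel = 'billing_events'").show();

            spark.stop();
        }
    }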
