Why is partition key column missing from DataFrame

萌比男神i 2020-12-11 07:56

I have a job that loads a DataFrame and then saves the data to parquet format using the DataFrame partitionBy method. Then I publish the paths created s

2 answers
  •  被撕碎了的回忆
    2020-12-11 08:36

    If the other application loads a specific partition (which it appears to, given the load("hdfs://localhost:9000/ptest/id=0/") path), that application can tweak its code to replace the nulls with the partition column value:

    part = 0  # partition to load
    df2 = spark.read.format("parquet") \
                    .schema(df.schema) \
                    .load("ptest/id=" + str(part)) \
                    .fillna(part, ["id"])
    

    That way, the output will be:

    +---+-----+------+
    | id|score|letter|
    +---+-----+------+
    |  0|    1|     A|
    |  0|    1|     B|
    |  0|    2|     C|
    +---+-----+------+
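    An alternative to the schema/fillna workaround above is Spark's `basePath` read option, which tells partition discovery where the partitioned dataset starts so the `id` column is re-derived from the directory name. A sketch, assuming a local dataset at `/tmp/ptest` written with `partitionBy("id")` (the path is an assumption, not from the question):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("reader").getOrCreate()

    df = spark.createDataFrame(
        [(0, 1, "A"), (0, 1, "B"), (0, 2, "C")], ["id", "score", "letter"]
    )
    df.write.mode("overwrite").partitionBy("id").parquet("/tmp/ptest")

    # Reading one partition directory directly drops `id`: the value lives
    # in the directory name, not inside the parquet files.
    single = spark.read.parquet("/tmp/ptest/id=0")
    # single.columns -> ['score', 'letter']

    # With basePath, Spark treats /tmp/ptest as the table root and restores
    # the `id` column from the id=0 directory, with its real value.
    with_id = spark.read.option("basePath", "/tmp/ptest").parquet("/tmp/ptest/id=0")
    # 'id' is back in with_id.columns, filled with 0 for every row
    ```

    This avoids hard-coding the partition value twice (once in the path, once in fillna), at the cost of the reading application needing to know the dataset's root path.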
    
