Why is partition key column missing from DataFrame

Asked by 萌比男神i on 2020-12-11 07:56

I have a job which loads a DataFrame object and then saves the data to parquet format using the DataFrame partitionBy method. Then I publish the paths that were created so that another application can load the data. When that application reads a single partition directory, the partition key column is missing from (or null in) the resulting DataFrame. Why?
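
For context, a minimal sketch of the write side; the sample rows and the column names (id, score, letter) are assumptions taken from the output shown in the answers below, and the HDFS path matches the one used there:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # hypothetical data matching the sample output in the answers
    df = spark.createDataFrame(
        [(0, 1, "A"), (0, 1, "B"), (0, 2, "C")],
        ["id", "score", "letter"],
    )

    # partitionBy("id") creates one subdirectory per id value (ptest/id=0/, ...);
    # the id values live in the directory names, not inside the parquet files,
    # which is why a reader pointed at a single partition never sees the column
    df.write.partitionBy("id").parquet("hdfs://localhost:9000/ptest/")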

2 Answers
  • 2020-12-11 08:36

    In cases like this you should provide the basePath option:

    (spark.read
        .format("parquet")
        # basePath tells the reader where the partitioned layout starts,
        # so id=0/ is parsed as a partition value instead of being lost
        .option("basePath", "hdfs://localhost:9000/ptest/")
        .load("hdfs://localhost:9000/ptest/id=0/"))
    

    which points to the root directory of your data.

    With basePath, the DataFrameReader is aware of the partitioning and adjusts the schema accordingly.
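
    A related sketch (not from the original answer): instead of pointing load at a single partition directory, you can read the root path and filter. Spark's partition discovery then adds id to the schema, and the filter on the partition column prunes the read down to the one directory:

    # reading the root lets Spark discover the id=... directories automatically
    df = (spark.read
        .parquet("hdfs://localhost:9000/ptest/")
        .filter("id = 0"))
    df.printSchema()  # the id column is present via partition discovery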

  • 2020-12-11 08:36

    If the other application loads a specific partition, which it appears to do given the load("hdfs://localhost:9000/ptest/id=0/") path, that application can tweak its code to replace the nulls with the partition column's value:

    part = 0  # partition to load
    # df.schema is the full schema from the writing job, including the id
    # column; without it the reader would not know id exists at all.
    # fillna(part, ["id"]) then replaces the resulting nulls in id.
    df2 = (spark.read.format("parquet")
                     .schema(df.schema)
                     .load("ptest/id=" + str(part))
                     .fillna(part, ["id"]))
    

    That way, the output will be:

    +---+-----+------+
    | id|score|letter|
    +---+-----+------+
    |  0|    1|     A|
    |  0|    1|     B|
    |  0|    2|     C|
    +---+-----+------+
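
    A common alternative sketch (an assumption, not part of the original answer) is to overwrite the column with a literal via withColumn. This also covers non-numeric partition columns, where fillna with an integer value would leave a string column untouched:

    from pyspark.sql import functions as F

    part = 0  # partition to load
    # lit(part) sets id to the known partition value for every row
    df2 = (spark.read.format("parquet")
                     .schema(df.schema)
                     .load("ptest/id=" + str(part))
                     .withColumn("id", F.lit(part)))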
    