Why is partition key column missing from DataFrame

萌比男神i 2020-12-11 07:56

I have a job that loads a DataFrame and then saves the data to parquet format using the DataFrame partitionBy method. Then I publish the paths created s

2 answers
  •  被撕碎了的回忆
    2020-12-11 08:36

    If the other application loads a specific partition (which it appears to, given the load("hdfs://localhost:9000/ptest/id=0/") path), that application can tweak its code to replace the nulls with the partition column value:

    part = 0  # partition to load
    df2 = spark.read.format("parquet") \
                    .schema(df.schema) \
                    .load("ptest/id=" + str(part)) \
                    .fillna(part, ["id"])
    

    That way, the output will be:

    +---+-----+------+
    | id|score|letter|
    +---+-----+------+
    |  0|    1|     A|
    |  0|    1|     B|
    |  0|    2|     C|
    +---+-----+------+
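    An alternative to the schema/fillna workaround above is Spark's `basePath` read option, which tells partition discovery where the partitioned dataset starts so the `id` column is re-derived from the directory name. A sketch, assuming a local dataset at `/tmp/ptest` written with `partitionBy("id")` (the path is an assumption, not from the question):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("reader").getOrCreate()

    df = spark.createDataFrame(
        [(0, 1, "A"), (0, 1, "B"), (0, 2, "C")], ["id", "score", "letter"]
    )
    df.write.mode("overwrite").partitionBy("id").parquet("/tmp/ptest")

    # Reading one partition directory directly drops `id`: the value lives
    # in the directory name, not inside the parquet files.
    single = spark.read.parquet("/tmp/ptest/id=0")
    # single.columns -> ['score', 'letter']

    # With basePath, Spark treats /tmp/ptest as the table root and restores
    # the `id` column from the id=0 directory, with its real value.
    with_id = spark.read.option("basePath", "/tmp/ptest").parquet("/tmp/ptest/id=0")
    # 'id' is back in with_id.columns, filled with 0 for every row
    ```

    This avoids hard-coding the partition value twice (once in the path, once in fillna), at the cost of the reading application needing to know the dataset's root path.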
    
