Why is partition key column missing from DataFrame

Asked by 萌比男神i on 2020-12-11 07:56

I have a job which loads a DataFrame object and then saves the data to parquet format using the DataFrame partitionBy method. Then I publish the paths that were created so that another application can load the data. When that application reads a single partition directory, the partition key column is missing from (or null in) the resulting DataFrame. Why?
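
For context, a minimal sketch of the write side; the sample rows and the column names (id, score, letter) are assumptions taken from the output shown in the answers below, and the HDFS path matches the one used there:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # hypothetical data matching the sample output in the answers
    df = spark.createDataFrame(
        [(0, 1, "A"), (0, 1, "B"), (0, 2, "C")],
        ["id", "score", "letter"],
    )

    # partitionBy("id") creates one subdirectory per id value (ptest/id=0/, ...);
    # the id values live in the directory names, not inside the parquet files,
    # which is why a reader pointed at a single partition never sees the column
    df.write.partitionBy("id").parquet("hdfs://localhost:9000/ptest/")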

2 Answers
  • 2020-12-11 08:36

    In cases like this you should provide the basePath option:

    (spark.read
        .format("parquet")
        # basePath tells the reader where the partitioned layout starts,
        # so id=0/ is parsed as a partition value instead of being lost
        .option("basePath", "hdfs://localhost:9000/ptest/")
        .load("hdfs://localhost:9000/ptest/id=0/"))
    

    which points to the root directory of your data.

    With basePath, the DataFrameReader is aware of the partitioning and adjusts the schema accordingly.
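
    A related sketch (not from the original answer): instead of pointing load at a single partition directory, you can read the root path and filter. Spark's partition discovery then adds id to the schema, and the filter on the partition column prunes the read down to the one directory:

    # reading the root lets Spark discover the id=... directories automatically
    df = (spark.read
        .parquet("hdfs://localhost:9000/ptest/")
        .filter("id = 0"))
    df.printSchema()  # the id column is present via partition discovery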

  • 2020-12-11 08:36

    If the other application loads a specific partition, which it appears to do given the load("hdfs://localhost:9000/ptest/id=0/") path, that application can tweak its code to replace the nulls with the partition column's value:

    part = 0  # partition to load
    # df.schema is the full schema from the writing job, including the id
    # column; without it the reader would not know id exists at all.
    # fillna(part, ["id"]) then replaces the resulting nulls in id.
    df2 = (spark.read.format("parquet")
                     .schema(df.schema)
                     .load("ptest/id=" + str(part))
                     .fillna(part, ["id"]))
    

    That way, the output will be:

    +---+-----+------+
    | id|score|letter|
    +---+-----+------+
    |  0|    1|     A|
    |  0|    1|     B|
    |  0|    2|     C|
    +---+-----+------+
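
    A common alternative sketch (an assumption, not part of the original answer) is to overwrite the column with a literal via withColumn. This also covers non-numeric partition columns, where fillna with an integer value would leave a string column untouched:

    from pyspark.sql import functions as F

    part = 0  # partition to load
    # lit(part) sets id to the known partition value for every row
    df2 = (spark.read.format("parquet")
                     .schema(df.schema)
                     .load("ptest/id=" + str(part))
                     .withColumn("id", F.lit(part)))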
    