I have a job which loads a DataFrame object and then saves the data to parquet format using the DataFrame partitionBy method. Then I publish the paths created so that another application can load the individual partitions.
In a case like this you should provide the basePath option:
(spark.read
    .format("parquet")
    .option("basePath", "hdfs://localhost:9000/ptest/")
    .load("hdfs://localhost:9000/ptest/id=0/"))
The basePath option points to the root directory of your data. With it set, the DataFrameReader is aware of the partitioning and adjusts the schema accordingly.
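To see the difference, compare the schemas with and without basePath; this is a sketch assuming the same hdfs://localhost:9000/ptest/ layout as above:

# Loading the partition directory directly treats it as the dataset root,
# so the id column encoded in the path is lost:
df_no_base = spark.read.format("parquet") \
    .load("hdfs://localhost:9000/ptest/id=0/")
df_no_base.printSchema()   # only score and letter

# With basePath, the reader recovers id from the directory name:
df_base = (spark.read
    .format("parquet")
    .option("basePath", "hdfs://localhost:9000/ptest/")
    .load("hdfs://localhost:9000/ptest/id=0/"))
df_base.printSchema()      # id, score and letter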
If the other application is loading a specific partition, as the load("hdfs://localhost:9000/ptest/id=0/") path suggests, that application can tweak its code to replace the null with the partition column value:
part = 0  # partition to load

df2 = (spark.read.format("parquet")
    .schema(df.schema)              # schema of the original DataFrame, including id
    .load("ptest/id=" + str(part))
    .fillna(part, ["id"]))          # fill the null id column with the partition value
That way, the output will be:
+---+-----+------+
| id|score|letter|
+---+-----+------+
| 0| 1| A|
| 0| 1| B|
| 0| 2| C|
+---+-----+------+
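Since every row in the loaded directory belongs to the same partition, an equivalent and arguably more explicit variant replaces fillna with an unconditional withColumn; a sketch using pyspark.sql.functions.lit:

from pyspark.sql.functions import lit

part = 0  # partition to load

df2 = (spark.read.format("parquet")
    .schema(df.schema)
    .load("ptest/id=" + str(part))
    .withColumn("id", lit(part)))   # every row in this directory has id == part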