Avoid losing data type for the partitioned data when writing from Spark


Question


I have a dataframe like the one below.

itemName, itemCategory
Name1, C0
Name2, C1
Name3, C0

I would like to save this dataframe as a partitioned parquet file:

df.write.mode("overwrite").partitionBy("itemCategory").parquet(path)

For this dataframe, when I read the data back, itemCategory will have String as its data type.

However, at times I have dataframes from other tenants like the one below.

itemName, itemCategory
Name1, 0
Name2, 1
Name3, 0

In this case, after being written partitioned and read back, the resulting dataframe will have Int as the data type of itemCategory.

Parquet files have metadata that describes the data types. How can I specify the data type for the partition column so it is read back as String instead of Int?
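
For illustration, here is a minimal sketch that reproduces the behavior (the /tmp/items path is hypothetical):

import spark.implicits._
// itemCategory holds string values, but they look numeric
val df = Seq(("Name1", "0"), ("Name2", "1"), ("Name3", "0"))
  .toDF("itemName", "itemCategory")
df.write.mode("overwrite").partitionBy("itemCategory").parquet("/tmp/items")
spark.read.parquet("/tmp/items").printSchema()
// root
//  |-- itemName: string (nullable = true)
//  |-- itemCategory: integer (nullable = true)  <- inferred as Int on read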


Answer 1:


If you set "spark.sql.sources.partitionColumnTypeInference.enabled" to "false", Spark will treat all partition columns as Strings instead of inferring their types.

In Spark 2.0 or later, you can set it like this:

spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

In Spark 1.6, like this:

sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

The downside is that you have to set this each time you read the data, but at least it works.
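
As a minimal sketch (assuming data was written with a numeric-looking partition column to a hypothetical /tmp/items path, as in the question):

// Disable partition type inference before reading
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
spark.read.parquet("/tmp/items").printSchema()
// root
//  |-- itemName: string (nullable = true)
//  |-- itemCategory: string (nullable = true)  <- no longer inferred as Int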




Answer 2:


As you partition by the itemCategory column, this data is stored in the directory structure and not in the actual parquet files. Spark infers the datatype from the partition values: if all values are integers, the column type will be Int.

One simple solution would be to cast the column to StringType after reading the data:

import org.apache.spark.sql.types.StringType
import spark.implicits._
df.withColumn("itemCategory", $"itemCategory".cast(StringType))

Another option would be to duplicate the column. Then one of the columns is used for the partitioning and, hence, is stored in the directory structure, while the other is saved normally inside the parquet file and keeps its type. To make a duplicate, simply use:

df.withColumn("itemCategoryCopy", $"itemCategory")
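
A sketch of the full round trip, partitioning on the copy so the original column keeps its type (the /tmp/items_copy path is hypothetical):

import spark.implicits._
df.withColumn("itemCategoryCopy", $"itemCategory")
  .write.mode("overwrite")
  .partitionBy("itemCategoryCopy")
  .parquet("/tmp/items_copy")
spark.read.parquet("/tmp/items_copy").printSchema()
// root
//  |-- itemName: string (nullable = true)
//  |-- itemCategory: string (nullable = true)       <- kept, stored in the files
//  |-- itemCategoryCopy: integer (nullable = true)  <- inferred from directory names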



Answer 3:


Read it with a schema:

import spark.implicits._
val path = "/tmp/test/input"
val source = Seq(("Name1", "0"), ("Name2", "1"), ("Name3", "0")).toDF("itemName", "itemCategory")
source.write.partitionBy("itemCategory").parquet(path)
spark.read.schema(source.schema).parquet(path).printSchema() 
// will print 
// root
// |-- itemName: string (nullable = true)
// |-- itemCategory: string (nullable = true)

See https://www.zepl.com/viewer/notebooks/bm90ZTovL2R2aXJ0ekBnbWFpbC5jb20vMzEzZGE2ZmZjZjY0NGRiZjk2MzdlZDE4NjEzOWJlZWYvbm90ZS5qc29u



Source: https://stackoverflow.com/questions/46657954/avoid-losing-data-type-for-the-partitioned-data-when-writing-from-spark
