Avoid losing data type for the partitioned data when writing from Spark


Question


I have a dataframe like the one below.

itemName, itemCategory
Name1, C0
Name2, C1
Name3, C0

I would like to save this dataframe as a partitioned parquet file:

df.write.mode("overwrite").partitionBy("itemCategory").parquet(path)

For this dataframe, when I read the data back, itemCategory will have String as its data type.

However, at times I have dataframes from other tenants like the one below.

itemName, itemCategory
Name1, 0
Name2, 1
Name3, 0

In this case, after being written partitioned and read back, the resulting dataframe will have Int as the data type of itemCategory.

Parquet files have metadata that describes the data types. How can I specify the data type for the partition column so it is read back as String instead of Int?
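
For illustration, here is a minimal sketch that reproduces the behavior (the /tmp/items path is hypothetical):

import spark.implicits._
// itemCategory holds string values, but they look numeric
val df = Seq(("Name1", "0"), ("Name2", "1"), ("Name3", "0"))
  .toDF("itemName", "itemCategory")
df.write.mode("overwrite").partitionBy("itemCategory").parquet("/tmp/items")
spark.read.parquet("/tmp/items").printSchema()
// root
//  |-- itemName: string (nullable = true)
//  |-- itemCategory: integer (nullable = true)  <- inferred as Int on read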


Answer 1:


If you set "spark.sql.sources.partitionColumnTypeInference.enabled" to "false", Spark will treat all partition columns as Strings instead of inferring their types.

In Spark 2.0 or later, you can set it like this:

spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

In Spark 1.6, like this:

sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

The downside is that you have to set this each time you read the data, but at least it works.
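
As a minimal sketch (assuming data was written with a numeric-looking partition column to a hypothetical /tmp/items path, as in the question):

// Disable partition type inference before reading
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
spark.read.parquet("/tmp/items").printSchema()
// root
//  |-- itemName: string (nullable = true)
//  |-- itemCategory: string (nullable = true)  <- no longer inferred as Int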




Answer 2:


As you partition by the itemCategory column, this data is stored in the directory structure and not in the actual parquet files. Spark infers the datatype from the partition values: if all values are integers, the column type will be Int.

One simple solution would be to cast the column to StringType after reading the data:

import org.apache.spark.sql.types.StringType
import spark.implicits._
df.withColumn("itemCategory", $"itemCategory".cast(StringType))

Another option would be to duplicate the column. Then one of the columns is used for the partitioning and, hence, is stored in the directory structure, while the other is saved normally inside the parquet file and keeps its type. To make a duplicate, simply use:

df.withColumn("itemCategoryCopy", $"itemCategory")
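
A sketch of the full round trip, partitioning on the copy so the original column keeps its type (the /tmp/items_copy path is hypothetical):

import spark.implicits._
df.withColumn("itemCategoryCopy", $"itemCategory")
  .write.mode("overwrite")
  .partitionBy("itemCategoryCopy")
  .parquet("/tmp/items_copy")
spark.read.parquet("/tmp/items_copy").printSchema()
// root
//  |-- itemName: string (nullable = true)
//  |-- itemCategory: string (nullable = true)       <- kept, stored in the files
//  |-- itemCategoryCopy: integer (nullable = true)  <- inferred from directory names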



Answer 3:


Read it with a schema:

import spark.implicits._
val path = "/tmp/test/input"
val source = Seq(("Name1", "0"), ("Name2", "1"), ("Name3", "0")).toDF("itemName", "itemCategory")
source.write.partitionBy("itemCategory").parquet(path)
spark.read.schema(source.schema).parquet(path).printSchema() 
// will print 
// root
// |-- itemName: string (nullable = true)
// |-- itemCategory: string (nullable = true)

See https://www.zepl.com/viewer/notebooks/bm90ZTovL2R2aXJ0ekBnbWFpbC5jb20vMzEzZGE2ZmZjZjY0NGRiZjk2MzdlZDE4NjEzOWJlZWYvbm90ZS5qc29u



Source: https://stackoverflow.com/questions/46657954/avoid-losing-data-type-for-the-partitioned-data-when-writing-from-spark
