Avoid losing the data type of partitioned data when writing from Spark


If you set "spark.sql.sources.partitionColumnTypeInference.enabled" to "false", Spark will skip type inference and read all partition columns as strings.

In Spark 2.0 or greater, you can set it like this:

spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

In 1.6, like this:

sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

The downside is you have to do this each time you read the data, but at least it works.
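For example, here is a minimal end-to-end sketch (the path and column names are made up for illustration, assuming a local SparkSession):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val path = "/tmp/test/no_inference"
Seq(("Name1", "0"), ("Name2", "1")).toDF("itemName", "itemCategory")
  .write.mode("overwrite").partitionBy("itemCategory").parquet(path)

// Disable inference before reading the partitioned data back
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
spark.read.parquet(path).printSchema()
// itemCategory now comes back as string instead of integer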

Since you partition by the itemCategory column, these values are stored in the directory structure and not in the actual CSV files. Spark infers the data type from the values: if all of them parse as integers, the column type will be int.
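To see this in action, here is a small sketch (hypothetical path, assuming a SparkSession named spark as in the other snippets):

import spark.implicits._

val demoPath = "/tmp/test/inferred"
Seq(("Name1", "0"), ("Name2", "1")).toDF("itemName", "itemCategory")
  .write.mode("overwrite").partitionBy("itemCategory").parquet(demoPath)
// On disk the partition values live only in the directory names, e.g.
//   /tmp/test/inferred/itemCategory=0/part-...parquet
spark.read.parquet(demoPath).printSchema()
// root
//  |-- itemName: string (nullable = true)
//  |-- itemCategory: integer (nullable = true)   <- inferred as int, not string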

One simple solution would be to cast the column to StringType after reading the data:

import org.apache.spark.sql.types.StringType
import spark.implicits._
// cast the inferred partition column back to string
df.withColumn("itemCategory", $"itemCategory".cast(StringType))

Another option would be to duplicate the column. One of the columns will then be used for the partitioning and hence be saved in the file structure, while the duplicate is saved normally in the parquet files. To make the duplicate, simply use:

df.withColumn("itemCategoryCopy", $"itemCategory")

Alternatively, read it back with an explicit schema:

import spark.implicits._
val path = "/tmp/test/input"
val source = Seq(("Name1", "0"), ("Name2", "1"), ("Name3", "0")).toDF("itemName", "itemCategory")
source.write.partitionBy("itemCategory").parquet(path)
spark.read.schema(source.schema).parquet(path).printSchema() 
// will print 
// root
// |-- itemName: string (nullable = true)
// |-- itemCategory: string (nullable = true)
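The tradeoff here is that the schema has to be known (or persisted somewhere) at read time, but in return the partition column keeps its original string type.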

See https://www.zepl.com/viewer/notebooks/bm90ZTovL2R2aXJ0ekBnbWFpbC5jb20vMzEzZGE2ZmZjZjY0NGRiZjk2MzdlZDE4NjEzOWJlZWYvbm90ZS5qc29u
