Spark SQL saveAsTable is not compatible with Hive when partition is specified

Anonymous (unverified), submitted 2019-12-03 01:48:02

Question:

Kind of an edge case. When saving a parquet table in Spark SQL with a partition:

// schema definition
final StructType schema = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("time", DataTypes.StringType, true),
    DataTypes.createStructField("accountId", DataTypes.StringType, true),
    ...

DataFrame df = hiveContext.read().schema(schema).json(stringJavaRDD);

df.coalesce(1)
    .write()
    .mode(SaveMode.Append)
    .format("parquet")
    .partitionBy("year")
    .saveAsTable("tblclick8partitioned");

Spark warns:

Persisting partitioned data source relation into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive

In Hive:

hive> describe tblclick8partitioned;
OK
col                     array<string>           from deserializer
Time taken: 0.04 seconds, Fetched: 1 row(s)

Obviously the schema is not correct; however, if I use saveAsTable in Spark SQL without a partition, the table can be queried without a problem.

The question is: how can I make a parquet table in Spark SQL compatible with Hive, with the partition info intact?

Answer 1:

That's because DataFrame.saveAsTable creates RDD partitions but not Hive partitions. The workaround is to create the table via HQL before calling DataFrame.saveAsTable. An example from SPARK-14927 looks like this:

hc.sql("create external table tmp.partitiontest1(val string) partitioned by (year int)")

Seq(2012 -> "a", 2013 -> "b", 2014 -> "c").toDF("year", "val")
  .write
  .partitionBy("year")
  .mode(SaveMode.Append)
  .saveAsTable("tmp.partitiontest1")
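Once the append has run, the partitions should be queryable from Hive. A quick way to check (the exact output formatting depends on your Hive version; the partition values follow from the example data above):

hive> show partitions tmp.partitiontest1;
OK
year=2012
year=2013
year=2014

If the partitions don't show up, msck repair table tmp.partitiontest1 asks Hive to re-scan the table directory and register any partitions it finds on disk.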


Answer 2:

A solution is to create the table with Hive and then save the data with ...partitionBy("year").insertInto("default.mytable").

In my experience, creating the table in Hive and then using ...partitionBy("year").saveAsTable("default.mytable") did not work. This is with Spark 1.6.2.
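For reference, a minimal end-to-end sketch of this approach in Scala, assuming Spark 1.6-era behavior as described above and the df from the question; the table name, DDL, and column list are illustrative, and the dynamic-partition settings are standard Hive knobs that insertInto typically needs:

// 1) Create the partitioned table in Hive first, so the metastore entry is Hive-compatible.
hiveContext.sql(
  "create table default.mytable (time string, accountId string) " +
  "partitioned by (year int) stored as parquet")

// 2) Dynamic-partition inserts usually require these Hive settings.
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

// 3) insertInto resolves columns by position, with partition columns last,
//    so "year" must be the final column of df.
df.write
  .partitionBy("year")
  .insertInto("default.mytable")

Note that insertInto writes into the existing Hive table definition instead of re-creating it, which is why the schema stays Hive-readable.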


