This is kind of an edge case: saving a parquet table in Spark SQL with a partition.
```java
import java.util.Arrays;

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// schema definition
final StructType schema = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("time", DataTypes.StringType, true),
    DataTypes.createStructField("accountId", DataTypes.StringType, true),
    ...

DataFrame df = hiveContext.read().schema(schema).json(stringJavaRDD);

df.coalesce(1)
  .write()
  .mode(SaveMode.Append)
  .format("parquet")
  .partitionBy("year")
  .saveAsTable("tblclick8partitioned");
```

Spark warns:
> Persisting partitioned data source relation into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive
In Hive:
```
hive> describe tblclick8partitioned;
OK
col                     array<string>           from deserializer
Time taken: 0.04 seconds, Fetched: 1 row(s)
```

Obviously the schema is not correct; however, if I use saveAsTable in Spark SQL without partitioning, the table can be queried without a problem.
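For reference, this is a minimal sketch of the non-partitioned variant I mean (the target table name `tblclick8` is made up); this one is readable from Hive:

```java
// Same DataFrame, same writer settings, just without partitionBy();
// describing/querying this table from Hive works as expected.
// (The table name "tblclick8" is hypothetical.)
df.coalesce(1)
  .write()
  .mode(SaveMode.Append)
  .format("parquet")
  .saveAsTable("tblclick8");
```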
The question is: how can I make a parquet table in Spark SQL compatible with Hive, with the partition info intact?
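For what it's worth, the workaround I have been sketching (untested; the output path, the spelled-out column list, and issuing `MSCK REPAIR TABLE` through the HiveContext are all assumptions) is to bypass `saveAsTable` entirely: write the parquet files to an explicit location, declare an external Hive table over that location, and let Hive discover the `year=...` partition directories:

```java
// Untested sketch: write partitioned parquet to a plain path instead of
// the metastore, then register a Hive-compatible external table on top.
// The path "/data/tblclick8partitioned" is hypothetical.
df.coalesce(1)
  .write()
  .mode(SaveMode.Append)
  .partitionBy("year")
  .parquet("/data/tblclick8partitioned");

// Declare the real schema explicitly (only the first two columns shown;
// the remaining fields from the StructType would go here too). The
// partition column belongs in PARTITIONED BY, not in the column list.
hiveContext.sql(
    "CREATE EXTERNAL TABLE IF NOT EXISTS tblclick8partitioned ("
  + " time STRING,"
  + " accountId STRING"
  + ") PARTITIONED BY (year STRING)"
  + " STORED AS PARQUET"
  + " LOCATION '/data/tblclick8partitioned'");

// Ask Hive to pick up the year=... directories written by partitionBy().
hiveContext.sql("MSCK REPAIR TABLE tblclick8partitioned");
```

But I would prefer a way to get `saveAsTable` itself to produce a Hive-readable partitioned table, if one exists.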