Problems saving partitioned parquet HIVE table from Spark


Question


Spark 1.6.0, Hive 1.1.0-cdh5.8.0

I have some problems saving my DataFrame into a parquet-backed partitioned Hive table from Spark.

Here is my code:

import org.apache.spark.sql.SaveMode

// sqlContext is assumed to be a HiveContext, so that saveAsTable
// goes through the Hive metastore
val df = sqlContext.createDataFrame(rowRDD, schema)
df.write
  .mode(SaveMode.Append)
  .format("parquet")
  .partitionBy("year")
  .saveAsTable(output)

Nothing special, actually, but I can't read any data back from the table once it has been created.

The key point is the partitioning: without it everything works fine. Here are the steps I took to fix the problem:

  1. At first, on a simple select, Hive reported that the table is not partitioned. OK, it seems Spark forgot to mention the partitioning scheme in the DDL. I fixed it by creating the table manually (see the sketch after this list).

  2. Attempt #2: still nothing. What is actually going on is that the Hive metastore doesn't know the table has any partitions in the warehouse. Fixed it with: hive> msck repair table

  3. Attempt #3: nope, and now Hive bursts with an exception, something like: java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: java.lang.NullPointerException. OK, so Spark defined the wrong serializer. Fixed it by setting STORED AS PARQUET.

  4. Nope. I don't remember what exception it was, but I realized that Spark had replaced my schema with a single column: col array COMMENT 'from deserializer'. I replaced it with the correct one, and yet another problem came out.
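
For reference, here is a minimal sketch of the manual fixes from steps 1–3, run through the HiveContext (they could equally be run in the hive CLI). The table name and columns are hypothetical placeholders, not my actual schema:

// Hypothetical DDL; "my_table" and its columns are placeholders.
// PARTITIONED BY and STORED AS PARQUET are the two clauses that the
// Spark-generated DDL was missing or got wrong.
sqlContext.sql("""
  CREATE TABLE IF NOT EXISTS my_table (
    id BIGINT,
    name STRING
  )
  PARTITIONED BY (year INT)
  STORED AS PARQUET
""")

// Let the metastore discover partition directories already on HDFS
sqlContext.sql("MSCK REPAIR TABLE my_table")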

And at that point I gave up. It seems to me that Spark generates completely wrong DDL when trying to create a table that does not yet exist in Hive. Yet everything works fine as soon as I remove the partitionBy statement.

So where am I going wrong, or is there perhaps a quick fix for this problem?
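
In case it helps, one possible workaround (an assumption on my side, not something I have verified on this exact setup) would be to bypass saveAsTable entirely: write the partitioned parquet files straight to a location on HDFS and let a manually created external table own the schema. A sketch, with a placeholder path:

// "hdfs:///dwh/my_table" is a placeholder location. Writing parquet
// directly avoids Spark's table-creation DDL entirely.
df.write
  .mode(SaveMode.Append)
  .partitionBy("year")
  .parquet("hdfs:///dwh/my_table")

// Point an external table at that location and register the partitions
sqlContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
    id BIGINT,
    name STRING
  )
  PARTITIONED BY (year INT)
  STORED AS PARQUET
  LOCATION 'hdfs:///dwh/my_table'
""")
sqlContext.sql("MSCK REPAIR TABLE my_table")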

Source: https://stackoverflow.com/questions/38784450/problems-saving-partitioned-parquet-hive-table-from-spark
