Problems saving partitioned parquet HIVE table from Spark


Question


Spark 1.6.0, Hive 1.1.0-cdh5.8.0

I have some problems saving my DataFrame into a parquet-backed partitioned Hive table from Spark.

Here is my code:

import org.apache.spark.sql.SaveMode

// sqlContext is assumed to be a HiveContext, so that saveAsTable
// goes through the Hive metastore
val df = sqlContext.createDataFrame(rowRDD, schema)
df.write
  .mode(SaveMode.Append)
  .format("parquet")
  .partitionBy("year")
  .saveAsTable(output)

Nothing special, actually, but I can't read any data back from the table once it has been created.

The key point is the partitioning: without it everything works fine. Here are the steps I took to fix the problem:

  1. At first, on a simple select, Hive reported that the table is not partitioned. OK, it seems Spark forgot to mention the partitioning scheme in the DDL. I fixed it by creating the table manually (see the sketch after this list).

  2. Attempt #2: still nothing. What is actually going on is that the Hive metastore doesn't know the table has any partitions in the warehouse. Fixed it with: hive> msck repair table

  3. Attempt #3: nope, and now Hive bursts with an exception, something like: java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: java.lang.NullPointerException. OK, so Spark defined the wrong serializer. Fixed it by setting STORED AS PARQUET.

  4. Nope. I don't remember what exception it was, but I realized that Spark had replaced my schema with a single column: col array COMMENT 'from deserializer'. I replaced it with the correct one, and yet another problem came out.
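
For reference, here is a minimal sketch of the manual fixes from steps 1–3, run through the HiveContext (they could equally be run in the hive CLI). The table name and columns are hypothetical placeholders, not my actual schema:

// Hypothetical DDL; "my_table" and its columns are placeholders.
// PARTITIONED BY and STORED AS PARQUET are the two clauses that the
// Spark-generated DDL was missing or got wrong.
sqlContext.sql("""
  CREATE TABLE IF NOT EXISTS my_table (
    id BIGINT,
    name STRING
  )
  PARTITIONED BY (year INT)
  STORED AS PARQUET
""")

// Let the metastore discover partition directories already on HDFS
sqlContext.sql("MSCK REPAIR TABLE my_table")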

And at that point I gave up. It seems to me that Spark generates completely wrong DDL when trying to create a table that does not yet exist in Hive. Yet everything works fine as soon as I remove the partitionBy statement.

So where am I going wrong, or is there perhaps a quick fix for this problem?
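
In case it helps, one possible workaround (an assumption on my side, not something I have verified on this exact setup) would be to bypass saveAsTable entirely: write the partitioned parquet files straight to a location on HDFS and let a manually created external table own the schema. A sketch, with a placeholder path:

// "hdfs:///dwh/my_table" is a placeholder location. Writing parquet
// directly avoids Spark's table-creation DDL entirely.
df.write
  .mode(SaveMode.Append)
  .partitionBy("year")
  .parquet("hdfs:///dwh/my_table")

// Point an external table at that location and register the partitions
sqlContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
    id BIGINT,
    name STRING
  )
  PARTITIONED BY (year INT)
  STORED AS PARQUET
  LOCATION 'hdfs:///dwh/my_table'
""")
sqlContext.sql("MSCK REPAIR TABLE my_table")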

Source: https://stackoverflow.com/questions/38784450/problems-saving-partitioned-parquet-hive-table-from-spark
