Spark csv data validation failed for date and timestamp data types of Hive

Submitted by 狂风中的少年 on 2019-12-25 07:47:01

Question


Hive Table Schema:

c_date                  date                                        
c_timestamp             timestamp   

It is a text table.

Hive Table data:

hive> select * from all_datetime_types;
OK
0001-01-01  0001-01-01 00:00:00.000000001
9999-12-31  9999-12-31 23:59:59.999999999

csv obtained after spark job:

c_date,c_timestamp
0001-01-01 00:00:00.0,0001-01-01 00:00:00.0
9999-12-31 00:00:00.0,9999-12-31 23:59:59.999

Issues:

  • a time component (00:00:00.0) is appended to the date column
  • the timestamp is truncated to millisecond precision
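The millisecond truncation is consistent with the CSV writer formatting values through the millisecond-based java.util.Date APIs: java.sql.Timestamp carries nanoseconds, but SimpleDateFormat can render at most milliseconds. A small plain-Java illustration of the mismatch (the class name and pattern are mine, not from the Spark code path):

```java
import java.sql.Timestamp;
import java.text.SimpleDateFormat;

public class PrecisionDemo {
    public static void main(String[] args) {
        // java.sql.Timestamp stores nanoseconds, and toString() keeps all of them:
        Timestamp ts = Timestamp.valueOf("9999-12-31 23:59:59.999999999");
        System.out.println(ts);  // 9999-12-31 23:59:59.999999999

        // SimpleDateFormat operates on millisecond-based java.util.Date,
        // so anything finer than .SSS is dropped when formatting:
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
        System.out.println(fmt.format(ts));  // 9999-12-31 23:59:59.999
    }
}
```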

Useful code:

SparkConf conf = new SparkConf(true).setMaster("yarn-cluster").setAppName("SAMPLE_APP");
SparkContext sc = new SparkContext(conf);
HiveContext hc = new HiveContext(sc);
DataFrame df = hc.table("testdb.all_datetime_types");
df.printSchema();
DataFrameWriter writer = df.repartition(1).write();
writer.format("com.databricks.spark.csv").option("header", "true").save(outputHdfsFile);

I am aware of the dateFormat option, but date and timestamp columns can have different formats in Hive.

Can I simply convert all columns to String?
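Converting to string on the Hive side should preserve the values, since the JDBC types backing date and timestamp columns round-trip their text form exactly. A quick plain-Java sanity check (the class name is mine):

```java
import java.sql.Date;
import java.sql.Timestamp;

public class ToStringDemo {
    public static void main(String[] args) {
        // toString() on the JDBC date/timestamp types reproduces the Hive text exactly:
        // no time component is appended to dates, no precision is lost on timestamps.
        Date d = Date.valueOf("0001-01-01");
        Timestamp ts = Timestamp.valueOf("0001-01-01 00:00:00.000000001");
        System.out.println(d);   // 0001-01-01
        System.out.println(ts);  // 0001-01-01 00:00:00.000000001
    }
}
```

So a query like `select cast(c_date as string), cast(c_timestamp as string) from testdb.all_datetime_types` (an untested sketch) is a reasonable way to sidestep the writer's formatting.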


Answer 1:


You can use the timestampFormat option in Spark to specify your timestamp format; for example, when reading:

spark.read.option("timestampFormat", "MM/dd/yyyy h:mm:ss a").csv("path")

In Spark 2.x the built-in CSV writer accepts timestampFormat (and dateFormat) as well, e.g. df.write().option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSS").csv(path).



Answer 2:


Spark's TimestampType holds up to microsecond precision (Hive's timestamp allows nanoseconds). You can try mapping the date and timestamp columns explicitly, like below:

DataFrame df = hiveContext.sql("select from_unixtime(unix_timestamp(c_date, 'yyyy-MM-dd'), 'yyyy-MM-dd') as c_date, from_unixtime(unix_timestamp(c_timestamp, 'yyyy-MM-dd HH:mm:ss.SSSSSS'), 'yyyy-MM-dd HH:mm:ss.SSSSSS') as c_timestamp from testdb.all_datetime_types");

Note, however, that unix_timestamp resolves to whole seconds, so fractional seconds can be lost with this approach.


Source: https://stackoverflow.com/questions/42979217/spark-csv-data-validation-failed-for-date-and-timestamp-data-types-of-hive
