Spark's int96 time type


The timestamp semantics are determined from the file metadata. We'll need some imports:

import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration

Some example data:

val path = "/tmp/ts"

Seq((1, "2017-03-06 10:00:00")).toDF("id", "ts")
  .withColumn("ts", $"ts".cast("timestamp"))
  .write.mode("overwrite").parquet(path)
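
As a quick sanity check, reading the file back should report ts as a timestamp column (with the default read settings):

// Read the file back and inspect the schema; ts comes back as timestamp.
spark.read.parquet(path).printSchema()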

And the Hadoop configuration:

val conf = spark.sparkContext.hadoopConfiguration
val fs = FileSystem.get(conf)

Now we can access Spark metadata:

ParquetFileReader
  .readAllFootersInParallel(conf, fs.getFileStatus(new Path(path)))
  .get(0)
  .getParquetMetadata
  .getFileMetaData
  .getKeyValueMetaData
  .get("org.apache.spark.sql.parquet.row.metadata")

and the result is:

String = {"type":"struct","fields":[
  {"name":"id","type":"integer","nullable":false,"metadata":{}},
  {"name":"ts","type":"timestamp","nullable":true,"metadata":{}}]}

Equivalent information can be stored in the Metastore as well.
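
For example, saving the same DataFrame with saveAsTable records the schema in the Metastore, where DESCRIBE shows the timestamp type. A sketch; the table name ts_table is made up:

Seq((1, "2017-03-06 10:00:00")).toDF("id", "ts")
  .withColumn("ts", $"ts".cast("timestamp"))
  .write.mode("overwrite").saveAsTable("ts_table")  // hypothetical table name

spark.sql("DESCRIBE ts_table").show()
// ts is listed with data type timestamp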

According to the official documentation, this is used to achieve compatibility with Hive and Impala:

Some Parquet-producing systems, in particular Impala and Hive, store Timestamp into INT96. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.

and can be controlled using the spark.sql.parquet.int96AsTimestamp property.
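
For completeness, the physical type can be confirmed from the same footer: with Spark's default settings the ts column is written as int96. A quick sketch, reusing conf, fs and path from above:

// The Parquet message type exposes the physical storage; ts shows up as int96.
ParquetFileReader
  .readAllFootersInParallel(conf, fs.getFileStatus(new Path(path)))
  .get(0)
  .getParquetMetadata
  .getFileMetaData
  .getSchema

// The read-side flag is enabled by default:
spark.conf.get("spark.sql.parquet.int96AsTimestamp")  // "true"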
