Spark's int96 time type


The timestamp semantics are determined from the file metadata. We'll need some imports:

import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration

Some example data:

val path = "/tmp/ts"

Seq((1, "2017-03-06 10:00:00")).toDF("id", "ts")
  .withColumn("ts", $"ts".cast("timestamp"))
  .write.mode("overwrite").parquet(path)
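
As a quick sanity check, reading the file back should report ts as a timestamp column (with the default read settings):

// Read the file back and inspect the schema; ts comes back as timestamp.
spark.read.parquet(path).printSchema()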

And the Hadoop configuration:

val conf = spark.sparkContext.hadoopConfiguration
val fs = FileSystem.get(conf)

Now we can access Spark metadata:

ParquetFileReader
  .readAllFootersInParallel(conf, fs.getFileStatus(new Path(path)))
  .get(0)
  .getParquetMetadata
  .getFileMetaData
  .getKeyValueMetaData
  .get("org.apache.spark.sql.parquet.row.metadata")

and the result is:

String = {"type":"struct","fields":[
  {"name":"id","type":"integer","nullable":false,"metadata":{}},
  {"name":"ts","type":"timestamp","nullable":true,"metadata":{}}]}

Equivalent information can be stored in the Metastore as well.
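
For example, saving the same DataFrame with saveAsTable records the schema in the Metastore, where DESCRIBE shows the timestamp type. A sketch; the table name ts_table is made up:

Seq((1, "2017-03-06 10:00:00")).toDF("id", "ts")
  .withColumn("ts", $"ts".cast("timestamp"))
  .write.mode("overwrite").saveAsTable("ts_table")  // hypothetical table name

spark.sql("DESCRIBE ts_table").show()
// ts is listed with data type timestamp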

According to the official documentation, this is used to achieve compatibility with Hive and Impala:

Some Parquet-producing systems, in particular Impala and Hive, store Timestamp into INT96. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.

and can be controlled using the spark.sql.parquet.int96AsTimestamp property.
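
For completeness, the physical type can be confirmed from the same footer: with Spark's default settings the ts column is written as int96. A quick sketch, reusing conf, fs and path from above:

// The Parquet message type exposes the physical storage; ts shows up as int96.
ParquetFileReader
  .readAllFootersInParallel(conf, fs.getFileStatus(new Path(path)))
  .get(0)
  .getParquetMetadata
  .getFileMetaData
  .getSchema

// The read-side flag is enabled by default:
spark.conf.get("spark.sql.parquet.int96AsTimestamp")  // "true"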
