Cast int96 timestamp from parquet to golang

心已入冬 提交于 2019-11-29 16:49:48

The first 8 bytes are time in nanosecs, not millisecs. They are not measured from the epoch either, but from midnight. The date part is stored separatly in the last 4 bytes as Julian day number.

Here is the result of an experiment I did earlier that may help. I stored '2000-01-01 12:34:56' as an int96 and dumped with parquet-tools:

$ parquet-tools dump hdfs://path/to/parquet/file | tail -n 1
value 1: R:0 D:1 V:117253024523396126668760320

Since 117253024523396126668760320 = 0x60FD4B3229000059682500, the 12 bytes are 00 60 FD 4B 32 29 00 00 | 59 68 25 00, where | shows the boundary between the time and the date parts.

00 60 FD 4B 32 29 00 00 is the time part. We need to reverse the bytes because int96 timestamp use a reverse byte order, thus we get 0x000029324BFD6000 = 45296 * 10^9 nanoseconds = 45296 seconds = 12 hours + 34 minutes + 56 seconds.

59 68 25 00 is the date part, if we reverse the bytes we get 0x00256859 = 2451545 as the Julian day number, which corresponds to 2000-01-01.

ZAky

@Zoltan you definitely deserve the vote although you didn't supply a Golang sulotion.

Thanks to you and to https://github.com/carlosjhr64/jd

I wrote a function func int96ToJulian(parquetDate []byte) time.Time

playground

func int96ToJulian(parquetDate []byte) time.Time {
    nano := binary.LittleEndian.Uint64(parquetDate[:8])
    dt := binary.LittleEndian.Uint32(parquetDate[8:])

    l := dt + 68569
    n := 4 * l / 146097
    l = l - (146097*n+3)/4
    i := 4000 * (l + 1) / 1461001
    l = l - 1461*i/4 + 31
    j := 80 * l / 2447
    k := l - 2447*j/80
    l = j / 11
    j = j + 2 - 12*l
    i = 100*(n-49) + i + l

    tm := time.Date(int(i), time.Month(j), int(k), 0, 0, 0, 0, time.UTC)
    return tm.Add(time.Duration(nano))
}
标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!