How to write TIMESTAMP logical type (INT96) to parquet, using ParquetWriter?


梦毁少年i

I have a tool that uses an org.apache.parquet.hadoop.ParquetWriter to convert CSV data files to Parquet data files.

Currently, it only handles int32,


2 answers

被撕碎了的回忆

2020-12-30 15:07

I figured it out, using this code from Spark SQL as a reference.

The INT96 binary encoding is split into two parts, both stored little-endian: the first 8 bytes hold the nanoseconds since midnight, and the last 4 bytes hold the Julian day number.

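import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.temporal.JulianFields;
import java.util.Calendar;
import java.util.TimeZone;
import java.util.concurrent.TimeUnit;
import org.apache.parquet.io.api.Binary;

String value = "2019-02-13 13:35:05";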

final long NANOS_PER_HOUR = TimeUnit.HOURS.toNanos(1);
final long NANOS_PER_MINUTE = TimeUnit.MINUTES.toNanos(1);
final long NANOS_PER_SECOND = TimeUnit.SECONDS.toNanos(1);

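// Parse date. The parser must use the same time zone as the calendar (UTC here);
// otherwise the JVM's default zone shifts the fields read back from the calendar.
SimpleDateFormat parser = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
parser.setTimeZone(TimeZone.getTimeZone("UTC"));
Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
cal.setTime(parser.parse(value));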

// Calculate Julian days and nanoseconds in the day
LocalDate dt = LocalDate.of(cal.get(Calendar.YEAR), cal.get(Calendar.MONTH)+1, cal.get(Calendar.DAY_OF_MONTH));
int julianDays = (int) JulianFields.JULIAN_DAY.getFrom(dt);
long nanos = (cal.get(Calendar.HOUR_OF_DAY) * NANOS_PER_HOUR)
        + (cal.get(Calendar.MINUTE) * NANOS_PER_MINUTE)
        + (cal.get(Calendar.SECOND) * NANOS_PER_SECOND);

// Write INT96 timestamp
byte[] timestampBuffer = new byte[12];
ByteBuffer buf = ByteBuffer.wrap(timestampBuffer);
buf.order(ByteOrder.LITTLE_ENDIAN).putLong(nanos).putInt(julianDays);

// This is the properly encoded INT96 timestamp
Binary tsValue = Binary.fromReusedByteArray(timestampBuffer);
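
For comparison, the same 12-byte layout can be produced with java.time alone. This is only a minimal sketch: the class name Int96Encoder and its encode method are made up for illustration, and it assumes the input text is a wall-clock value that should be stored as-is, with no time zone conversion.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.JulianFields;

// Hypothetical helper, not part of Parquet or Spark.
final class Int96Encoder {
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Returns 12 little-endian bytes: 8 bytes of nanoseconds since midnight,
    // followed by 4 bytes of Julian day number.
    static byte[] encode(String text) {
        LocalDateTime ts = LocalDateTime.parse(text, FORMAT);
        long nanosOfDay = ts.toLocalTime().toNanoOfDay();
        int julianDay = (int) ts.toLocalDate().getLong(JulianFields.JULIAN_DAY);
        return ByteBuffer.allocate(12)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(nanosOfDay)
                .putInt(julianDay)
                .array();
    }

    public static void main(String[] args) {
        byte[] bytes = encode("2019-02-13 13:35:05");
        System.out.println("Encoded " + bytes.length + " bytes");
    }
}
```

The resulting array can then be wrapped in a Binary (for example with Binary.fromConstantByteArray, since the array is not reused here) before being handed to the writer.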


难免孤独

2020-12-30 15:32
1. INT96 timestamps use the INT96 physical type without any logical type, so don't annotate them with anything.
2. If you are interested in the structure of an INT96 timestamp, take a look here. If you would like to see sample code that converts to and from this format, take a look at this file from Hive.
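
To illustrate the reverse direction, here is a rough sketch of reading an INT96 value back into a java.time value. It is not the Hive code referred to above; the class name Int96Decoder is hypothetical, and it assumes the 12 bytes follow the layout of 8 little-endian bytes of nanoseconds since midnight followed by 4 little-endian bytes of Julian day, with the nanosecond part inside a single day.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;

// Hypothetical helper, not part of Parquet or Hive.
final class Int96Decoder {
    // Julian day number of 1970-01-01, the java.time epoch day 0.
    private static final long JULIAN_DAY_OF_EPOCH = 2_440_588L;

    static LocalDateTime decode(byte[] int96) {
        if (int96.length != 12) {
            throw new IllegalArgumentException("INT96 value must be 12 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong(); // first 8 bytes
        int julianDay = buf.getInt();    // last 4 bytes
        LocalDate date = LocalDate.ofEpochDay(julianDay - JULIAN_DAY_OF_EPOCH);
        // Throws DateTimeException if nanosOfDay is outside one day.
        return LocalDateTime.of(date, LocalTime.ofNanoOfDay(nanosOfDay));
    }
}
```

The result is a zone-less LocalDateTime; which time zone the stored value represents depends on the system that wrote the file, so any conversion is left to the caller.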