Create Parquet files in Java

Question

Is there a way to create Parquet files from Java?

I have data in memory (Java classes) and I want to write it to a Parquet file, so I can read it later from Apache Drill.

Is there a simple way to do this, like inserting data into a SQL table?

GOT IT

Thanks for the help.

Combining the answers and this link, I was able to create a parquet file and read it back with drill.

Answer 1:

ParquetWriter's constructors are deprecated (as of 1.8.1), but ParquetWriter itself is not. You can still create a ParquetWriter by extending the abstract Builder class nested inside it.

Here is an example from the Parquet creators themselves, taken from ExampleParquetWriter:

  public static class Builder extends ParquetWriter.Builder<Group, Builder> {
    private MessageType type = null;
    private Map<String, String> extraMetaData = new HashMap<String, String>();

    private Builder(Path file) {
      super(file);
    }

    public Builder withType(MessageType type) {
      this.type = type;
      return this;
    }

    public Builder withExtraMetaData(Map<String, String> extraMetaData) {
      this.extraMetaData = extraMetaData;
      return this;
    }

    @Override
    protected Builder self() {
      return this;
    }

    @Override
    protected WriteSupport<Group> getWriteSupport(Configuration conf) {
      return new GroupWriteSupport(type, extraMetaData);
    }

  }

If you don't want to use Group and GroupWriteSupport (bundled with Parquet, but intended only as an example of a data-model implementation), you can use the Avro, Protocol Buffers, or Thrift in-memory data models instead. Here is an example of writing Parquet using Avro:

try (ParquetWriter<GenericData.Record> writer = AvroParquetWriter
        .<GenericData.Record>builder(fileToWrite)
        .withSchema(schema)
        .withConf(new Configuration())
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build()) {
    for (GenericData.Record record : recordsToWrite) {
        writer.write(record);
    }
}
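The snippet above leaves schema and recordsToWrite undefined. One way they might be built is sketched below, assuming Avro's SchemaBuilder and GenericData from the avro artifact that parquet-avro pulls in. The User record, its two fields, and the class and method names are made up for illustration; they are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;

// Hypothetical helper: builds an Avro schema and a few records to write.
class AvroRecordsSketch {

    // Assumed example schema: a "User" record with a required int and a required string.
    static Schema userSchema() {
        return SchemaBuilder.record("User")
                .namespace("example")
                .fields()
                .requiredInt("id")
                .requiredString("name")
                .endRecord();
    }

    // Creates one generic record that conforms to the schema above.
    static GenericData.Record user(Schema schema, int id, String name) {
        GenericData.Record record = new GenericData.Record(schema);
        record.put("id", id);
        record.put("name", name);
        return record;
    }

    // Sample data standing in for whatever in-memory objects you actually have.
    static List<GenericData.Record> sampleRecords(Schema schema) {
        List<GenericData.Record> records = new ArrayList<>();
        records.add(user(schema, 1, "Ada"));
        records.add(user(schema, 2, "Grace"));
        return records;
    }
}
```

With this, schema would be AvroRecordsSketch.userSchema() and recordsToWrite would be AvroRecordsSketch.sampleRecords(schema). fileToWrite is an org.apache.hadoop.fs.Path, for example new Path("users.parquet").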

You will need these dependencies:

<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.8.1</version>
</dependency>

<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.8.1</version>
</dependency>

Full example here.

Answer 2:

A few possible ways to do it:

1. Use the Java Parquet library to write Parquet directly from your code.
2. Connect to Hive or Impala using JDBC and insert the data using SQL. Note that inserting rows one by one produces a separate file for each record, which ruins performance. You would have to insert many rows at once, which is not trivial, so I don't recommend this approach.
3. Save the data to a delimited text file, then do the following steps in either Hive or Impala:
   - Define a table over the text file so that Hive/Impala can read the data. Let's call this table text_table. See Impala's Create Table Statement for details.
   - Create a new table with identical columns, but specify Parquet as its file format. Let's call this table parquet_table.
   - Finally, run INSERT INTO parquet_table SELECT * FROM text_table to copy all data from the text file into the Parquet table.
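For the delimited-text route, dumping the in-memory objects to a file could look roughly like the sketch below. It uses only the JDK. The User class, the '|' delimiter, and the class and method names are made up for illustration; the delimiter has to match whatever the text_table definition declares in Hive/Impala.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: writes in-memory objects as one delimited line per row.
class DelimitedExport {

    // Assumed delimiter; it must match the field terminator of the text table.
    static final char DELIMITER = '|';

    // Hypothetical in-memory row type standing in for your own classes.
    static final class User {
        final int id;
        final String name;

        User(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    // Formats one row. Values containing the delimiter or a line break are
    // rejected, because this sketch does no escaping.
    static String toLine(User user) {
        String name = user.name;
        if (name.indexOf(DELIMITER) >= 0 || name.indexOf('\n') >= 0 || name.indexOf('\r') >= 0) {
            throw new IllegalArgumentException("value contains delimiter or line break: " + name);
        }
        return user.id + String.valueOf(DELIMITER) + name;
    }

    // Writes all rows to the target file as UTF-8, one row per line.
    static void write(Path target, List<User> users) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (User user : users) {
                out.write(toLine(user));
                out.write('\n');
            }
        }
    }
}
```

The resulting file then has to be placed where Hive/Impala can read it (typically the table's storage location) before running the three steps above.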

Source: https://stackoverflow.com/questions/39728854/create-parquet-files-in-java

Tags

java

parquet