How to read and write Map<String, Object> from/to parquet file in Java or Scala?

Posted by 你。 on 2019-12-18 19:06:11

Question


Looking for a concise example on how to read and write Map<String, Object> from/to parquet file in Java or Scala?

Here is the expected structure, using com.fasterxml.jackson.databind.ObjectMapper as the serializer in Java (i.e., I'm looking for the equivalent using Parquet):

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;

public static Map<String, Object> read(InputStream inputStream) throws IOException {
    ObjectMapper objectMapper = new ObjectMapper();
    return objectMapper.readValue(inputStream, new TypeReference<Map<String, Object>>() {});
}

public static void write(OutputStream outputStream, Map<String, Object> map) throws IOException {
    ObjectMapper objectMapper = new ObjectMapper();
    objectMapper.writeValue(outputStream, map);
}

Answer 1:


I'm not that familiar with Parquet, but here is an example using AvroParquetWriter and AvroParquetReader:

Schema schema = new Schema.Parser().parse(Resources.getResource("map.avsc").openStream());

File tmp = File.createTempFile(getClass().getSimpleName(), ".tmp");
tmp.deleteOnExit();
tmp.delete();
Path file = new Path(tmp.getPath());

AvroParquetWriter<GenericRecord> writer =
    new AvroParquetWriter<GenericRecord>(file, schema);

// Write a record with an empty map.
ImmutableMap<String, Integer> emptyMap = new ImmutableMap.Builder<String, Integer>().build();
GenericData.Record record = new GenericRecordBuilder(schema)
    .set("mymap", emptyMap).build();
writer.write(record);
writer.close();

AvroParquetReader<GenericRecord> reader = new AvroParquetReader<GenericRecord>(file);
GenericRecord nextRecord = reader.read();

assertNotNull(nextRecord);
assertEquals(emptyMap, nextRecord.get("mymap"));
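
The code above reads a map.avsc schema from the classpath, but the answer never shows it. As a reference point, an equivalent schema built programmatically with Avro's SchemaBuilder might look like the sketch below (the record and field names are assumptions matching the snippet, not the contents of the original file):

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;

    // A record with a single field "mymap": a map from string to int,
    // i.e. roughly what a matching map.avsc would declare.
    Schema schema = SchemaBuilder.record("MapRecord")
        .fields()
        .name("mymap").type().map().values().intType().noDefault()
        .endRecord();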

In your situation, replace the ImmutableMap (Google Guava) with a regular java.util.Map, as below:

Schema schema = new Schema.Parser().parse(Resources.getResource("map.avsc").openStream());

File tmp = File.createTempFile(getClass().getSimpleName(), ".tmp");
tmp.deleteOnExit();
tmp.delete();
Path file = new Path(tmp.getPath());

AvroParquetWriter<GenericRecord> writer =
    new AvroParquetWriter<GenericRecord>(file, schema);

// Build the map to write; SOMETHING is a placeholder for your value type.
Map<String, Object> map = new HashMap<String, Object>();
map.put("SOMETHING", new SOMETHING());

GenericData.Record record = new GenericRecordBuilder(schema)
    .set("mymap", map).build();
writer.write(record);
writer.close();

AvroParquetReader<GenericRecord> reader = new AvroParquetReader<GenericRecord>(file);
GenericRecord nextRecord = reader.read();

assertNotNull(nextRecord);
assertEquals(map, nextRecord.get("mymap"));

I haven't tested this code, but give it a try.
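
Note: in newer versions of parquet-avro (1.8+), these constructors are deprecated in favor of builders. A sketch of the same writer/reader setup under that assumption:

    // ParquetWriter/ParquetReader are in org.apache.parquet.hadoop.
    ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(file)
        .withSchema(schema)
        .build();

    ParquetReader<GenericRecord> reader = AvroParquetReader
        .<GenericRecord>builder(file)
        .build();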




Answer 2:


I doubt there is a readily available solution for this. For maps, it is still possible to create an Avro schema, provided the map's values are of a primitive type, or of a complex type whose fields are in turn primitive.

In your case:

  • If you have a Map<String, Integer>, the generated schema will have map values of type int.
  • If you have a Map<String, CustomObject>,
    • a. and CustomObject's fields are int, float, char ... (i.e., any primitive type), the schema generation will be valid and can then be used to successfully convert to Parquet.
    • b. and CustomObject has fields which are non-primitive, the generated schema will be malformed and the resulting ParquetWriter will fail.

To work around this, you can try converting your object to JSON and then using the Apache Spark libraries to convert it to Parquet, as sketched below.
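
A minimal sketch of that route in Java with Spark SQL (the paths, app name, and local master are illustrative assumptions):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder()
        .appName("MapToParquet")   // illustrative name
        .master("local[*]")        // illustrative; use your real master
        .getOrCreate();

    // Assumes the Map was first serialized to a JSON file,
    // e.g., with Jackson as in the question.
    Dataset<Row> df = spark.read().json("/data/file.json");
    df.write().parquet("/data/file.parquet");

    spark.stop();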




Answer 3:


Apache Drill is your answer!

Convert to Parquet: you can use Drill's CTAS (CREATE TABLE AS SELECT) feature. By default, Drill creates a folder of Parquet files when it executes the query below. You can substitute any query, and Drill will write your query's output as Parquet files:

create table file_parquet as select * from dfs.`/data/file.json`;

Convert from Parquet: we use the CTAS feature here as well, but ask Drill to use a different format when writing the output:

alter session set `store.format`='json';
create table file_json as select * from dfs.`/data/file.parquet`;

Refer to http://drill.apache.org/docs/create-table-as-ctas-command/ for more information.



Source: https://stackoverflow.com/questions/30565510/how-to-read-and-write-mapstring-object-from-to-parquet-file-in-java-or-scala
