How to read and write Map<String, Object> from/to parquet file in Java or Scala?

Posted by 你。 on 2019-12-18 19:06:11

Question


Looking for a concise example on how to read and write Map<String, Object> from/to parquet file in Java or Scala?

Here is the expected structure, using com.fasterxml.jackson.databind.ObjectMapper as the serializer in Java (i.e., I'm looking for the equivalent using Parquet):

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;

public static Map<String, Object> read(InputStream inputStream) throws IOException {
    ObjectMapper objectMapper = new ObjectMapper();
    return objectMapper.readValue(inputStream, new TypeReference<Map<String, Object>>() {});
}

public static void write(OutputStream outputStream, Map<String, Object> map) throws IOException {
    ObjectMapper objectMapper = new ObjectMapper();
    objectMapper.writeValue(outputStream, map);
}

Answer 1:


I'm not that familiar with Parquet, but here is an example using AvroParquetWriter and AvroParquetReader:

Schema schema = new Schema.Parser().parse(Resources.getResource("map.avsc").openStream());

File tmp = File.createTempFile(getClass().getSimpleName(), ".tmp");
tmp.deleteOnExit();
tmp.delete();
Path file = new Path(tmp.getPath());

AvroParquetWriter<GenericRecord> writer =
    new AvroParquetWriter<GenericRecord>(file, schema);

// Write a record with an empty map.
ImmutableMap<String, Integer> emptyMap = new ImmutableMap.Builder<String, Integer>().build();
GenericData.Record record = new GenericRecordBuilder(schema)
    .set("mymap", emptyMap).build();
writer.write(record);
writer.close();

AvroParquetReader<GenericRecord> reader = new AvroParquetReader<GenericRecord>(file);
GenericRecord nextRecord = reader.read();

assertNotNull(nextRecord);
assertEquals(emptyMap, nextRecord.get("mymap"));
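
The code above reads a map.avsc schema from the classpath, but the answer never shows it. As a reference point, an equivalent schema built programmatically with Avro's SchemaBuilder might look like the sketch below (the record and field names are assumptions matching the snippet, not the contents of the original file):

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;

    // A record with a single field "mymap": a map from string to int,
    // i.e. roughly what a matching map.avsc would declare.
    Schema schema = SchemaBuilder.record("MapRecord")
        .fields()
        .name("mymap").type().map().values().intType().noDefault()
        .endRecord();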

In your situation, replace the ImmutableMap (Google Guava) with a regular java.util.Map, as below:

Schema schema = new Schema.Parser().parse(Resources.getResource("map.avsc").openStream());

File tmp = File.createTempFile(getClass().getSimpleName(), ".tmp");
tmp.deleteOnExit();
tmp.delete();
Path file = new Path(tmp.getPath());

AvroParquetWriter<GenericRecord> writer =
    new AvroParquetWriter<GenericRecord>(file, schema);

// Build the map to write; SOMETHING is a placeholder for your value type.
Map<String, Object> map = new HashMap<String, Object>();
map.put("SOMETHING", new SOMETHING());

GenericData.Record record = new GenericRecordBuilder(schema)
    .set("mymap", map).build();
writer.write(record);
writer.close();

AvroParquetReader<GenericRecord> reader = new AvroParquetReader<GenericRecord>(file);
GenericRecord nextRecord = reader.read();

assertNotNull(nextRecord);
assertEquals(map, nextRecord.get("mymap"));

I haven't tested this code, but give it a try.
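
Note: in newer versions of parquet-avro (1.8+), these constructors are deprecated in favor of builders. A sketch of the same writer/reader setup under that assumption:

    // ParquetWriter/ParquetReader are in org.apache.parquet.hadoop.
    ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(file)
        .withSchema(schema)
        .build();

    ParquetReader<GenericRecord> reader = AvroParquetReader
        .<GenericRecord>builder(file)
        .build();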




Answer 2:


I doubt there is a readily available solution for this. For maps, it is still possible to create an Avro schema, provided the map's values are of a primitive type, or of a complex type whose fields are in turn primitive.

In your case:

  • If you have a Map<String, Integer>, the generated schema will have map values of type int.
  • If you have a Map<String, CustomObject>,
    • a. and CustomObject's fields are int, float, char ... (i.e., any primitive type), the schema generation will be valid and can then be used to successfully convert to Parquet.
    • b. and CustomObject has fields which are non-primitive, the generated schema will be malformed and the resulting ParquetWriter will fail.

To work around this, you can try converting your object to JSON and then using the Apache Spark libraries to convert it to Parquet, as sketched below.
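
A minimal sketch of that route in Java with Spark SQL (the paths, app name, and local master are illustrative assumptions):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder()
        .appName("MapToParquet")   // illustrative name
        .master("local[*]")        // illustrative; use your real master
        .getOrCreate();

    // Assumes the Map was first serialized to a JSON file,
    // e.g., with Jackson as in the question.
    Dataset<Row> df = spark.read().json("/data/file.json");
    df.write().parquet("/data/file.parquet");

    spark.stop();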




Answer 3:


Apache Drill is your answer!

Convert to Parquet: you can use Drill's CTAS (CREATE TABLE AS SELECT) feature. By default, Drill creates a folder of Parquet files when it executes the query below. You can substitute any query, and Drill will write your query's output as Parquet files:

create table file_parquet as select * from dfs.`/data/file.json`;

Convert from Parquet: we use the CTAS feature here as well, but ask Drill to use a different format when writing the output:

alter session set `store.format`='json';
create table file_json as select * from dfs.`/data/file.parquet`;

Refer to http://drill.apache.org/docs/create-table-as-ctas-command/ for more information.



Source: https://stackoverflow.com/questions/30565510/how-to-read-and-write-mapstring-object-from-to-parquet-file-in-java-or-scala
