Example to read and write parquet file using ParquetIO through Apache Beam

Posted by 为君一笑 on 2021-02-09 17:46:08

Question


Has anybody tried reading or writing Parquet files using Apache Beam? Support was added only recently, in version 2.5.0, so there is not much documentation yet.

I am trying to read a JSON input file and would like to write it out in Parquet format.

Thanks in advance.


Answer 1:


You will need to use ParquetIO.Sink. It implements FileIO.Sink, so it can be passed to FileIO.write().via(...).




Answer 2:


Add the following dependency, because ParquetIO lives in a separate module.

<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-io-parquet</artifactId>
    <version>2.6.0</version>
</dependency>

// Here is code to read and write.

// Write: convert each JsonObject to an Avro GenericRecord, then write Parquet via FileIO.
PCollection<JsonObject> input = ...; // your data
PCollection<GenericRecord> pgr = input
    .apply("parse json", ParDo.of(new DoFn<JsonObject, GenericRecord>() {
        @ProcessElement
        public void processElement(ProcessContext context) {
            JsonObject json = context.element();
            GenericRecord record = ...; // convert json to a GenericRecord that follows the schema
            context.output(record);
        }
    }))
    // Beam cannot infer a coder for GenericRecord, so set one from the Avro schema.
    .setCoder(AvroCoder.of(schema));
pgr.apply(FileIO.<GenericRecord>write().via(ParquetIO.sink(schema)).to("path/to/save"));

// Read: load Parquet files back as GenericRecords using the same Avro schema.
PCollection<GenericRecord> data = pipeline.apply(
    ParquetIO.read(schema).from("path/to/read"));
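
The snippet above leaves the input and the JSON-to-GenericRecord conversion open. As a rough sketch of how those pieces might fit together, the following assumes several things that are not in the original answer: the input holds one JSON object per line, Gson is used for parsing, the Avro schema is a hypothetical two-field Person record, and the input path is a placeholder. It also assumes a Beam version in line with the 2.6.0 dependency shown earlier, where AvroCoder lives in org.apache.beam.sdk.coders; newer releases move it to an Avro extension module.

```java
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class JsonToParquet {
    // Hypothetical schema, for illustration only; replace with your own.
    static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"Person\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}";

    // Converts one line of JSON into an Avro record that follows the schema.
    static GenericRecord toRecord(String line, Schema schema) {
        JsonObject json = new JsonParser().parse(line).getAsJsonObject();
        GenericRecord record = new GenericData.Record(schema);
        record.put("name", json.get("name").getAsString());
        record.put("age", json.get("age").getAsInt());
        return record;
    }

    // Keeps the schema as a string and parses it in @Setup, so the DoFn
    // stays serializable even when Schema itself is not.
    static class ParseJsonFn extends DoFn<String, GenericRecord> {
        private transient Schema schema;

        @Setup
        public void setup() {
            schema = new Schema.Parser().parse(SCHEMA_JSON);
        }

        @ProcessElement
        public void processElement(ProcessContext context) {
            context.output(toRecord(context.element(), schema));
        }
    }

    public static void main(String[] args) {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        Pipeline pipeline =
            Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        pipeline
            // "path/to/input.json" is a placeholder path.
            .apply("read json lines", TextIO.read().from("path/to/input.json"))
            .apply("parse json", ParDo.of(new ParseJsonFn()))
            // Beam cannot infer a coder for GenericRecord; set it explicitly.
            .setCoder(AvroCoder.of(schema))
            .apply("write parquet", FileIO.<GenericRecord>write()
                .via(ParquetIO.sink(schema))
                .to("path/to/save")
                .withSuffix(".parquet"));

        pipeline.run().waitUntilFinish();
    }
}
```

Reading lines as String and parsing inside the DoFn avoids having to supply a coder for a PCollection<JsonObject>, which is one reason this sketch differs from the answer's PCollection<JsonObject> input.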


Source: https://stackoverflow.com/questions/51168918/example-to-read-and-write-parquet-file-using-parquetio-through-apache-beam
