Avro vs. Parquet

后端 未结 7 1000
自闭症患者
自闭症患者 2020-12-07 09:39

I\'m planning to use one of the hadoop file format for my hadoop related project. I understand parquet is efficient for column based query and avro for full

7条回答
  •  臣服心动
    2020-12-07 10:12

    If you haven't already decided, I'd go ahead and write Avro schemas for your data. Once that's done, choosing between Avro container files and Parquet files is about as simple as swapping out e.g.,

    job.setOutputFormatClass(AvroKeyOutputFormat.class);
    AvroJob.setOutputKeySchema(MyAvroType.getClassSchema());
    

    for

    job.setOutputFormatClass(AvroParquetOutputFormat.class);
    AvroParquetOutputFormat.setSchema(job, MyAvroType.getClassSchema());
    

    The Parquet format does seem to be a bit more computationally intensive on the write side--e.g., requiring RAM for buffering and CPU for ordering the data etc. but it should reduce I/O, storage and transfer costs as well as make for efficient reads especially with SQL-like (e.g., Hive or SparkSQL) queries that only address a portion of the columns.

    In one project, I ended up reverting from Parquet to Avro containers because the schema was too extensive and nested (being derived from some fairly hierarchical object-oriented classes) and resulted in 1000s of Parquet columns. In turn, our row groups were really wide and shallow which meant that it took forever before we could process a small number of rows in the last column of each group.

    I haven't had much chance to use Parquet for more normalized/sane data yet but I understand that if used well, it allows for significant performance improvements.

提交回复
热议问题