Parquet vs ORC vs ORC with Snappy

暗喜 2020-12-12 09:28

I am running a few tests on the storage formats available with Hive and using Parquet and ORC as major options. I included ORC once with default compression and once with Sn

5 answers
  •  轮回少年
    2020-12-12 10:25

    You are seeing this because:

    • Hive has a vectorized ORC reader but no vectorized Parquet reader.

    • Spark has a vectorized Parquet reader but no vectorized ORC reader.

    • As a result, Spark performs best with Parquet, and Hive performs best with ORC.

    I've seen similar differences when running ORC and Parquet with Spark.

    Vectorization means that rows are decoded in batches, dramatically improving memory locality and cache utilization.

    (correct as of Hive 2.0 and Spark 2.1)
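
    To make the idea of batch decoding concrete, here is a minimal Python sketch. It is not how Hive or Spark implement their readers; the function names and the int32 column layout are assumptions chosen purely for illustration. It contrasts decoding one value per call with decoding a whole batch per call from the same contiguous buffer.

```python
import struct


def encode_column(values):
    """Pack a list of ints into one contiguous little-endian int32 buffer."""
    return struct.pack(f"<{len(values)}i", *values)


def decode_row_at_a_time(buf):
    """Decode one value per call, as a non-vectorized reader would."""
    out = []
    for offset in range(0, len(buf), 4):
        (value,) = struct.unpack_from("<i", buf, offset)
        out.append(value)
    return out


def decode_in_batches(buf, batch_size=1024):
    """Decode up to batch_size values per call, as a vectorized reader would."""
    out = []
    step = batch_size * 4  # 4 bytes per int32
    for start in range(0, len(buf), step):
        chunk = buf[start:start + step]
        out.extend(struct.unpack(f"<{len(chunk) // 4}i", chunk))
    return out


column = encode_column(list(range(10_000)))
# Both strategies must produce identical values; only the call pattern differs.
assert decode_row_at_a_time(column) == decode_in_batches(column)
```

    Both functions return the same values. The batched version simply makes far fewer decode calls over contiguous memory, which is the property a vectorized reader exploits.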
