Parquet vs ORC vs ORC with Snappy

暗喜 2020-12-12 09:28

I am running a few tests on the storage formats available with Hive and using Parquet and ORC as major options. I included ORC once with default compression and once with Sn

5 answers

轮回少年 (original poster)

2020-12-12 10:25
You are seeing this because:
- Hive has a vectorized ORC reader but no vectorized Parquet reader.
- Spark has a vectorized Parquet reader but no vectorized ORC reader.
- As a result, Spark performs best with Parquet, and Hive performs best with ORC.
I've seen similar differences when running ORC and Parquet with Spark.

Vectorization means that rows are decoded in batches, dramatically improving memory locality and cache utilization.
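
To make the idea concrete, here is a toy sketch in Python of row-at-a-time versus batched decoding of a column of 4-byte integers. This is not how Hive's or Spark's readers are implemented; the function names, the fixed-width little-endian layout, and the batch size are all assumptions made for illustration. It only shows the structural difference: one decode call per value versus one decode call per contiguous batch.

```python
import struct
from typing import List

INT_WIDTH = 4  # assumed layout: 4-byte little-endian signed ints


def decode_row_at_a_time(buf: bytes) -> List[int]:
    """Hypothetical non-vectorized reader: one unpack call per value."""
    out = []
    for offset in range(0, len(buf), INT_WIDTH):
        (value,) = struct.unpack_from("<i", buf, offset)
        out.append(value)
    return out


def decode_in_batches(buf: bytes, batch_size: int = 1024) -> List[int]:
    """Hypothetical vectorized reader: one unpack call per batch of values."""
    out = []
    step = batch_size * INT_WIDTH
    for start in range(0, len(buf), step):
        chunk = buf[start:start + step]   # contiguous slice of the column
        count = len(chunk) // INT_WIDTH   # last batch may be shorter
        out.extend(struct.unpack(f"<{count}i", chunk))
    return out


# Sample column data (made up for this sketch).
values = list(range(-5, 5000))
encoded = struct.pack(f"<{len(values)}i", *values)
```

Both functions return the same values; the batched version simply does far fewer per-value dispatches and walks each chunk of memory in one pass, which is the property the real vectorized readers exploit at a much lower level.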

(correct as of Hive 2.0 and Spark 2.1)