Is querying against a Spark DataFrame based on CSV faster than one based on Parquet?

霸气de小男生 posted on 2019-11-30 10:19:33

CSV is a row-oriented format, while Parquet is a column-oriented format.

Row-oriented formats are typically more efficient for queries that must access most of the columns or that read only a fraction of the rows. Column-oriented formats, on the other hand, are usually more efficient for queries that need to read most of the rows but access only a fraction of the columns. Analytical queries typically fall into the latter category, while transactional queries more often fall into the former.
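To make the trade-off concrete, here is a minimal sketch in plain Python. It is only an in-memory analogy, not how Spark, CSV, or Parquet store data; the field names, values, and function names are made up for illustration. On disk, the row layout forces an aggregate over one field to read past every other field, while the column layout reads just the one list it needs.

```python
# Toy data: 3 records with 4 fields each (all names and values are invented).
FIELDS = ("id", "name", "price", "qty")

# Row-oriented layout: all fields of one record sit together.
row_store = [
    (1, "alice", 10.0, 3),
    (2, "bob", 20.0, 1),
    (3, "carol", 5.0, 7),
]

# Column-oriented layout: all values of one field sit together.
column_store = {
    "id": [1, 2, 3],
    "name": ["alice", "bob", "carol"],
    "price": [10.0, 20.0, 5.0],
    "qty": [3, 1, 7],
}


def total_price_rows(store):
    """Analytical query on the row layout: walks every record to pick one field."""
    price_index = FIELDS.index("price")
    return sum(record[price_index] for record in store)


def total_price_columns(store):
    """Same query on the column layout: only the 'price' list is read."""
    return sum(store["price"])


def fetch_record_rows(store, position):
    """Point lookup on the row layout: one record, already stored together."""
    return store[position]


def fetch_record_columns(store, position):
    """Point lookup on the column layout: one read per column, then reassemble."""
    return tuple(store[field][position] for field in FIELDS)


print(total_price_rows(row_store), total_price_columns(column_store))
print(fetch_record_rows(row_store, 1), fetch_record_columns(column_store, 1))
```

Both layouts return the same answers; what differs is how much unrelated data each query has to pass over to get there.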

Additionally, CSV is a text-based format, which cannot be parsed as efficiently as a binary format, so reading CSV is slower still. A typical column-oriented format, on the other hand, is not only binary but also compresses more effectively, which leads to smaller disk usage and faster access. I recommend reading the Introduction section of The Design and Implementation of Modern Column-Oriented Database Systems.
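The text-versus-binary point can be sketched with the Python standard library. This is an illustration under simplifying assumptions, not Parquet's actual encoding: one made-up integer column is written once as comma-separated text and once as fixed-width binary with `struct`, and the binary column is then compressed with `zlib`.

```python
import struct
import zlib

# A hypothetical low-cardinality column, similar to a status code or category id.
values = [i % 5 for i in range(1000)]

# Text encoding, as in CSV: every value must be split out and parsed from characters.
text_encoded = ",".join(str(v) for v in values).encode("utf-8")
decoded_from_text = [int(token) for token in text_encoded.decode("utf-8").split(",")]

# Binary encoding: fixed-width little-endian 32-bit integers, decoded in one call.
layout = f"<{len(values)}i"
binary_encoded = struct.pack(layout, *values)
decoded_from_binary = list(struct.unpack(layout, binary_encoded))

# Values of a single column tend to resemble each other, so they compress well.
compressed_column = zlib.compress(binary_encoded)

print(len(text_encoded), len(binary_encoded), len(compressed_column))
```

The text path needs a split and an `int()` call per value, while the binary path decodes the whole column at once. Storing similar values next to each other is also what gives a general-purpose compressor long repeated runs to exploit.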

Since the Hadoop ecosystem is used mainly for analytical queries, Parquet generally performs better than CSV for Hadoop applications.
