Why does the query performance differ with nested columns in Spark SQL?


Question


I write some data in the Parquet format using Spark SQL, where the resulting schema looks like the following:

root
|-- stateLevel: struct (nullable = true)
|    |-- count1: integer (nullable = false)
|    |-- count2: integer (nullable = false)
|    |-- count3: integer (nullable = false)
|    |-- count4: integer (nullable = false)
|    |-- count5: integer (nullable = false)
|-- countryLevel: struct (nullable = true)
|    |-- count1: integer (nullable = false)
|    |-- count2: integer (nullable = false)
|    |-- count3: integer (nullable = false)
|    |-- count4: integer (nullable = false)
|    |-- count5: integer (nullable = false)
|-- global: struct (nullable = true)
|    |-- count1: integer (nullable = false)
|    |-- count2: integer (nullable = false)
|    |-- count3: integer (nullable = false)
|    |-- count4: integer (nullable = false)
|    |-- count5: integer (nullable = false)

I can also transform the same data into a flatter schema (a sketch of the transformation follows the schema below) that looks like this:

root
|-- stateLevelCount1: integer (nullable = false)
|-- stateLevelCount2: integer (nullable = false)
|-- stateLevelCount3: integer (nullable = false)
|-- stateLevelCount4: integer (nullable = false)
|-- stateLevelCount5: integer (nullable = false)
|-- countryLevelCount1: integer (nullable = false)
|-- countryLevelCount2: integer (nullable = false)
|-- countryLevelCount3: integer (nullable = false)
|-- countryLevelCount4: integer (nullable = false)
|-- countryLevelCount5: integer (nullable = false)
|-- globalCount1: integer (nullable = false)
|-- globalCount2: integer (nullable = false)
|-- globalCount3: integer (nullable = false)
|-- globalCount4: integer (nullable = false)
|-- globalCount5: integer (nullable = false)
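
The flattening itself is just a select with aliases. Here is a minimal sketch of how I do it, assuming the nested data is in a DataFrame called nestedDf (the names and paths are placeholders, not my exact code):

import org.apache.spark.sql.functions.col

// nestedDf holds the data with the struct schema shown above.
val flatDf = nestedDf.select(
  col("stateLevel.count1").alias("stateLevelCount1"),
  col("stateLevel.count2").alias("stateLevelCount2"),
  // ... the remaining stateLevel and countryLevel fields follow the same pattern ...
  col("global.count1").alias("globalCount1"),
  col("global.count2").alias("globalCount2")
)

// Write each layout out as its own Parquet data set.
flatDf.write.parquet("/data/counts_flat")
nestedDf.write.parquet("/data/counts_nested")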

Now, when I run a query on the first data set on a column like global.count1, it takes a lot longer than querying globalCount1 in the second data set. Conversely, writing the first data set to Parquet takes a lot less time than writing the second. I know that my data is stored in a columnar fashion because of Parquet, but I was expecting all the nested columns to be stored individually as well. In the first data set, for instance, it seems that the whole 'global' column is being stored together, as opposed to the 'global.count1', 'global.count2', etc. values being stored together. Is this expected behavior?
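
For illustration, the kind of query I am comparing looks roughly like this (the paths and SQLContext setup are placeholders, not my exact code):

import org.apache.spark.sql.functions.{col, sum}

val nested = sqlContext.read.parquet("/data/counts_nested")
val flat = sqlContext.read.parquet("/data/counts_flat")

// Same aggregation over one leaf field, once per layout.
nested.agg(sum(col("global.count1"))).show()  // noticeably slower
flat.agg(sum(col("globalCount1"))).show()     // noticeably faster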


Answer 1:


Interesting. You say "it takes a lot longer than querying"; can you please share how much longer? Thanks.

Looking at the code (https://github.com/Parquet/parquet-mr/blob/master/parquet-column/src/main/java/parquet/io/RecordReaderImplementation.java#L248), it seems that reading from structs might have some overhead. Judging from the Parquet code alone, though, it shouldn't be "a lot longer".

I think the bigger problem is whether Spark can push down predicates in such cases; for example, it may not be able to use bloom filters. Can you please share how you query the data in both cases, along with the timings? Also, which versions of Spark, Parquet, Hadoop, etc. are you using?

Parquet 1.5 had an issue (https://issues.apache.org/jira/browse/PARQUET-61) which in some such cases could cause a 4-5x slowdown.
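
One quick way to check the pushdown behaviour (a sketch reusing the column names from the question; the paths and DataFrames are assumptions) is to look at the physical plan for a filter on each layout:

import org.apache.spark.sql.functions.col

val nested = sqlContext.read.parquet("/data/counts_nested")
val flat = sqlContext.read.parquet("/data/counts_flat")

// The flat column usually shows up under "PushedFilters" in the Parquet scan node.
flat.filter(col("globalCount1") > 100).explain()

// Depending on the Spark/Parquet versions, a filter on a nested field may not be
// pushed down at all, so the whole 'global' struct is read and filtered inside Spark.
nested.filter(col("global.count1") > 100).explain()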



Source: https://stackoverflow.com/questions/39622787/why-does-the-query-performance-differ-with-nested-columns-in-spark-sql
