Is there a proven performance difference between joins in Hive on INT/BIGINT versus VARCHAR?

徘徊边缘 posted on 2019-12-08 17:38:27

Question


For years I have been reading and hearing about the 'performance advantage' that database joins on BIGINT columns have over joins on (VAR)CHAR columns.

Unfortunately, when looking for real answers or advice on similar questions:

  • The examples used are in a 'traditional' RDBMS context, like MySQL or Oracle / SQL Server. Take for instance this question or this example
  • The answer is quite old and the resulting difference in runtime is not that great. Again, see this example

I have not seen an example using a version of Hive (preferably version 1.2.1 or higher) where a large (big-data-ish) data set (let us say 500 million+ rows) is joined to a similarly sized data set on:

  1. a BIGINT column
  2. versus a (VAR)CHAR(32) column
  3. versus a (VAR)CHAR(255) column

I am choosing a size of 32 because it is the length of an MD5 hash written as hexadecimal characters, and 255 because it is in the range of the largest natural key I have ever seen.
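To make the key widths concrete, here is a minimal Python sketch, not part of any Hive benchmark, that checks the length of a hex-encoded MD5 digest against the storage size of a 64-bit integer. The helper `md5_hex_key` and the sample key string are hypothetical names made up for illustration.

```python
import hashlib
import struct


def md5_hex_key(natural_key: str) -> str:
    """Return the MD5 digest of a natural key as a hexadecimal string."""
    return hashlib.md5(natural_key.encode("utf-8")).hexdigest()


# Hypothetical natural key, used only for illustration.
key = md5_hex_key("customer-42|2019-12-08")

md5_chars = len(key)                   # characters in the hex-encoded digest
bigint_bytes = struct.calcsize("<q")   # bytes in a signed 64-bit integer

print(md5_chars, bigint_bytes)
```

So an MD5-based join key carries 32 characters per row, while a BIGINT key carries 8 bytes, before any columnar encoding or compression is applied. Whether that raw difference translates into a measurable join-time difference in Hive is exactly what this question asks.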

Furthermore, I would expect Hive:

  • to run under the Tez engine
  • to use a compressed file format such as ORC with ZLIB or Snappy

Does anyone know of such an example, substantiated with proof in the form of Hive EXPLAIN plans, CPU, file and network resource usage, and query runtimes?

Source: https://stackoverflow.com/questions/39247433/is-there-a-proven-performance-difference-between-joins-in-hive-on-int-bigint-ver
