Hive buckets: understanding TABLESAMPLE(BUCKET x OUT OF y)

Posted by 元气小坏坏 on 2020-06-25 10:28:28

Question


Hi, I am very new to Hive. I have gone through the buckets concept in Hadoop in Action, but I failed to understand the lines below. Can anyone help me with this?

SELECT avg(viewTime)
 FROM page_view TABLESAMPLE(BUCKET 1 OUT OF 32);

The general syntax for TABLESAMPLE is TABLESAMPLE(BUCKET x OUT OF y)

The sample size for the query is around 1/y. In addition, y needs to be a multiple or factor of the number of buckets specified for the table at table creation time. For example, if we change y to 16, the query becomes

SELECT avg(viewTime)
 FROM page_view TABLESAMPLE(BUCKET 1 OUT OF 16);

Then the sample size includes approximately 1 out of every 16 users (as the bucket column is userid). The table still has 32 buckets, but Hive tries to satisfy this query by processing buckets 1 and 17 together. On the other hand, if y is specified to be 64, Hive will execute the query on half of the data in one bucket. The value of x is only used to select which bucket to use. Under truly random sampling its value shouldn’t matter.


Answer 1:


Which part of it don't you understand?

When you create the table and bucket it into 32 buckets (as an example) using the CLUSTERED BY clause, Hive assigns each row to one of the 32 buckets using a deterministic hash function on the bucketing column. Then, when you use TABLESAMPLE(BUCKET x OUT OF y), Hive divides your buckets into groups of y buckets and picks the x'th bucket of each group. For example:

  • If you use TABLESAMPLE(BUCKET 6 OUT OF 8), Hive divides your 32 buckets into 4 groups of 8 buckets and picks the 6th bucket of each group, so it reads buckets 6, 14, 22 and 30.

  • If you use TABLESAMPLE(BUCKET 23 OUT OF 32), Hive divides your 32 buckets into groups of 32, which gives only 1 group, and picks the 23rd bucket as your result.

  • If you use TABLESAMPLE(BUCKET 3 OUT OF 64), y is larger than the number of buckets, so Hive conceptually splits each of the 32 buckets into 2 "half-buckets", which gives 1 group of 64 half-buckets. It then picks the 3rd of them, which is one half of the 3rd full bucket.
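
If you want to check the arithmetic in these examples yourself, here is a small Python sketch. It is not Hive code, only a simulation under two assumptions: a row whose hash is h is stored in table bucket (h mod 32) + 1, and TABLESAMPLE(BUCKET x OUT OF y) keeps the rows with h mod y = x - 1. The function name sampled_buckets is made up for illustration.

```python
from fractions import Fraction
from math import gcd


def sampled_buckets(x, y, num_buckets):
    """Map each table bucket touched by TABLESAMPLE(BUCKET x OUT OF y)
    to the fraction of that bucket the sample reads.

    Assumption (not taken from Hive source): rows are placed by
    hash % num_buckets and sampled by hash % y == x - 1.
    """
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    if num_buckets % y != 0 and y % num_buckets != 0:
        raise ValueError("y must be a multiple or factor of num_buckets")

    # Hash residues repeat every lcm(y, num_buckets), so one period is enough.
    period = y * num_buckets // gcd(y, num_buckets)
    hits = {}
    for h in range(period):
        if h % y == x - 1:                # row belongs to sample bucket x
            bucket = h % num_buckets + 1  # 1-based table bucket holding it
            hits[bucket] = hits.get(bucket, 0) + 1

    residues_per_bucket = period // num_buckets
    return {b: Fraction(n, residues_per_bucket) for b, n in sorted(hits.items())}


print(sampled_buckets(6, 8, 32))    # buckets 6, 14, 22, 30, each read in full
print(sampled_buckets(23, 32, 32))  # bucket 23, read in full
print(sampled_buckets(3, 64, 32))   # bucket 3, half of it
```

Under the same assumptions, sampled_buckets(1, 16, 32) gives buckets 1 and 17, which matches the sentence quoted from the book in the question.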



Source: https://stackoverflow.com/questions/18781869/hive-buckets-understanding-tablesamplebucket-x-out-of-y
