Question
Hi, I am very new to Hive. I have gone through the buckets concept in Hadoop in Action, but failed to understand the lines below. Can anyone help me with this?
SELECT avg(viewTime)
FROM page_view TABLESAMPLE(BUCKET 1 OUT OF 32);
The general syntax for TABLESAMPLE is TABLESAMPLE(BUCKET x OUT OF y). The sample size for the query is around 1/y. In addition, y needs to be a multiple or factor of the number of buckets specified for the table at table creation time. For example, if we change y to 16, the query becomes
SELECT avg(viewTime)
FROM page_view TABLESAMPLE(BUCKET 1 OUT OF 16);
Then the sample size includes approximately 1 out of every 16 users (as the bucket column is userid). The table still has 32 buckets, but Hive tries to satisfy this query by processing buckets 1 and 17 together. On the other hand, if y is specified to be 64, Hive will execute the query on half of the data in one bucket. The value of x is only used to select which bucket to use. Under truly random sampling its value shouldn’t matter.
Answer 1:
Which part of it don't you understand?
When you create the table and bucket it into 32 buckets (as an example) using the CLUSTERED BY clause, Hive assigns each row to one of the 32 buckets using a deterministic hash function. Then, when you use TABLESAMPLE(BUCKET x OUT OF y), Hive divides your buckets into groups of y buckets and picks the x'th bucket of each group. For example:
If you use TABLESAMPLE(BUCKET 6 OUT OF 8), Hive would divide your 32 buckets into groups of 8, resulting in 4 groups of 8 buckets, and then pick the 6th bucket of each group, hence picking buckets 6, 14, 22 and 30.
If you use TABLESAMPLE(BUCKET 23 OUT OF 32), Hive would divide your 32 buckets into groups of 32, resulting in only 1 group of 32 buckets, and then pick the 23rd bucket as your result.
If you use TABLESAMPLE(BUCKET 3 OUT OF 64), Hive would divide your 32 buckets into groups of 64, resulting in 1 group of 64 "half-buckets", and then pick the half-bucket that corresponds to the 3rd full bucket. In other words, it reads bucket 3 and keeps about half of its rows.
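If it helps, the bucket arithmetic from the three examples above can be sketched in a few lines of Python. This is only an illustration of the grouping rule, not Hive's actual code: the function names sampled_buckets and row_in_sample are made up for this sketch, and it assumes y is a factor or a multiple of the bucket count (it simply rejects anything else).

```python
def sampled_buckets(num_buckets, x, y):
    """Illustrative model of TABLESAMPLE(BUCKET x OUT OF y).

    Returns (buckets, fraction): the 1-based bucket numbers that are read
    and the approximate fraction of each of those buckets that is kept.
    Hypothetical helper, not part of Hive.
    """
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    if num_buckets % y == 0:
        # y is a factor of the bucket count: take the x'th bucket
        # of every group of y buckets, and keep each one whole.
        return list(range(x, num_buckets + 1, y)), 1.0
    if y % num_buckets == 0:
        # y is a multiple of the bucket count: only one bucket is read,
        # and only num_buckets / y of its rows are kept.
        bucket = (x - 1) % num_buckets + 1
        return [bucket], num_buckets / y
    # Simplification of this sketch: other values of y are not modelled.
    raise ValueError("y must be a factor or multiple of num_buckets")


def row_in_sample(hash_value, x, y):
    """Row-level view of the same rule, assuming a non-negative hash
    of the bucketing column (an assumption of this sketch)."""
    return hash_value % y == x - 1


print(sampled_buckets(32, 6, 8))
print(sampled_buckets(32, 23, 32))
print(sampled_buckets(32, 3, 64))
```

The row-level check is why both cases fit together: a row lands in the sample when its hash modulo y equals x - 1, and when y divides the bucket count those rows are exactly the rows stored in the selected buckets, so whole buckets can be read without looking at individual rows.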
Source: https://stackoverflow.com/questions/18781869/hive-buckets-understanding-tablesamplebucket-x-out-of-y