Is there a way to keep the duplicates in a collected set in Hive, or simulate the sort of aggregate collection that Hive provides using some other method? I want to aggregat
Here is the Hive query that does this job (works only in Hive 0.13 or later; your_table is a placeholder for your own table name):
SELECT hash_id, collect_list(num_of_cats) FROM your_table GROUP BY hash_id;
As of Hive 0.13, there is a built-in UDAF called collect_list() that achieves this. Unlike collect_set(), it does not remove duplicates.
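To see the difference between the two built-ins side by side, here is a minimal sketch. The sample_cats CTE and its values are made up for illustration; only hash_id and num_of_cats come from the query above. It assumes Hive 0.13 or later, which as far as I know is also needed for the CTE and for SELECT without FROM.

```sql
-- Hypothetical sample data: the CTE name and the values are illustrative only.
WITH sample_cats AS (
  SELECT 'a' AS hash_id, 4 AS num_of_cats
  UNION ALL
  SELECT 'a' AS hash_id, 4 AS num_of_cats
  UNION ALL
  SELECT 'a' AS hash_id, 2 AS num_of_cats
  UNION ALL
  SELECT 'b' AS hash_id, 1 AS num_of_cats
)
SELECT hash_id,
       collect_set(num_of_cats)  AS distinct_counts,  -- duplicates removed
       collect_list(num_of_cats) AS all_counts        -- duplicates kept
FROM sample_cats
GROUP BY hash_id;
```

For hash_id 'a', all_counts should hold three elements while distinct_counts holds two. Neither function guarantees the order of elements in the resulting array.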
Check out the Brickhouse collect UDAF ( http://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/collect/CollectUDAF.java )
It also supports collecting into a map. Brickhouse also contains many useful UDFs that are not in the standard Hive distribution.
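A rough sketch of how the Brickhouse UDAF might be registered and called. The class name is taken from the source path in the URL above; the jar path, your_table, and the cat_name column are placeholders I made up, and the one-argument (array) versus two-argument (map) behaviour is my understanding of the UDAF, so check it against the Brickhouse source before relying on it.

```sql
-- Assumption: path to a Brickhouse jar you have built or downloaded yourself.
ADD JAR /path/to/brickhouse.jar;

-- Class name derived from brickhouse/udf/collect/CollectUDAF.java.
CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF';

-- One argument: collect the values of each group into an array.
SELECT hash_id, collect(num_of_cats) AS all_counts
FROM your_table          -- placeholder table name
GROUP BY hash_id;

-- Two arguments: collect key/value pairs into a map.
-- cat_name is a hypothetical column used as the map key.
SELECT hash_id, collect(cat_name, num_of_cats) AS counts_by_name
FROM your_table
GROUP BY hash_id;
```

This route is mainly useful on Hive versions older than 0.13, where collect_list() is not available, or when a map result is wanted.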