问题
I have a hive table, htable that's partitioned on foo and bar. I want to create a small subset of this table for experiments, so I would think the thing to do would be
create table new_table like htable;
insert into new_table partition (foo, bar) select * from htable
where rand() < 0.01 and foo in (a,b)
This takes forever however and finally fails with a java.lang.OutOfMemoryError: Java heap space. Is there a better way?
回答1:
Add distribute by foo, bar:
insert into new_table partition (foo, bar) select * from htable
where rand() < 0.01 and foo in (a,b)
distribute by foo, bar
this will reduce memory consumption.
来源:https://stackoverflow.com/questions/39272906/hive-how-to-efficiently-create-table-as-select