Question
I am using AWS EMR and have created external tables pointing to an S3 location. The "INSERT INTO TABLE" and "INSERT OVERWRITE" statements are very slow when the destination table is an external table pointing to S3. The main issue is that Hive first writes the data to a staging directory and then moves it to the final location.
Does anyone have a better solution for this? Using S3 is really slowing down our jobs.
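To make the pattern concrete, here is a minimal sketch of the kind of statement that is slow for us (the table and bucket names in this sketch are hypothetical; the real tables are shown further down):

-- Hypothetical destination table on S3, only to illustrate the slow pattern
CREATE EXTERNAL TABLE demo_s3 (id INT, val STRING)
STORED AS PARQUET
LOCATION 's3://my-demo-bucket/demo_s3/';

-- demo_source is a hypothetical source table. The map/reduce work finishes
-- quickly; the delay comes afterwards, when Hive moves the files from its
-- staging/scratch directory into the S3 location above.
INSERT OVERWRITE TABLE demo_s3
SELECT id, val FROM demo_source;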
Cloudera recommends using the setting hive.mv.files.threads, but it looks like this setting is not available in the Hive provided with EMR, or in Apache Hive.
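A quick way I know of to check whether a property is recognized by a given Hive build is to SET it without a value; Hive prints the current value if the property is defined and reports it as undefined otherwise. Shown below for both spellings I have seen of this property:

SET hive.mv.files.thread;
-- prints e.g. "hive.mv.files.thread=15" if the build defines it,
-- or "hive.mv.files.thread is undefined" otherwise
SET hive.mv.files.threads;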
OK, I will try to provide more details.
Below is my source table structure:
CREATE EXTERNAL TABLE ORDERS (
O_ORDERKEY INT,
O_CUSTKEY INT,
O_ORDERSTATUS STRING,
O_TOTALPRICE DOUBLE,
O_ORDERDATE DATE,
O_ORDERPRIORITY STRING,
O_CLERK STRING,
O_SHIPPRIORITY INT,
O_COMMENT STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://raw-tpch/orders/';
Below is the structure of the destination table:
CREATE EXTERNAL TABLE ORDERS_PARQ (
O_ORDERKEY INT,
O_CUSTKEY INT,
O_ORDERSTATUS STRING,
O_TOTALPRICE decimal(12,2),
O_ORDERPRIORITY STRING,
O_CLERK STRING,
O_SHIPPRIORITY INT,
O_COMMENT STRING)
partitioned by (O_ORDERDATE string)
STORED AS PARQUET
LOCATION 's3://parquet-tpch/orders/';
The source table contains orders data for 2,400 days and is about 100 GB in size, so the destination table is expected to have 2,400 partitions. I executed the insert statement below.
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.reducers.bytes.per.reducer=500000000;
set hive.optimize.sort.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=10000;
set hive.exec.max.dynamic.partitions.pernode=2000;
set hive.load.dynamic.partitions.thread=20;
set hive.mv.files.thread=25;
set hive.blobstore.optimizations.enabled=false;
set parquet.compression=snappy;
INSERT into TABLE orders_parq partition(O_ORDERDATE)
SELECT O_ORDERKEY, O_CUSTKEY,
O_ORDERSTATUS, O_TOTALPRICE,
O_ORDERPRIORITY, O_CLERK,
O_SHIPPRIORITY, O_COMMENT,
O_ORDERDATE from orders;
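For reference, this is how one could verify the number of partitions created once the statement finishes (a hedged check, not part of the job itself):

-- List the partitions actually created in the destination table
SHOW PARTITIONS orders_parq;
-- Or compare against the number of distinct order dates in the source
SELECT COUNT(DISTINCT O_ORDERDATE) FROM orders;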
The query completes its map and reduce phases in 10 minutes but takes a lot of time to move the data from /tmp/hive/hadoop/b0eac2bb-7151-4e29-9640-3e7c15115b60/hive_2018-02-15_15-02-32_051_5904274475440081364-1/-mr-10001 to the destination S3 path.
If I set hive.blobstore.optimizations.enabled=false, it instead takes a long time to move the data from the Hive staging directory to the destination table directory.
Surprisingly, I found one more issue: even though I set the compression to Snappy, the output table size is 108 GB, which is more than the 100 GB raw input text file.
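In case it helps anyone reproduce this, two hedged checks I can think of (both are assumptions on my part, not verified fixes): declaring the compression as a table property instead of relying on the session setting, and inspecting one of the output files to see which codec was actually written.

-- Request Snappy as a table property (may or may not be honored depending on
-- the Hive version; this is an assumption, not a confirmed fix):
ALTER TABLE orders_parq SET TBLPROPERTIES ('parquet.compression'='SNAPPY');

-- Inspect an output file with parquet-tools to see the codec actually used
-- (the file path below is hypothetical):
-- hadoop jar parquet-tools-*.jar meta s3://parquet-tpch/orders/o_orderdate=1992-01-01/000000_0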
Source: https://stackoverflow.com/questions/48797302/hive-insert-overwrite-and-insert-into-are-very-slow-with-s3-external-table