Hive query performance is slow when using Hive date functions instead of hardcoded date strings?

徘徊边缘 提交于 2019-12-21 20:54:08

问题


I have a transaction table table_A that gets updated every day. Every day I insert new data into table_A from external table_B using the file_date field to filter the necessary data from external table_B to insert into table_A. However, there's a huge performance difference if I use a hardcoded date vs. using the Hive date functions:

-- Fast version (~20 minutes)
SET date_ingest = '2016-12-07';
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.dynamic.partition = TRUE;

INSERT
    INTO
        TABLE
            table_A PARTITION (FILE_DATE) SELECT
                    id, eventtime
                    ,CONCAT_WS( '-' ,substr ( eventtime ,0 ,4 ) ,SUBSTRING( eventtime ,5 ,2 ) ,SUBSTRING( eventtime ,7 ,2 ) )
                FROM
                    table_B
                WHERE
                    file_date = ${hiveconf:date_ingest}
;

compared to:

-- Slow version (~9 hours)
SET date_ingest = date_add(to_date(from_unixtime( unix_timestamp( ) )),-1);
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.dynamic.partition = TRUE;

INSERT
    INTO
        TABLE
            table_A PARTITION (FILE_DATE) SELECT
                    id, eventtime
                    ,CONCAT_WS( '-' ,substr ( eventtime ,0 ,4 ) ,SUBSTRING( eventtime ,5 ,2 ) ,SUBSTRING( eventtime ,7 ,2 ) )
                FROM
                    table_B
                WHERE
                    file_date = ${hiveconf:date_ingest}
;

Has anyone experienced similar issues? You should assume that I don't have access to the Unix hive command (i.e. can't use --hiveconf options) since we're using a third party UI.


回答1:


Sometimes partition pruning does not work when using functions in filter clause. If you calculate the variable in the wrapper shell script and pass it as -hiveconf variable to the Hive, it will work fine. Example:

#inside shell script
date_ingest=$(date -d '-1 day' +%Y-%m-%d)
hive -f your_script.hql -hiveconf date_ingest="$date_ingest" 

Then use it inside Hive script as WHERE file_date ='${hiveconf:date_ingest}'



来源:https://stackoverflow.com/questions/41048132/hive-query-performance-is-slow-when-using-hive-date-functions-instead-of-hardcod

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!