How does Impala provide faster query response compared to Hive?

盖世英雄少女心 2020-12-22 17:56

I have recently started looking into querying large sets of CSV data lying on HDFS using Hive and Impala. As I was expecting, I get better response time with Impala compared

5 answers
  •  再見小時候
    2020-12-22 18:19

    I can think of the following reasons why Impala is faster, especially on complex SELECT statements.

    • Cloudera Impala is a native query engine, so it avoids the startup overhead commonly seen in MapReduce/Tez-based jobs (MapReduce programs take time before all nodes are running at full capacity). In Hive, every query pays this “cold start” cost, whereas the Impala daemon processes are started at boot time and are always ready to process a query.
    • For tables with a large volume of data and/or many partitions, retrieving all of the table's metadata can be time-consuming, taking minutes in some cases. Each Impala node therefore caches this metadata and reuses it for later queries against the same table.
    • Apache Hive is fault tolerant, whereas Impala does not support fault tolerance; this is a trade-off that comes with Impala's speed rather than a speed-up in itself. If a DataNode goes down while a Hive query is being executed, the query still produces its output. That is not the case with Impala: if a query fails during execution, it has to be started all over again.
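
Since a failed Impala query has to be started all over again, one option is to handle the restart on the client side. The following is a minimal sketch, not part of Impala itself: `run_with_retry` and its `run_query` argument are hypothetical names, where `run_query` stands for whatever function submits the query through your client and raises an exception on failure.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(run_query: Callable[[], T],
                   max_attempts: int = 3,
                   backoff_seconds: float = 1.0) -> T:
    """Re-run a whole query from the start if it fails.

    run_query is a hypothetical zero-argument callable that submits the
    query and returns its result, raising an exception on failure.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait a little longer after each failure before resubmitting.
            time.sleep(backoff_seconds * attempt)
    raise AssertionError("unreachable")
```

This only suits queries that are safe to repeat, such as plain SELECT statements; a statement that writes data may need extra care before it is resubmitted.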
