Regarding the Hive commands that do not invoke underlying MapReduce jobs

Question

My understanding is that Hive is a SQL-like language that performs database-related tasks by invoking underlying MapReduce programs. However, I learned that some Hive commands do not invoke a MapReduce job. I am curious which commands these are, and why they do not need to invoke a MapReduce job.

Answer 1:

You are right, Hive uses MR jobs in the background to process the data. When you run a SQL-like query in Hive, it converts the query into one or more MR jobs in the background and gives you the result.

Having said that, there are a few queries that don't need MR jobs, for example:

SELECT * FROM table LIMIT 10;

In the above query we don't need any data processing. All we need is to read a few rows from the table.

So the above Hive query doesn't fire an MR job.

But if we modify the above query slightly:

SELECT COUNT(*) FROM table;

It will fire MR jobs, because this query needs to read all the data, and an MR job will do that for us quickly (parallel processing).
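One way to check which case a query falls into is to inspect its plan with EXPLAIN. This is a minimal sketch, assuming a hypothetical table named my_table and the MapReduce execution engine; the exact plan text varies with the Hive version and settings.

```sql
-- my_table is a hypothetical table name used only for illustration.

-- Typically shows only a fetch stage, since the rows can be read directly.
EXPLAIN SELECT * FROM my_table LIMIT 10;

-- Typically shows a map-reduce stage, since every row has to be counted.
EXPLAIN SELECT COUNT(*) FROM my_table;
```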

Answer 2:

Since a Hive table is stored as files in HDFS, Hive saves processing time and effort for operations like 'SELECT *' and 'SELECT * ... LIMIT' by avoiding MapReduce calls and fetching the whole file, or part of it, directly from HDFS and displaying it to the user.

This default behavior can be changed by setting the hive.fetch.task.conversion property in hive-site.xml, so that MapReduce programs are invoked for all operations.
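As a sketch of that setting, the same property can also be set for a single session with SET instead of editing hive-site.xml. The values none, minimal, and more are the ones Hive documents for this property; my_table is a hypothetical table name.

```sql
-- Disable fetch-task conversion: even a simple SELECT runs as a job.
SET hive.fetch.task.conversion=none;
SELECT * FROM my_table LIMIT 10;

-- Allow conversion for simple SELECT, filter, and LIMIT queries.
SET hive.fetch.task.conversion=more;
SELECT * FROM my_table LIMIT 10;
```

Which value is the default depends on the Hive version, so it is worth checking the current value with `SET hive.fetch.task.conversion;` before changing it.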

Source: https://stackoverflow.com/questions/29337451/regarding-the-hive-commands-that-do-not-invoke-underlying-mapreduce-jobs

Tags

Hadoop

MapReduce

Hive