Regarding the Hive commands that do not invoke underlying MapReduce jobs

Submitted by 倖福魔咒の on 2020-01-17 12:10:12

Question


My understanding is that Hive is an SQL-like language that performs database-related tasks by invoking underlying MapReduce programs. However, I learned that some Hive commands do not invoke a MapReduce job. I am curious to know which commands these are, and why they do not need to invoke a MapReduce job.


Answer 1:


You are right, Hive uses MR jobs in the background to process the data. When you fire a SQL-like query in Hive, it converts the query into one or more MR jobs in the background and gives you the result.

Having said that, there are a few queries that don't need MR jobs, e.g.:

SELECT * FROM table LIMIT 10;

The above query needs no data processing. All we need is to read a few rows from the table.

So the above Hive query doesn't fire an MR job.

But if we slightly modify the above query:

SELECT COUNT(*) FROM table;

it will fire MR jobs, because this query needs to read all the data, and an MR job does that quickly through parallel processing.
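
One way to check which path a query takes is to ask Hive for its plan with EXPLAIN. This is a minimal sketch, assuming a hypothetical table named my_table (not from the original post); the exact plan text varies with the Hive version and settings.

```sql
-- my_table is a hypothetical table name used only for illustration.

-- When the query is served by a direct fetch, the plan is expected to
-- contain only a fetch stage (Stage-0 with a Fetch Operator).
EXPLAIN SELECT * FROM my_table LIMIT 10;

-- An aggregate normally adds an execution stage (Stage-1) that the fetch
-- stage depends on; on the MapReduce engine that stage is an MR job.
EXPLAIN SELECT COUNT(*) FROM my_table;
```

If the first plan shows only a fetch stage, Hive reads the rows straight from the table's files instead of launching a job.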




Answer 2:


Since a Hive table is stored as files in HDFS, Hive saves processing time and effort for operations like 'Select *' and 'Select * limit' by avoiding MapReduce calls and directly fetching the whole file, or part of it, from HDFS and displaying it to the user.

This default behavior can be changed: setting the hive.fetch.task.conversion property in hive-site.xml to none makes Hive invoke MapReduce programs for all operations.
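
The property can also be tried per session instead of editing hive-site.xml. The sketch below assumes a hypothetical table named my_table; the value descriptions in the comments are paraphrased from the Hive configuration documentation, and the default value differs between Hive versions.

```sql
-- my_table is a hypothetical table name used only for illustration.

-- Show the current value of the property.
SET hive.fetch.task.conversion;

-- none: disable fetch task conversion, so even a simple SELECT launches a job.
SET hive.fetch.task.conversion=none;
SELECT * FROM my_table LIMIT 10;

-- minimal: SELECT *, filters on partition columns, and LIMIT only.
-- more: also covers column projections and filters on other columns.
SET hive.fetch.task.conversion=more;
SELECT * FROM my_table LIMIT 10;
```

Running the same query under both settings and comparing the console output is a quick way to see whether a job was launched.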



Source: https://stackoverflow.com/questions/29337451/regarding-the-hive-commands-that-do-not-invoke-underlying-mapreduce-jobs
