Why does a Fetch task in Hive work faster than a Map-only task?

Submitted by 冷眼眸甩不掉的悲伤 on 2020-07-29 07:57:45

Question


It is possible to enable a Fetch task in Hive for simple queries, instead of Map or MapReduce, using the hive.fetch.task.conversion parameter.

Please explain why a Fetch task runs much faster than a Map task, especially for simple work (for example select * from table limit 10;). What additional work does the map-only task do in this case? In my case the Fetch task is more than 20 times faster. Both tasks have to read the table data, don't they?


Answer 1:


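A FetchTask fetches the data directly, reading the table files inside the Hive process itself, whereas a Map or MapReduce task invokes a MapReduce job that first has to be submitted, scheduled, and started on the cluster before any rows are read. For a small query such as a LIMIT 10, that job startup overhead dominates the run time.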

<property>
  <name>hive.fetch.task.conversion</name>
  <value>minimal</value>
  <description>
    Some select queries can be converted to a single FETCH task,
    minimizing latency. Currently the query should be single
    sourced, not having any subquery, and should not have
    any aggregations or distincts (which incur RS),
    lateral views and joins.
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more    : SELECT, FILTER, LIMIT only (+TABLESAMPLE, virtual columns)
  </description>
</property>

There is also another parameter, hive.fetch.task.conversion.threshold, whose default is -1 in Hive 0.10-0.13 and 1 GB (1073741824 bytes) in 0.14 and later. With the 1 GB default, if the table size is greater than 1 GB, Hive uses MapReduce instead of a Fetch task.
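To make the interaction of the two parameters concrete, here is a rough Python sketch of the decision described above. It is a simplified model and not Hive's actual planner code: the SimpleQuery class, its field names, and the uses_fetch_task function are hypothetical names invented for illustration. It also assumes that a negative threshold means "no size limit", and it only models the minimal and more modes listed in the description.

```python
from dataclasses import dataclass

# Default threshold in Hive 0.14 and later, per the text above (1 GB in bytes).
DEFAULT_THRESHOLD_BYTES = 1073741824


@dataclass
class SimpleQuery:
    """Hypothetical description of a single-source SELECT (not a Hive class)."""
    select_star: bool = True
    # True also when the query has no filter at all.
    filters_only_on_partition_columns: bool = True
    has_aggregation_or_distinct: bool = False
    has_join_or_lateral_view: bool = False
    has_subquery: bool = False


def uses_fetch_task(query, mode, input_bytes,
                    threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Return True if this simplified model would pick a Fetch task."""
    # Aggregations, distincts, joins, lateral views and subqueries
    # are never converted, whatever the mode.
    if (query.has_aggregation_or_distinct
            or query.has_join_or_lateral_view
            or query.has_subquery):
        return False
    # Assumption: a negative threshold disables the size check.
    if threshold_bytes >= 0 and input_bytes > threshold_bytes:
        return False
    if mode == "minimal":
        # SELECT *, filter on partition columns, LIMIT only.
        return query.select_star and query.filters_only_on_partition_columns
    if mode == "more":
        # Any SELECT, FILTER, LIMIT combination.
        return True
    # Any other mode is treated as "no conversion" in this sketch.
    return False


if __name__ == "__main__":
    query = SimpleQuery()  # models: select * from table limit 10
    print(uses_fetch_task(query, "minimal", 10 * 1024**2))  # 10 MB table
    print(uses_fetch_task(query, "minimal", 2 * 1024**3))   # 2 GB table
```

With the default threshold, the 10 MB table is served by a Fetch task while the 2 GB table falls back to MapReduce. Passing threshold_bytes=-1 models the older default, under which the size check is skipped.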




Source: https://stackoverflow.com/questions/39894681/why-is-fetch-task-in-hive-works-faster-than-map-only-task
