Why does a map-only job in Hive result in a single output file?

Question

When I execute the following query, I get only one file as output although I have 8 mappers and 0 reducers.

create table table_2 as select * from table_1;

8 mappers are invoked and there is no reducer phase, yet there is only one file in the location of table_2. Shouldn't there be 8 files, since we have 8 mappers and 0 reducers?

Answer 1:

From the Hive documentation, Configuration Properties...

hive.merge.mapfiles
  Default Value: true
  Merge small files at the end of a map-only job.

hive.merge.tezfiles
  Default Value: false
  Merge small files at the end of a Tez DAG

hive.merge.smallfiles.avgsize
  Default Value: 16000000
  When the average output file size of a job is less than this number,
  Hive will start an additional map-reduce job to merge the output files into bigger files...

So, if (a) your test dataset is very small, so that the average output file size falls below hive.merge.smallfiles.avgsize, and (b) you use plain old MapReduce rather than Tez, then by default Hive runs an extra step after the map phase just to merge the (intermediate) output files of the mappers.

This merge would not happen after a reduce step, unless you set hive.merge.mapredfiles to true.
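
One way to check this explanation is to turn the merge off and re-run the query from the question. The following is a minimal sketch, assuming plain MapReduce execution; table_2_unmerged is a hypothetical table name that is not in the original question, and the numFiles statistic is only populated when statistics gathering (hive.stats.autogather) is enabled.

```sql
-- Disable the merge step that runs at the end of a map-only job
-- (hive.merge.mapfiles defaults to true).
SET hive.merge.mapfiles=false;

-- Same CTAS as in the question, written to a hypothetical new table
-- so the original table_2 is left untouched.
CREATE TABLE table_2_unmerged AS SELECT * FROM table_1;

-- Inspect the table: the Location field shows the output directory,
-- and the table parameters should include numFiles if statistics
-- were gathered for the table.
DESCRIBE FORMATTED table_2_unmerged;
```

With the merge disabled, the table directory should hold one file per mapper that produced output, so 8 files would be expected in the scenario described in the question.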

Source: https://stackoverflow.com/questions/47272492/why-does-a-map-only-job-in-hive-results-in-a-single-output-file

Tags

Hadoop

Hive

MapReduce
