hive-configuration

Hive number of reducers in group by and count(distinct)

梦想与她, submitted on 2021-02-04 21:09:57
Question: I was told that count(distinct) may result in data skew because only one reducer is used. I ran a test on a table with 5 billion rows using two queries:
Query A: select count(distinct columnA) from tableA
Query B: select count(columnA) from (select columnA from tableA group by columnA) a
Query A takes about 1000-1500 seconds, while query B takes 500-900 seconds, which seems expected. However, I noticed that both queries use 370 mappers and 1 reducer, and they have almost the
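To check how many stages and reducers each query really gets, the plans can be compared with EXPLAIN. The sketch below reuses the table and column names from the question. The three `set` lines are an assumption about which settings are worth inspecting and do not come from the original post. They only print the current values.

```sql
-- Table and column names (tableA, columnA) are the ones used in the question.
-- Print the current values of settings that influence how many reducers Hive requests.
-- (Inspecting these is a suggestion, not something the original post did.)
set hive.exec.reducers.bytes.per.reducer;
set hive.exec.reducers.max;
set mapreduce.job.reduces;

-- Query A: distinct count expressed as a single global aggregate.
explain
select count(distinct columnA) from tableA;

-- Query B: deduplicate with group by first, then count the remaining rows.
explain
select count(columnA)
from (select columnA from tableA group by columnA) a;
```

Comparing the two plans stage by stage shows where the reduce work happens, which a single reducer count from the job summary may not show.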

Hive number of reducers in group by and count(distinct)

老子叫甜甜, submitted on 2021-02-04 21:09:17
Question: I was told that count(distinct) may result in data skew because only one reducer is used. I ran a test on a table with 5 billion rows using two queries:
Query A: select count(distinct columnA) from tableA
Query B: select count(columnA) from (select columnA from tableA group by columnA) a
Query A takes about 1000-1500 seconds, while query B takes 500-900 seconds, which seems expected. However, I noticed that both queries use 370 mappers and 1 reducer, and they have almost the

Hive number of reducers in group by and count(distinct)

守給你的承諾、, submitted on 2021-02-04 21:09:04
Question: I was told that count(distinct) may result in data skew because only one reducer is used. I ran a test on a table with 5 billion rows using two queries:
Query A: select count(distinct columnA) from tableA
Query B: select count(columnA) from (select columnA from tableA group by columnA) a
Query A takes about 1000-1500 seconds, while query B takes 500-900 seconds, which seems expected. However, I noticed that both queries use 370 mappers and 1 reducer, and they have almost the

Why does the Fetch task in Hive work faster than a Map-only task?

|▌冷眼眸甩不掉的悲伤, submitted on 2020-07-29 07:57:45
Question: It is possible to make Hive use a Fetch task for simple queries, instead of Map or MapReduce, with the hive.fetch.task.conversion parameter. Please explain why the Fetch task runs so much faster than a Map task, especially for simple work (for example, select * from table limit 10;). What does the map-only task do additionally in this case? The Fetch task is more than 20 times faster in my case. Both tasks have to read the table data, don't they? Answer 1: FetchTask directly fetches data,
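A minimal sketch of switching the conversion on and off for the same query, so the two execution paths can be timed side by side. The table name `some_table` is hypothetical. The values `none` and `more` and the threshold property are standard Hive settings, but whether a given query actually converts depends on the Hive version and the query shape, so treat this as an experiment and not a guarantee.

```sql
-- some_table is a hypothetical table name used only for illustration.

-- Allow simple select / filter / limit queries to run as a Fetch task.
set hive.fetch.task.conversion=more;
-- Input size limit in bytes for the conversion (1 GB here, chosen for illustration).
set hive.fetch.task.conversion.threshold=1073741824;

explain select * from some_table limit 10;
select * from some_table limit 10;

-- Disable the conversion so the same query is compiled without a Fetch-only plan.
set hive.fetch.task.conversion=none;

explain select * from some_table limit 10;
select * from some_table limit 10;
```

Comparing the two EXPLAIN outputs shows whether the plan consists of a Fetch stage alone or also includes a job stage that must be submitted to the cluster.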

Why does the Fetch task in Hive work faster than a Map-only task?

随声附和, submitted on 2020-07-29 07:57:09
Question: It is possible to make Hive use a Fetch task for simple queries, instead of Map or MapReduce, with the hive.fetch.task.conversion parameter. Please explain why the Fetch task runs so much faster than a Map task, especially for simple work (for example, select * from table limit 10;). What does the map-only task do additionally in this case? The Fetch task is more than 20 times faster in my case. Both tasks have to read the table data, don't they? Answer 1: FetchTask directly fetches data,