apache-tez

Hive Tez reducers are running super slow

风格不统一 Submitted on 2020-01-04 02:29:13
Question: I have joined multiple tables and the total number of rows is around 25 billion. On top of that, I am doing aggregation. Here are my Hive settings, which I am using to generate the final output. I am not really sure how to tune the query and make it run faster. Currently, I am doing trial and error to see if that can produce some results, but that doesn't seem to be working. Mappers are running fast but reducers are taking forever to finish. Could anyone share your thoughts on this
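As a hedged illustration only (the settings themselves are not included in the snippet), reducer parallelism on Hive/Tez is usually governed by knobs like the following; the property names are standard Hive settings, but the values here are placeholders, not a recommendation:

    set hive.exec.reducers.bytes.per.reducer=134217728;  -- smaller value means more reducers share the 25B-row shuffle
    set hive.exec.reducers.max=1009;                      -- upper bound on the number of reducers
    set hive.tez.auto.reducer.parallelism=true;           -- let Tez shrink the reducer count at runtime based on actual data
    set hive.tez.container.size=4096;                     -- MB of memory per Tez container (illustrative value)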

How do I fix this Kryo exception when using a UDF on hive?

倖福魔咒の Submitted on 2019-12-24 19:29:50
Question: I have a Hive query that worked in the Hortonworks 2.6 sandbox, but it doesn't work on sandbox ver. 3.0 because of this exception: Caused by: org.apache.hive.com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 95 Serialization trace: parentOperators (org.apache.hadoop.hive.ql.exec.vector.reducesink.VectorReduceSinkLongOperator) childOperators (org.apache.hadoop.hive.ql.exec.vector.VectorFilterOperator) childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator)
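The serialization trace points at vectorized operators (VectorReduceSinkLongOperator, VectorFilterOperator). One workaround sometimes tried for Kryo plan-deserialization errors of this kind, offered here only as a hedged sketch and not a confirmed fix for this exact exception, is to disable vectorized execution for the failing query:

    -- Sketch: avoid the Vector* operators named in the trace by turning off vectorization.
    set hive.vectorized.execution.enabled=false;
    set hive.vectorized.execution.reduce.enabled=false;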

Tez job fails when submitted by a different user

血红的双手。 Submitted on 2019-12-23 04:20:28
Question: Configured a Hadoop-2.6.0 HA cluster with Kerberos security. When submitting the example job using tez-example-0.6.0.jar in the yarn-tez framework as a different user, I get the exception below: Exception java.io.IOException: The ownership on the staging directory hdfs://clustername/tmp/staging is not as expected. It is owned by Kumar. The directory must be owned by the submitter TestUser or by TestUser The directory has full permissions but I still get the above exception. But when submitting a job
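The message suggests every submitter is sharing one staging directory (hdfs://clustername/tmp/staging) that Kumar already owns. Tez exposes a staging-directory property, so a hedged sketch of a per-user layout would look like the line below; the path is illustrative, and for jobs submitted outside a Hive session the same property would go into tez-site.xml or be passed as a -D option if the launcher supports it:

    -- Illustrative: give each submitter a staging directory under their own user name.
    set tez.staging-dir=/tmp/${user.name}/staging;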

How can I fix java.lang.IllegalArgumentException: Unrecognized Hadoop major version number: 3.1.0?

徘徊边缘 Submitted on 2019-12-11 17:03:32
Question: I get a java.lang.IllegalArgumentException: Unrecognized Hadoop major version number: 3.1.0 exception in my query. Here's the query: WITH t1 as (select * from browserdata join citydata on cityid=id), t2 as (select uap.device as device, uap.os as os, uap.browser as browser, name as cityname from t1 lateral view ParseUserAgentUDTF(UserAgent) uap as device, os, browser), t3 as (select t2.cityname as cityname, t2.device as device, t2.browser as browser, t2.os as os, count(*) as count from t2
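This exception usually comes from Hive's shims layer seeing a Hadoop major version it does not recognize, which in practice often means a jar on the session classpath (for example, the jar providing ParseUserAgentUDTF) bundles an older hive-exec/hadoop dependency than the cluster runs. As a hedged sketch, re-registering the UDTF from a jar rebuilt against the cluster's Hive/Hadoop 3.x dependencies would follow the usual pattern; the jar path and class name below are hypothetical:

    -- Hypothetical path and class; only the registration pattern is the point here.
    ADD JAR hdfs:///user/hive/udfs/parse-useragent-udtf-rebuilt.jar;
    CREATE TEMPORARY FUNCTION ParseUserAgentUDTF AS 'com.example.udtf.ParseUserAgentUDTF';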

Suggestions required for increasing utilization of YARN containers on our discovery cluster

瘦欲@ Submitted on 2019-12-11 16:57:56
Question: Current setup: we have a 10-node discovery cluster. Each node of this cluster has 24 cores and 264 GB of RAM. Keeping some memory and CPU aside for background processes, we are planning to use 240 GB of memory. Now, when it comes to container setup, as each container may need 1 core, the maximum we can have is 24 containers, each with 10 GB of memory. Usually clusters have containers with 1-2 GB of memory, but we are restricted by the available cores we have, or maybe I am missing something. Problem
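For the memory arithmetic in the question (240 GB / 10 GB = 24 containers, matching the 24 cores), it is normally the per-container memory that drives how many containers fit on a node, since the YARN capacity scheduler by default allocates on memory alone unless the dominant-resource calculator is enabled. A hedged sketch with illustrative values, assuming the workload runs through Hive on Tez:

    -- Illustrative only: 240 GB / 4 GB per container would fit roughly 60 containers per node,
    -- if vcores were not the limiting resource.
    set hive.tez.container.size=4096;        -- MB per Tez task container
    set tez.am.resource.memory.mb=4096;      -- MB for the Tez ApplicationMaster container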

Difference in behaviour while running “count(*)” in Tez and MapReduce

﹥>﹥吖頭↗ Submitted on 2019-12-11 08:04:07
Question: Recently I came across this issue. I had a file at a Hadoop Distributed File System path and a related Hive table. The table had 30 partitions on both sides. I deleted 5 partitions from HDFS and then executed "msck repair table <db.tablename>;" on the Hive table. It completed fine but output "Partitions missing from filesystem:" I tried running select count(*) <db.tablename>; (on Tez) and it failed with the following error: Caused by: java.util.concurrent.ExecutionException: java.io
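When partition directories are removed from HDFS but the metastore still lists them, Tez can fail while planning splits for the missing paths. A commonly used way to bring the metastore back in line is to drop those partitions explicitly; the partition column and value below are hypothetical:

    -- Hypothetical partition spec; removes the metastore entry whose HDFS directory was deleted.
    ALTER TABLE db.tablename DROP IF EXISTS PARTITION (dt='2019-01-01');

On Hive 3.x, msck repair table db.tablename sync partitions can perform the same cleanup in one statement, if that syntax is available in your build.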

Why is HDFS throwing LeaseExpiredException in a Hadoop cluster (AWS EMR)?

半腔热情 Submitted on 2019-12-10 16:38:45
Question: I am getting a LeaseExpiredException in a Hadoop cluster - tail -f /var/log/hadoop-hdfs/hadoop-hdfs-namenode-ip-172-30-2-148.log 2016-09-21 11:54:14,533 INFO BlockStateChange (IPC Server handler 10 on 8020): BLOCK* InvalidateBlocks: add blk_1073747501_6677 to 172.30.2.189:50010 2016-09-21 11:54:14,534 INFO org.apache.hadoop.ipc.Server (IPC Server handler 31 on 8020): IPC Server handler 31 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.complete from 172.30.2.189:37674 Call#34 Retry#0
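A LeaseExpiredException from the NameNode generally means a file was completed, deleted, or re-opened by another writer while the original writer's lease was no longer valid, for example when duplicate task attempts race on the same output file. If the writes come from Hive/MapReduce jobs, one mitigation sometimes discussed, offered here only as a hedged sketch and not a diagnosis of this particular log, is disabling speculative execution:

    -- Sketch: avoid duplicate speculative attempts writing to the same output files.
    set mapreduce.map.speculative=false;
    set mapreduce.reduce.speculative=false;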

How to reduce the number of files generated by SQL “Alter Table/Partition Concatenate” in Hive?

吃可爱长大的小学妹 Submitted on 2019-12-07 04:39:22
Question: Hive version: 1.2.1 Configuration: set hive.execution.engine=tez; set hive.merge.mapredfiles=true; set hive.merge.smallfiles.avgsize=256000000; set hive.merge.tezfiles=true; HQL: ALTER TABLE `table_name` PARTITION (partion_name1 = 'val1', partion_name2='val2', partion_name3='val3', partion_name4='val4') CONCATENATE; I use this HQL to merge the files of a specific table/partition. However, after execution there are still many files in the output directory, and their sizes are far less than 256000000. So
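When CONCATENATE leaves files much smaller than hive.merge.smallfiles.avgsize, a pattern sometimes used instead is to rewrite the partition with an INSERT OVERWRITE so the hive.merge.* settings apply to the final output. The column list below is hypothetical, and whether reading and overwriting the same partition in one statement is acceptable depends on your Hive version, so treat this as a sketch only:

    -- Sketch: rewrite the partition so merged output files approach hive.merge.size.per.task.
    set hive.merge.tezfiles=true;
    set hive.merge.smallfiles.avgsize=256000000;
    set hive.merge.size.per.task=256000000;
    INSERT OVERWRITE TABLE `table_name`
      PARTITION (partion_name1='val1', partion_name2='val2', partion_name3='val3', partion_name4='val4')
    SELECT col1, col2  -- hypothetical non-partition columns
    FROM `table_name`
    WHERE partion_name1='val1' AND partion_name2='val2'
      AND partion_name3='val3' AND partion_name4='val4';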

Map-Reduce Logs on Hive-Tez

自古美人都是妖i Submitted on 2019-12-06 05:19:24
Question: I want to understand the Map-Reduce logs produced after running a query on Hive-Tez. What do the lines after INFO: convey? Here I have attached a sample: INFO : Session is already open INFO : Dag name: SELECT a.Model...) INFO : Tez session was closed. Reopening... INFO : Session re-established. INFO : INFO : Status: Running (Executing on YARN cluster with App id application_14708112341234_1234) INFO : Map 1: -/- Map 3: -/- Map 4: -/- Map 7: -/- Reducer 2: 0/15 Reducer 5: 0/26 Reducer 6: 0/13

Is Hive faster than Spark?

风流意气都作罢 Submitted on 2019-12-06 03:32:51
Question: After reading What is hive, Is it a database?, a colleague yesterday mentioned that he was able to filter a 15B-row table and join it with another table after doing a "group by", which resulted in 6B records, in only 10 minutes! I wonder if this would be slower in Spark, since now with DataFrames they may be comparable, but I am not sure, hence the question. Is Hive faster than Spark? Or does this question not have meaning? Sorry for my ignorance. He uses the latest Hive, which from seems to be