yarn

Suggestions for increasing utilization of YARN containers on our discovery cluster

Submitted by 瘦欲@ on 2019-12-11 16:57:56

Question: Current setup: we have a 10-node discovery cluster. Each node has 24 cores and 264 GB of RAM. Keeping some memory and CPU aside for background processes, we plan to use 240 GB of memory. Now, when it comes to container setup, each container needs at least 1 core, so we can have at most 24 containers, each with 10 GB of memory. Clusters usually run containers with 1-2 GB of memory, but we are restricted by the cores available to us, or maybe I am missing something.

Problem
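As a sanity check on the arithmetic above, here is a minimal shell sketch of the per-node container math (the 240 GB and 24-core figures are from the question; in a real cluster the limits come from `yarn.nodemanager.resource.memory-mb` and `yarn.nodemanager.resource.cpu-vcores`):

```shell
# Per-node figures from the question
usable_mem_gb=240
cores=24
container_mem_gb=10

by_memory=$((usable_mem_gb / container_mem_gb))  # containers that fit by memory
by_cores=$cores                                  # containers that fit by vcores (1 each)

# YARN caps containers at whichever resource runs out first
if [ "$by_memory" -lt "$by_cores" ]; then
  max_containers=$by_memory
else
  max_containers=$by_cores
fi
echo "max containers per node: $max_containers"  # prints 24
```

With 1 vcore per container, memory and cores both cap out at 24 containers per node, which is why shrinking container memory below 10 GB does not buy more containers here.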

Using for loop to pass arguments to command

Submitted by 放肆的年华 on 2019-12-11 16:38:37

Question: I have the command below, which produces this output.

Command:

```shell
oozie job -info 0007218-170910003406158-oozie-oozi-W | grep job | awk '{print $3}' | cut -c1-23 | sed 's/job/application/'
```

Output:

```
application_1505017974932_23474
application_1505017974932_23478
application_1505017974932_23477
application_1505017974932_23473
application_1505017974932_23475
application_1505017974932_23471
application_1505017974932_23469
application_1505017974932_23476
application_1505017974932_23472
application
```
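To pass each application ID from a pipeline like the one above as an argument to another command, a common pattern is a `while read` loop over the pipeline's output. In this sketch a `printf` of two sample IDs stands in for the real `oozie` pipeline, and the `yarn application -kill` call is only echoed, not executed:

```shell
# Stand-in for: oozie job -info <job-id> | grep job | awk '{print $3}' | ...
list_apps() {
  printf '%s\n' \
    application_1505017974932_23474 \
    application_1505017974932_23478
}

# Read the output one ID per line and hand each ID to another command
list_apps | while read -r app_id; do
  # In real use this could be: yarn application -kill "$app_id"
  echo "would run: yarn application -kill $app_id"
done
```

A `while read` loop is safer than `for app in $(...)` because it never word-splits or glob-expands the IDs, though with plain application IDs either form works.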

Oozie workflow hive action stuck in RUNNING

Submitted by 拈花ヽ惹草 on 2019-12-11 14:25:15

Question: I am running Hadoop 2.4.0, Oozie 4.0.0, and Hive 0.13.0 from the Hortonworks distro. I have multiple Oozie coordinator jobs that can potentially launch workflows at around the same time. The coordinator jobs each watch different directories, and when the _SUCCESS files show up in those directories, the workflows are launched. Each workflow runs a Hive action that reads from an external directory and copies data.

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
DROP
```
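The post is truncated before the symptom details, but a frequent cause of Oozie Hive actions stuck in RUNNING is that each action consumes a launcher ApplicationMaster container, and the CapacityScheduler's default cap on AM resources lets several concurrent launchers starve the actual Hive jobs of containers. A hedged sketch of the relevant knob (the property name is from the CapacityScheduler; the 0.5 value is illustrative, not a recommendation from this post):

```xml
<!-- capacity-scheduler.xml: allow a larger share of the queue
     to be occupied by ApplicationMasters -->
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
</property>
```

Alternatively, routing Oozie launchers and the launched Hive jobs to separate queues avoids the deadlock entirely.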

Hadoop MapReduce (Yarn) using hosts with different power/specifications

Submitted by 我是研究僧i on 2019-12-11 12:09:03

Question: I currently have high-power (CPU/RAM) hosts in the cluster, and we are considering adding some hosts with good storage but low power. My concern is that this will reduce job performance: mappers/reducers on the new (less powerful) hosts will run slower, and the more powerful hosts will just have to wait for their results. Is there a way to configure this in YARN? Maybe set a priority for hosts, or assign mappers/reducers according to the number of cores on each machine. Thanks, Horatiu

Answer 1:
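YARN has no per-host priority, but you can size each NodeManager to its hardware so that weaker hosts simply advertise fewer resources and run fewer containers. A hedged sketch of per-host `yarn-site.xml` settings (the values are illustrative for a small host; node labels, enabled via `yarn.node-labels.enabled`, are another option for steering specific workloads to specific hosts):

```xml
<!-- yarn-site.xml on a low-power host: advertise fewer resources
     so YARN schedules proportionally fewer containers here -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value>
</property>
```

MapReduce speculative execution also mitigates the straggler effect: slow tasks on weak hosts get duplicated on faster ones, and the first copy to finish wins.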

hive add partition statement ignores leading zero

Submitted by 前提是你 on 2019-12-11 12:06:26

Question: I have a folder on HDFS: /user/test/year=2016/month=04/dt=25/000000_0. I need to add this partition path to a test table.

Command:

```sql
ALTER TABLE test ADD IF NOT EXISTS PARTITION (year=2016, month=04, dt=25)
```

But this ADD PARTITION command ignores the leading zero in the month partition and creates an extra folder inside 2016, month=4:

/user/test/year=2016/month=04/
/user/test/year=2016/month=4/

The table then points to /user/test/year=2016/month=4/, which doesn't contain any data
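A common workaround (the truncated post does not confirm which fix the answer gave) is to declare the partition columns as STRING and quote the values, so Hive stores the literal text "04" instead of parsing it as the integer 4; the LOCATION clause pins the partition to the existing directory:

```sql
-- Assumes the partition columns are declared as STRING in the table DDL;
-- quoted values preserve the leading zero
ALTER TABLE test ADD IF NOT EXISTS
  PARTITION (year='2016', month='04', dt='25')
  LOCATION '/user/test/year=2016/month=04/dt=25';
```

If the partition columns are numeric types, quoting alone will not help, since Hive normalizes the value back to 4.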

spark throws exception when querying large amount of data in mysql

Submitted by 流过昼夜 on 2019-12-11 11:54:53

Question: When I submit my task, which runs a query against MySQL, to YARN using Spark's cluster mode, like below:

```shell
./spark-submit --class org.com.scala.test.ScalaTestFile --master yarn --deploy-mode cluster \
  --driver-memory 8g --executor-memory 5g \
  --jars /usr/local/spark/lib/datanucleus-api-jdo-3.2.6.jar,/usr/local/spark/lib/datanucleus-core-3.2.10.jar,/usr/local/spark/lib/datanucleus-rdbms-3.2.9.jar,/usr/local/spark/lib/mysql-connector-java-5.1.26-bin.jar \
  /data/tmp/snodawn/svn/scalaScript
```

Spark on Yarn: How to prevent multiple spark jobs being scheduled

Submitted by 大憨熊 on 2019-12-11 11:32:58

Question: With Spark on YARN, I don't see a way to prevent concurrent jobs from being scheduled. My architecture is set up for purely batch processing. I need this for the following reasons: resource constraints. The user cache for Spark grows really quickly, and having multiple jobs run causes an explosion of cache space. Ideally, I'd love to know if there is a config that would ensure only one job runs at any time on YARN.

Answer 1: You can create a queue which can host only one application master and
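The truncated answer points at a dedicated queue. A hedged CapacityScheduler sketch of that idea (the queue name `batch` and the 50% capacity are hypothetical; `maximum-applications` caps how many applications the queue will accept at once):

```xml
<!-- capacity-scheduler.xml: a 'batch' queue that admits
     one application at a time -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,batch</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.batch.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.batch.maximum-applications</name>
  <value>1</value>
</property>
```

Jobs would then be submitted with `--queue batch` so that a second submission waits until the first finishes.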

How to improve performance of loading data from NON Partition table into ORC partition table in HIVE

Submitted by 喜夏-厌秋 on 2019-12-11 11:15:37

Question: I'm new to Hive querying and am looking for best practices for retrieving data from a Hive table. We have enabled Tez as the execution engine and enabled vectorization. We want to do reporting from a Hive table; I read in the Tez documentation that it can be used for real-time reporting. The scenario: from my web application, I would like to show the result of the Hive query SELECT * FROM <hive table> on the UI, but any query at the Hive command prompt takes at least 20-60 seconds, even though the Hive table has 60 GB of data,
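For the loading step named in the title, the usual pattern is a dynamic-partition insert from the non-partitioned staging table into the ORC partitioned table. A hedged sketch (the table names `orc_part` and `staging_nonpart` and the column names are hypothetical; the two SET statements are standard prerequisites for dynamic partitioning):

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column (dt) must come last in the SELECT list;
-- Hive routes each row to its partition from that value
INSERT OVERWRITE TABLE orc_part PARTITION (dt)
SELECT col1, col2, dt
FROM staging_nonpart;
```

Writing into ORC this way also gives the reporting queries predicate pushdown and partition pruning, which matters more for latency than the load itself.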

Can't kill YARN apps using ResourceManager UI after HDP 3.1.0.0-78 upgrade

Submitted by 旧巷老猫 on 2019-12-11 11:06:31

Question: I recently upgraded HDP from 2.6.5 to 3.1.0, which runs YARN 3.1.0, and I can no longer kill applications from the YARN ResourceManager UI, using either the old (:8088/cluster/apps) or the new (:8088/ui2/index.html#/yarn-apps/apps) version. I can still kill them from the shell in RHEL 7 with:

```shell
yarn app -kill {app-id}
```

These applications are submitted via Livy. Here is my workflow: open the ResourceManager UI, open the application, click Settings, and choose Kill Application. Notice the 'Logged in as
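The truncated "Logged in as" line suggests the browser session is acting as the anonymous web user (dr.who by default), which application ACLs can bar from killing apps even when the CLI user may. A hedged sketch of the knobs usually involved (whether this resolves the HDP 3.1 UI issue specifically is not confirmed by the post):

```xml
<!-- yarn-site.xml: keep UI kill actions enabled -->
<property>
  <name>yarn.resourcemanager.webapp.ui-actions.enabled</name>
  <value>true</value>
</property>

<!-- core-site.xml: the user the web UIs act as when unauthenticated;
     defaults to dr.who, which typically lacks kill permission -->
<property>
  <name>hadoop.http.staticuser.user</name>
  <value>yarn</value>
</property>
```

On a secured cluster the proper fix is web authentication (e.g. SPNEGO) plus matching admin/application ACLs, rather than widening the static user.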

Does Spark on yarn deal with Data locality while launching executors

Submitted by 本小妞迷上赌 on 2019-12-11 10:28:01

Question: I am considering static allocation of Spark executors. Does Spark on YARN consider the data locality of the raw input datasets used in the Spark application when launching executors? If it does, how does it do so, given that executors are requested and allocated when the Spark context is initialized? Multiple raw input datasets used in the application could physically reside on many different data nodes, and we can't run executors on all those
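Whatever happens at executor launch, task-level locality is governed by the scheduler's locality wait: Spark prefers to place each task on a node holding its data and falls back to less-local placement after a timeout. A hedged `spark-defaults.conf` sketch (the 3s values shown are Spark's documented defaults, not a tuning recommendation):

```
# spark-defaults.conf: how long the task scheduler waits for a
# data-local slot before falling back to a less-local one
spark.locality.wait        3s
spark.locality.wait.node   3s
```

Raising these waits trades scheduling latency for locality; with static allocation, where executor placement is fixed at startup, this task-level mechanism is the main locality lever left.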