yarn

List all YARN applications in a Hadoop cluster through Java

纵饮孤独 submitted on 2019-12-11 09:48:43
Question: Running the command yarn application -list on my Hadoop cluster returns the list of running applications. I want to fetch this list using Java. Currently I am using the YarnClient API with these Maven dependencies:

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.7.3</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-yarn-client</artifactId>
  <version>2.7.0</version>
</dependency>

My code looks like: YarnConfiguration conf = new
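A minimal sketch of how the YarnClient API can list applications (not from the original question; assumes the hadoop-yarn-client dependency above and a ResourceManager reachable through the yarn-site.xml on the classpath):

    import java.util.List;

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ListYarnApplications {
        public static void main(String[] args) throws Exception {
            // Reads yarn-site.xml / core-site.xml from the classpath.
            YarnConfiguration conf = new YarnConfiguration();

            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(conf);
            yarnClient.start();
            try {
                // Rough equivalent of "yarn application -list".
                List<ApplicationReport> apps = yarnClient.getApplications();
                for (ApplicationReport app : apps) {
                    System.out.println(app.getApplicationId() + "\t"
                            + app.getName() + "\t"
                            + app.getYarnApplicationState());
                }
            } finally {
                yarnClient.stop();
            }
        }
    }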

How to set up a Spark cluster on Windows machines?

不打扰是莪最后的温柔 submitted on 2019-12-11 08:48:02
Question: I am trying to set up a Spark cluster on Windows machines. The way to go here is using standalone mode, right? What are the concrete disadvantages of not using Mesos or YARN? And how much pain would it be to use either one of those? Does anyone have some experience here?

Answer 1: FYI, I got an answer in the user group: https://groups.google.com/forum/#!topic/spark-users/SyBJhQXBqIs The standalone mode is indeed the way to go. Mesos does not work under Windows, and YARN probably does not either.

Answer 2:

Hadoop three-node fully distributed setup

六眼飞鱼酱① submitted on 2019-12-11 08:21:11
Disable the firewall on all three nodes: service iptables stop and chkconfig iptables off. Change the hostname on all three nodes: vim /etc/sysconfig/network and edit the HOSTNAME property, for example HOSTNAME=hadoop01, then apply it: source /etc/sysconfig/network. Map IPs to hostnames on all three nodes: vim /etc/hosts and add:
192.168.229.131 hadoop01
192.168.229.132 hadoop02
192.168.229.133 hadoop03
Note: once the mapping is done, the hosts file should be identical on all three nodes. Reboot all three nodes: reboot. Configure passwordless SSH between the three nodes: ssh-keygen. Note: the following three commands must be run on every node:
ssh-copy-id root@hadoop01
ssh-copy-id root@hadoop02
ssh-copy-id root@hadoop03
When done, verify passwordless login from each node:
ssh hadoop01
ssh hadoop02
ssh hadoop03
Install the JDK. Install Zookeeper. ***** On the first node ***** Unpack the Hadoop archive: tar -xvf hadoop-2.7.1_64bit.tar.gz

How are Spark Executors launched if Spark (on YARN) is not installed on the worker nodes?

懵懂的女人 submitted on 2019-12-11 08:07:40
Question: I have a question regarding Apache Spark running on YARN in cluster mode. According to this thread, Spark itself does not have to be installed on every (worker) node in the cluster. My problem is with the Spark executors: in general, YARN, or rather the ResourceManager, is supposed to decide about resource allocation. Hence, Spark executors could be launched randomly on any (worker) node in the cluster. But then, how can Spark executors be launched by YARN if Spark is not installed on any

Yarn nodemanager not starting up. Getting no errors

Deadly submitted on 2019-12-11 07:43:33
Question: I have Hadoop 2.7.4 installed on Ubuntu 16.04. I'm trying to run it in pseudo-distributed mode. I have a /hadoop partition mounted for all my Hadoop files, NameNode and DataNode files. My core-site.xml is:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
My hdfs-site.xml is:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/hadoop/nodes/namenode</value>
    <

I am getting "executor running beyond memory limits" when running a big join in Spark

隐身守侯 submitted on 2019-12-11 07:18:23
Question: I am getting the following error in the driver of a big join in Spark. We have 3 nodes with 32 GB of RAM, and the total input size of the join is 150 GB. (The same app runs properly when the input file size is 50 GB.) I have set storage.memoryFraction to 0.2 and shuffle.memoryFraction to 0.2, but I still keep getting the "running beyond physical limits" error.
15/04/07 19:58:17 INFO yarn.YarnAllocator: Container marked as failed: container_1426882329798_0674_01_000002. Exit status: 143. Diagnostics:
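A commonly suggested mitigation for containers killed with exit status 143 (not from the original thread) is to reserve more off-heap headroom per executor via spark.yarn.executor.memoryOverhead, the Spark 1.x property name; a minimal Java sketch with illustrative values:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class BigJoinJob {
        public static void main(String[] args) {
            // Leave more room between the executor heap and the YARN container limit,
            // so YARN does not kill the container for exceeding physical memory.
            SparkConf conf = new SparkConf()
                    .setAppName("big-join")
                    .set("spark.executor.memory", "6g")                 // heap per executor (illustrative)
                    .set("spark.yarn.executor.memoryOverhead", "2048"); // extra MB per container (illustrative)

            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... build the inputs and run the join here ...
            sc.stop();
        }
    }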

YARN API: get applications by elapsedTime

牧云@^-^@ submitted on 2019-12-11 05:54:38
Question: Is there an easy way to query the YARN applications API to get applications which have run for more than x amount of time? The following URL gives a list of apps, but it doesn't look like it respects the elapsedTime parameter: http://<RM_DOMAIN>:<RM_PORT>/ws/v1/cluster/apps?states=RUNNING&elapsedTime=200000

Answer 1: elapsedTime is not a supported query parameter. You can use jq to filter the apps that match the criteria: curl http://<RM_DOMAIN>:<RM_PORT>/ws/v1/cluster/apps?states=RUNNING | jq '.apps.app[
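An alternative sketch in Java (not part of the original answer): since elapsedTime cannot be filtered server-side, fetch the running applications with YarnClient and filter client-side on how long each has been running; assumes the hadoop-yarn-client dependency shown earlier:

    import java.util.EnumSet;
    import java.util.List;

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.api.records.YarnApplicationState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class LongRunningApps {
        public static void main(String[] args) throws Exception {
            long minElapsedMillis = 200000L;  // same threshold as in the URL above

            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();
            try {
                // Only applications currently in the RUNNING state.
                List<ApplicationReport> running =
                        yarnClient.getApplications(EnumSet.of(YarnApplicationState.RUNNING));
                long now = System.currentTimeMillis();
                for (ApplicationReport app : running) {
                    long elapsed = now - app.getStartTime();
                    if (elapsed > minElapsedMillis) {
                        System.out.println(app.getApplicationId() + "\t" + elapsed + " ms");
                    }
                }
            } finally {
                yarnClient.stop();
            }
        }
    }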

How to get memory and CPU usage of a Spark application?

笑着哭i submitted on 2019-12-11 05:13:55
Question: I want to get the average resource utilization of a Spark job for monitoring purposes. How can I poll the resources, i.e. CPU and memory utilization, of a Spark application?

Answer 1: You may check the stderr log of the completed Spark application. Go to the YARN ResourceManager UI, click on an application ID and then "Logs" on the right side of the appattempt_* line. Scroll to Log Type: stderr and click "Click here for the full log". Look in the log for something like this: "yarn.YarnAllocator: Will request 256
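For programmatic polling, a sketch (not from the original answer) using the YARN client API, which exposes aggregate memory-seconds and vcore-seconds per application (available in Hadoop 2.4+; the usage report may be empty for applications the ResourceManager no longer tracks):

    import java.util.List;

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.api.records.ApplicationResourceUsageReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class AppResourceUsage {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();
            try {
                List<ApplicationReport> apps = yarnClient.getApplications();
                for (ApplicationReport report : apps) {
                    ApplicationResourceUsageReport usage = report.getApplicationResourceUsageReport();
                    // memory-seconds / vcore-seconds are aggregated over the application's lifetime.
                    System.out.println(report.getApplicationId()
                            + "\tmemory-seconds=" + usage.getMemorySeconds()
                            + "\tvcore-seconds=" + usage.getVcoreSeconds()
                            + "\tused-containers=" + usage.getNumUsedContainers());
                }
            } finally {
                yarnClient.stop();
            }
        }
    }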

HDP 2.5: Spark History Server UI won't show incomplete applications

╄→尐↘猪︶ㄣ submitted on 2019-12-11 05:11:42
Question: I set up a new Hadoop cluster with Hortonworks Data Platform 2.5. In the "old" cluster (HDP 2.4) I was able to see the information about running Spark jobs via the History Server UI by clicking the link "Show incomplete applications". In the new installation this link opens the page, but it always says "No incomplete applications found!" (even when an application is still running). I just saw that the YARN ResourceManager UI shows two different kinds of links in the "Tracking UI"

Hive on Spark: Failed to create spark client

让人想犯罪 __ submitted on 2019-12-11 04:43:44
Question: I'm trying to make Hive 2.1.1 on Spark 2.1.0 work on a single instance. I'm not sure that's the right approach. Currently I only have one instance, so I can't build a cluster. When I run any insert query in Hive, I get the error:
hive> insert into mcus (id, name) values (1, 'ARM');
Query ID = server_20170223121333_416506b4-13ba-45a4-a0a2-8417b187e8cc
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=