yarn | 易学教程

A Brief Look at YARN Resource Scheduling


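1. Shortcomings of first-generation MR: (1) The JobTracker carries too much load. It maintains job state and also the state of every task in each job, so too many resources are consumed on the JobTracker. (2) The TaskTracker spends a large share of its resources on scheduling and coordination, which makes OOM errors likely. (3) Resources are rigidly divided into map slots and reduce slots. In short: 1. poor scalability; 2. low reliability; 3. low resource utilization; 4. no support for multiple computing frameworks.
2. First, what is YARN? YARN is a resource scheduling platform (a new feature in Hadoop 2 and later) that provides server compute resources to computing programs. It is comparable to a distributed operating system, while MR and other computing programs are comparable to applications running on top of that operating system. It makes up for the shortcomings of the first-generation MR programming framework. YARN knows nothing about how a program runs; it only schedules compute resources. The managing role is the ResourceManager and the compute-resource role is the NodeManager.
3. Advantages of YARN: as the resource scheduler, it is fully decoupled from the programs it runs, which means YARN can run many kinds of distributed computing programs (MapReduce, Storm, Spark, and so on). 1: YARN splits its work between two separate daemons, the ResourceManager and the per-application ApplicationMaster. 2: The ResourceManager and the NodeManagers form the basic data-computation framework; the ResourceManager coordinates the use of cluster resources.
4. Basic framework of YARN. Source: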

YARN Architecture and Principles


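1. Background of YARN. MapReduce itself has several problems: 1) The JobTracker is a single point of failure; if the JobTracker of a Hadoop cluster goes down, the whole distributed cluster becomes unusable. 2) The JobTracker is under heavy access pressure, which limits the scalability of the system. 3) Computing frameworks other than MapReduce, such as Storm, Spark, and Flink, are not supported.
Compared with the old MapReduce, YARN adopts a layered cluster framework and has the following advantages. 1) Hadoop 2.0 introduced HDFS Federation, which lets multiple NameNodes manage different directories and thereby provides access isolation and horizontal scaling. The single point of failure of a running NameNode is addressed by the NameNode hot-standby scheme (NameNode HA). 2) YARN separates resource management from application management and implements them in the ResourceManager and ApplicationMaster processes respectively. The ResourceManager handles only resource management and scheduling, while the ApplicationMaster handles application-specific task splitting, task scheduling, and fault tolerance. 3) YARN is backward compatible: jobs that users ran on MR1 run on YARN without any modification. 4) Resources are expressed in units of memory (the YARN version current when this was written did not account for CPU usage), which is more reasonable than counting the remaining slots. 5) Multiple frameworks are supported; YARN is no longer purely a computing framework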

“Bad substitution” when submitting spark job to yarn-cluster


Question: I am doing a smoke test against a YARN cluster using yarn-cluster as the master with the SparkPi example program. Here is the command line: $SPARK_HOME/bin/spark-submit --master yarn-cluster --executor-memory 8G --executor-cores 240 --class org.apache.spark.examples.SparkPi examples/target/scala-2.11/spark-examples-1.4.1-hadoop2.7.1.jar YARN accepts the job but then complains about a "bad substitution". Maybe the cause is hdp.version? 15/09/01 21:54:05 INFO yarn.Client: Application

Apache Spark on YARN: Large number of input data files (combine multiple input files in spark)


Question: I need help with implementation best practices. The operating environment is as follows: Log data files arrive irregularly. The size of a log data file ranges from 3.9KB to 8.5MB; the average is about 1MB. The number of records in a data file ranges from 13 lines to 22000 lines; the average is about 2700 lines. Each data file must be post-processed before aggregation. The post-processing algorithm can change. Post-processed files are managed separately from the original data files, since the post

How are containers created based on vcores and memory in MapReduce2?


Question: I have a tiny cluster composed of 1 master (namenode, secondarynamenode, resourcemanager) and 2 slaves (datanode, nodemanager). I have set in the yarn-site.xml of the master:
    yarn.scheduler.minimum-allocation-mb: 512
    yarn.scheduler.maximum-allocation-mb: 1024
    yarn.scheduler.minimum-allocation-vcores: 1
    yarn.scheduler.maximum-allocation-vcores: 2
I have set in the yarn-site.xml of the slaves:
    yarn.nodemanager.resource.memory-mb: 2048
    yarn.nodemanager.resource.cpu-vcores: 4
Then in the
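The arithmetic behind the settings in this question can be sketched in a few lines of Python. This is a simplified model, not YARN's actual code: it assumes the scheduler rounds a memory request up to a multiple of the minimum allocation and rejects requests above the maximum, and it ignores the container taken by the ApplicationMaster. The function names and the `count_vcores` flag are hypothetical, introduced only for illustration.

```python
import math


def normalize_memory_request(requested_mb, min_mb, max_mb):
    """Round a memory request up to a multiple of min_mb (simplified model).

    Assumption: requests above max_mb are rejected rather than capped.
    """
    if requested_mb > max_mb:
        raise ValueError("request exceeds the maximum allocation")
    multiples = max(1, math.ceil(requested_mb / min_mb))
    return min(multiples * min_mb, max_mb)


def containers_per_node(node_mb, node_vcores, container_mb,
                        container_vcores, count_vcores=False):
    """Estimate how many containers of one size fit on a single NodeManager.

    count_vcores is a hypothetical switch: when False only memory limits the
    count, when True the vcore limit is applied as well.
    """
    by_memory = node_mb // container_mb
    if not count_vcores:
        return by_memory
    return min(by_memory, node_vcores // container_vcores)
```

With the values from the question (2048 MB and 4 vcores per node), this model gives four 512 MB containers or two 1024 MB containers per node when only memory is counted; whether vcores also limit the count depends on how the scheduler is configured.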

WARN cluster.YarnScheduler: Initial job has not accepted any resources


Question: Any Spark job that I run fails with the following error message: 17/06/16 11:10:43 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources Spark version is 1.6, running on YARN. I am issuing jobs from pyspark. You can see from the job timeline that it runs indefinitely and no resources are added or removed. 1 answer. Answer 1: First point is that if there are enough resources such as nodes,

How to use external (custom) package in pyspark?


Question: I am trying to replicate the solution given here https://www.cloudera.com/documentation/enterprise/5-7-x/topics/spark_python.html to import external packages in pyspark. But it is failing. My code: spark_distro.py

    from pyspark import SparkContext, SparkConf

    def import_my_special_package(x):
        from external_package import external
        return external.fun(x)

    conf = SparkConf()
    sc = SparkContext()
    int_rdd = sc.parallelize([1, 2, 3, 4])
    int_rdd.map(lambda x: import_my_special_package(x)).collect()
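One common way to make a package such as `external_package` importable on the executors is to ship it as a zip archive, for example through `spark-submit --py-files` or `SparkContext.addPyFile`. The sketch below only covers building that archive with the standard library. It assumes the package is a plain directory of `.py` files; the helper name `zip_package` is hypothetical, and this is not necessarily the approach described on the linked Cloudera page.

```python
import os
import zipfile


def zip_package(package_dir, zip_path):
    """Zip a package directory so the package name is the top-level entry.

    Keeping "external_package/..." as the archive prefix is what allows
    "from external_package import external" to resolve once the zip is on
    the Python path.
    """
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Store paths relative to the parent so the package folder
                # itself appears inside the archive.
                zf.write(full_path, os.path.relpath(full_path, parent))
    return zip_path
```

The resulting file could then be passed along with the job, for instance with `sc.addPyFile(...)` before the `map` call, so that the import inside `import_my_special_package` has something to find on each worker.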

What is Memory reserved on Yarn


Question: I managed to launch a Spark application on YARN. However, memory usage is kind of weird, as you can see below: http://imgur.com/1k6VvSI What does memory reserved mean? How can I manage to efficiently use all the memory available? Thanks in advance. Answer 1: Check out this blog from Cloudera that explains the new memory management in YARN. Here are the pertinent bits: ... An implementation detail of this change that prevents applications from starving under this new flexibility is the notion of

Setting Up Hadoop 2.0 (Part 2): Hadoop Environment Configuration


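1. Hadoop 2.0 overview [1]. Compared with the earlier stable hadoop-1.x, Apache Hadoop 2.x brings significant changes. The improvements to HDFS and MapReduce are outlined here.
HDFS: to scale the name service horizontally, multiple independent NameNodes and namespaces are used. These NameNodes are federated and do not need to coordinate with one another. DataNodes store blocks for all NameNodes, and each DataNode registers with every NameNode in the cluster. DataNodes periodically send heartbeats and block reports to the NameNodes, and receive and handle commands from them.
YARN (next-generation MapReduce): a new architecture introduced in hadoop-0.23 that splits the two main functions of the JobTracker, resource management and job life-cycle management, into separate components. The new ResourceManager manages the allocation of compute resources to applications, while scheduling and coordination within each application are handled per application.
An application is either a single MapReduce job in the traditional sense or a DAG (directed acyclic graph) of such jobs. The ResourceManager and the NodeManager that manages each machine form the computation fabric of the platform.
The per-application ApplicationMaster is in effect a framework-specific library that requests resources from the ResourceManager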

YARN Memory and CPU Configuration


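Date: 2015-06-05 00:00:00 | Source: JavaChen's Blog | Original: http://blog.javachen.com/2015/06/05/yarn-memory-and-cpu-configuration.html | Topics: YARN, Hadoop
Hadoop YARN supports scheduling of both memory and CPU resources; this article describes how to configure YARN's use of memory and CPU. As a resource scheduler, YARN should take into account the compute resources of every machine in the cluster and then allocate Containers according to the resources that applications request. A Container is the basic unit of resource allocation in YARN and comes with a certain amount of memory and CPU. In a YARN cluster it is important to balance memory, CPU, and disk resources; as a rule of thumb, cluster resources are used fairly well when every two containers share one disk and one CPU core.
Memory configuration: for memory-related settings, see the Hortonworks document Determine HDP Memory Configuration Settings when configuring your cluster. The memory available to YARN and MapReduce should exclude what the operating system and other Hadoop programs need: total reserved memory = system memory + HBase memory. The following table can be used to determine how much memory to reserve:
    Memory per machine    Memory needed by the system    Memory needed by HBase
    4GB                   1GB                            1GB
    8GB                   2GB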
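The reservation rule in this excerpt (total reserved memory = system memory + HBase memory) can be written out as a small calculation. This is a minimal sketch: the lookup table holds only the 4GB row, since that is the only row fully visible here, and the names `RESERVED_GB` and `yarn_available_gb` are hypothetical.

```python
# Reserved memory in GB keyed by total RAM per machine: (system, HBase).
# Only the 4 GB row is complete in the excerpt, so only it is listed.
RESERVED_GB = {4: (1, 1)}


def yarn_available_gb(total_gb, reserved=RESERVED_GB):
    """Return the memory left for YARN and MapReduce after reservations."""
    system_gb, hbase_gb = reserved[total_gb]
    return total_gb - (system_gb + hbase_gb)


def gb_to_mb(gb):
    """Convert GB to MB, the unit used by YARN memory settings."""
    return gb * 1024
```

For a 4GB machine this leaves 2GB, or 2048 MB, which is the kind of figure one might then consider for a setting such as `yarn.nodemanager.resource.memory-mb`.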
