yarn

Hadoop (24) - Hadoop Optimization

徘徊边缘 posted on 2019-12-18 04:22:06
1. Why MapReduce runs slowly, and how to optimize it. MapReduce optimization is usually considered from six angles: data input, the Map phase, the Reduce phase, I/O transfer, data skew, and common tuning parameters. Data input; Map phase; Reduce phase; I/O transfer; data skew (its symptoms and ways to reduce it); common tuning parameters. Resource-related: the following parameters take effect when configured in the user's own MR application (mapred-default.xml). Parameter and description:
mapreduce.map.memory.mb - upper limit on the resources one MapTask may use (in MB); default 1024. If a MapTask actually uses more than this, it is forcibly killed.
mapreduce.reduce.memory.mb - upper limit on the resources one ReduceTask may use (in MB); default 1024. If a ReduceTask actually uses more than this, it is forcibly killed.
mapreduce.map.cpu.vcores - maximum number of CPU cores per MapTask; default: 1.
mapreduce.reduce.cpu.vcores - maximum number of CPU cores per ReduceTask; default: 1.
mapreduce.reduce.shuffle.parallelcopies - number of parallel fetchers each Reduce uses to pull data from Maps; default: 5.
mapreduce.reduce
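As a sketch only (values are illustrative, not recommendations), the resource parameters above could be overridden at the job level with a configuration fragment like this:

```xml
<!-- Illustrative only: job-level overrides for the resource parameters above -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.shuffle.parallelcopies</name>
  <value>10</value>
</property>
```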

Spark集群的任务提交执行流程

核能气质少年 posted on 2019-12-18 00:53:01
Originally posted at: https://www.linuxidc.com/Linux/2018-02/150886.htm
I. Spark on Standalone
1. After the Spark cluster starts, Workers register with the Master.
2. After a program is submitted with spark-submit, the driver and application also register with the Master.
3. A SparkContext object is created; its main components include the DAGScheduler and the TaskScheduler.
4. After the Driver registers the Application with the Master, the Master launches Executors on Worker nodes according to the application's requirements.
5. Each Executor creates a thread pool for running tasks, then registers itself back with the Driver (reverse registration).
6. DAGScheduler: converts a Spark job into a DAG (Directed Acyclic Graph) of Stages, splitting Stages along wide/narrow dependencies, then wraps each Stage into a TaskSet and sends it to the TaskScheduler; it also handles failures caused by lost shuffle data.
7. TaskScheduler: maintains all TaskSets, dispatches Tasks to Executors on each node (according to the data-locality policy), monitors task status, and retries failed tasks.
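The flow above is kicked off by spark-submit. A minimal invocation against a standalone master might look like the following (the host, class, and jar names are placeholders, not taken from the original post):

```shell
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode client \
  --class com.example.MyApp \
  /path/to/my-app.jar
```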

How to submit a spark job on a remote master node in yarn client mode?

对着背影说爱祢 posted on 2019-12-17 23:43:21
Question: I need to submit Spark apps/jobs onto a remote Spark cluster. I currently have Spark on my machine, and the IP address of the master node, as yarn-client. By the way, my machine is not in the cluster. I submit my job with this command: ./spark-submit --class SparkTest --deploy-mode client /home/vm/app.jar I have the address of my master hardcoded into my app in the form val spark_master = spark://IP:7077 And yet all I get is the error 16/06/06 03:04:34 INFO AppClient$ClientEndpoint: Connecting to master
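One commonly suggested approach (an assumption here, since the excerpt is cut off before any answer): when the cluster manager is YARN, the master is not a spark:// URL at all; instead, point spark-submit at YARN and give it the cluster's Hadoop configuration:

```shell
# Assumes the cluster's core-site.xml / yarn-site.xml have been copied locally
export HADOOP_CONF_DIR=/path/to/cluster-hadoop-conf
./spark-submit \
  --class SparkTest \
  --master yarn \
  --deploy-mode client \
  /home/vm/app.jar
```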

Hadoop: Connecting to ResourceManager failed

别说谁变了你拦得住时间么 posted on 2019-12-17 23:26:44
Question: After installing Hadoop 2.2 and trying to launch the pipes example, I got the following error (the same error shows up after trying to launch hadoop jar hadoop-mapreduce-examples-2.2.0.jar wordcount someFile.txt /out): /usr/local/hadoop$ hadoop pipes -Dhadoop.pipes.java.recordreader=true -Dhadoop.pipes.java.recordwriter=true -input someFile.txt -output /out -program bin/wordcount DEPRECATED: Use of this script to execute mapred command is deprecated. Instead use the mapred command for it. 13/12
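A frequent cause of "Connecting to ResourceManager failed" is a missing or wrong ResourceManager address in yarn-site.xml. A hedged sketch (the hostname is a placeholder; 8032 is the conventional default client port):

```xml
<!-- Illustrative: point clients at the ResourceManager explicitly -->
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>resourcemanager-host</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>resourcemanager-host:8032</value>
</property>
```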

Spark Fixed Width File Import Large number of columns causing high Execution time

冷暖自知 posted on 2019-12-17 21:36:33
Question: I am getting a fixed-width .txt source file from which I need to extract 20K columns. Since there is a lack of libraries to process fixed-width files in Spark, I have developed code that extracts the fields from fixed-width text files. The code reads the text file as an RDD with sparkContext.textFile("abc.txt"), then reads the JSON schema to get the column names and the width of each column. In the function I read the fixed-length string and, using the start and end positions, we use the substring function to
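The extraction step described, slicing each record by the (start, width) pairs taken from a schema, can be sketched in plain Python. The schema entries below are made up for illustration; a real job would load them from the JSON schema and apply the function per RDD record:

```python
# Minimal fixed-width field extraction: slice each record by (start, width).
# The schema entries are illustrative placeholders.
schema = [("id", 0, 4), ("name", 4, 6), ("amount", 10, 6)]

def parse_fixed_width(line, schema):
    """Return a dict of field name -> stripped substring for one record."""
    return {name: line[start:start + width].strip()
            for name, start, width in schema}

record = "0042Alice   1250"
print(parse_fixed_width(record, schema))
```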

Hive tables not found when running in YARN-Cluster mode

↘锁芯ラ posted on 2019-12-17 20:28:43
Question: I have a Spark (version 1.4.1) application on HDP 2.3. It works fine when running in YARN-Client mode. However, when running in YARN-Cluster mode, none of my Hive tables can be found by the application. I submit the application like so: ./bin/spark-submit --class com.myCompany.Main --master yarn-cluster --num-executors 3 --driver-memory 4g --executor-memory 10g --executor-cores 1 --jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-rdbms-3.2.9.jar,lib/datanucleus-core-3.2.10.jar
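In yarn-cluster mode the driver runs on a cluster node, so it may not see the Hive configuration present on the submitting machine. A commonly suggested fix (an assumption, since the excerpt is cut off before any answer) is shipping hive-site.xml with the job (the path is a placeholder):

```shell
./bin/spark-submit --class com.myCompany.Main --master yarn-cluster \
  --files /etc/hive/conf/hive-site.xml \
  --jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-rdbms-3.2.9.jar,lib/datanucleus-core-3.2.10.jar \
  app.jar
```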

Hadoop is not showing my job in the job tracker even though it is running

二次信任 posted on 2019-12-17 19:39:09
Question: Problem: When I submit a job to my Hadoop 2.2.0 cluster, it doesn't show up in the job tracker, but the job completes successfully: I can see the output, it runs correctly, and it prints output as it runs. I have tried multiple options, but the job tracker does not see the job. If I run a streaming job using the 2.2.0 hadoop it shows up in the task tracker, but when I submit it via the hadoop-client API it does not show up in the job tracker. I am looking at the ui interface on

Oozie shell action memory limit

回眸只為那壹抹淺笑 posted on 2019-12-17 16:25:20
Question: We have an Oozie workflow with a shell action that needs more memory than what a map task is given by YARN by default. How can we give it more memory? We have tried adding the following configuration to the action: <configuration> <property> <name>mapreduce.map.memory.mb</name> <value>6144</value> <!-- for example --> </property> </configuration> We have set this both as an inline (in the workflow.xml) configuration and as a jobXml. Neither has had any effect. Answer 1: We found the answer: A
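The excerpt is cut off before the answer, but a commonly cited explanation (an assumption here) is that a shell action runs inside an Oozie launcher map task, whose resources are controlled by properties carrying the oozie.launcher prefix rather than the plain mapreduce ones:

```xml
<!-- Illustrative: the oozie.launcher prefix is the key point; value is an example -->
<configuration>
  <property>
    <name>oozie.launcher.mapreduce.map.memory.mb</name>
    <value>6144</value>
  </property>
</configuration>
```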

Where are logs in Spark on YARN?

南笙酒味 posted on 2019-12-17 15:39:45
Question: I'm new to Spark. I can now run Spark 0.9.1 on YARN (2.0.0-cdh4.2.1), but there is no log after execution. The following command is used to run a Spark example, but logs are not found in the history server as they would be for a normal MapReduce job. SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.0.0-cdh4.2.1.jar \ ./bin/spark-class org.apache.spark.deploy.yarn.Client --jar ./spark-example-1.0.0.jar \ --class SimpleApp --args yarn-standalone --num-workers 3 --master-memory 1g \ --worker
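For context (an assumption, not from the truncated post, and availability may vary by Hadoop version): on YARN, logs of finished applications are typically aggregated and retrieved with the yarn CLI, provided log aggregation is enabled; the application ID below is a placeholder:

```shell
yarn logs -applicationId application_1400000000000_0001
```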

How to log using log4j to local file system inside a Spark application that runs on YARN?

心不动则不痛 posted on 2019-12-17 15:23:46
Question: I'm building an Apache Spark Streaming application and cannot make it log to a file on the local filesystem when running it on YARN. How can I achieve this? I've set up a log4j.properties file so that it can successfully write to a log file in the /tmp directory on the local file system (shown below partially): log4j.appender.file=org.apache.log4j.FileAppender log4j.appender.file.File=/tmp/application.log log4j.appender.file.append=false log4j.appender.file.layout=org.apache.log4j.PatternLayout log4j
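A commonly suggested approach (assumed, since the excerpt is cut off before any answer) is to ship the log4j.properties file to the YARN containers and point both driver and executor JVMs at it; the paths and the trailing jar name are placeholders:

```shell
spark-submit \
  --files /path/to/log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  app.jar
```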