hadoop2

hadoop/yarn and task parallelization on non-hdfs filesystems

Submitted by 我们两清 on 2019-12-03 13:50:12
I've set up a Hadoop 2.4.1 cluster and found that MapReduce applications parallelize differently depending on what kind of filesystem the input data is on. Using HDFS, a MapReduce job will spawn enough containers to maximize use of all available memory. For example, on a 3-node cluster with 172 GB of memory and each map task allocating 2 GB, about 86 application containers are created. On a filesystem that isn't HDFS (like NFS or, in my use case, a parallel filesystem), a MapReduce job will only allocate a subset of the available tasks (e.g., with the same 3-node cluster,
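For reference, the container count described above is governed by two memory settings; a minimal sketch of the relevant yarn-site.xml and mapred-site.xml entries, with illustrative values chosen only to reproduce the numbers quoted in the question:

```xml
<!-- yarn-site.xml: memory each NodeManager offers to containers
     (illustrative: roughly 57 GB per node, about 172 GB across 3 nodes) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>58368</value>
</property>

<!-- mapred-site.xml: memory requested per map container -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>

<!-- containers ≈ total NodeManager memory / per-container request ≈ 172 GB / 2 GB ≈ 86 -->
```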

Hadoop gen1 vs Hadoop gen2

Submitted by 依然范特西╮ on 2019-12-03 08:46:19
I am a bit confused about the place of the TaskTracker in Hadoop 2.x. The daemons in Hadoop 1.x are NameNode, DataNode, JobTracker, TaskTracker and SecondaryNameNode. The daemons in Hadoop 2.x are NameNode, DataNode, ResourceManager, ApplicationMaster and SecondaryNameNode. This means the JobTracker has been split into ResourceManager and ApplicationMaster. So where is the TaskTracker? In YARN (the new execution framework in Hadoop 2), MapReduce doesn't exist in the way it did before. YARN is a more general-purpose way to allocate resources on the cluster. ResourceManager, ApplicationMaster, and NodeManager now consist of
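As a concrete illustration of that split, in Hadoop 2 the MapReduce runtime is just one framework submitted to YARN; a minimal mapred-site.xml sketch using the standard property (shown for illustration only):

```xml
<!-- mapred-site.xml: run MapReduce jobs as YARN applications; a per-job
     ApplicationMaster plus the NodeManagers cover the roles the
     JobTracker/TaskTracker pair played in Hadoop 1 -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
```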

name node Vs secondary name node

Submitted by 给你一囗甜甜゛ on 2019-12-03 04:18:00
Question: Hadoop is consistent and partition tolerant, i.e. it falls under the CP category of the CAP theorem. Hadoop is not available because all the nodes are dependent on the name node. If the name node fails, the cluster goes down. But considering the fact that the HDFS cluster has a secondary name node, why can't we call Hadoop available? If the name node is down, the secondary name node could be used for the writes. What is the major difference between the name node and the secondary name node that makes
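For context, the secondary NameNode is configured as a checkpointer rather than a failover target; a minimal hdfs-site.xml sketch with the standard checkpoint properties (the values shown are the usual defaults):

```xml
<!-- hdfs-site.xml: the secondary NameNode periodically merges the edit log
     into a new fsimage; it does not take over client reads or writes -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value> <!-- checkpoint once an hour -->
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value> <!-- or after this many uncheckpointed transactions -->
</property>
```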

Increase number of Hive mappers in Hadoop 2

Submitted by 怎甘沉沦 on 2019-12-02 21:13:59
I created an HBase table from Hive and I'm trying to do a simple aggregation on it. This is my Hive query:
from my_hbase_table select col1, count(1) group by col1;
The MapReduce job spawns only 2 mappers and I'd like to increase that. With a plain MapReduce job I would configure YARN and mapper memory to increase the number of mappers. I tried the following in Hive but it did not work:
set yarn.nodemanager.resource.cpu-vcores=16;
set yarn.nodemanager.resource.memory-mb=32768;
set mapreduce.map.cpu.vcores=1;
set mapreduce.map.memory.mb=2048;
NOTE: My test cluster has only 2 nodes. The HBase
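For comparison, with file-based Hive tables the mapper count is normally driven by the input split size rather than by container memory; a hedged sketch of the split-related session settings (for an HBase-backed table the splits typically follow the HBase regions instead, so these may not have the same effect):

```sql
-- Illustrative only: shrink the maximum split size so more mappers are created
-- for file-based input; the values are examples, not recommendations.
set mapreduce.input.fileinputformat.split.maxsize=67108864;  -- 64 MB
set mapreduce.input.fileinputformat.split.minsize=1;
```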

In spark, how does broadcast work?

Submitted by 青春壹個敷衍的年華 on 2019-12-02 21:11:20
This is a very simple question: in Spark, broadcast can be used to send variables to executors efficiently. How does this work? More precisely: when are the values sent, as soon as I call broadcast, or when the values are used? Where exactly is the data sent: to all executors, or only to the ones that will need it? Where is the data stored, in memory or on disk? Is there a difference in how simple variables and broadcast variables are accessed? What happens under the hood when I call the .value method? Short answer: values are sent the first time they are needed in an executor. Nothing
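A minimal Scala sketch of the API being asked about (the app name, sample data and lookup map are made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-demo"))

    // Driver-side lookup table we want executors to reuse across tasks.
    val lookup = Map("a" -> 1, "b" -> 2)

    // broadcast() only registers the value on the driver; per the answer above,
    // an executor fetches it the first time one of its tasks needs it.
    val bLookup = sc.broadcast(lookup)

    // .value is where an executor actually pulls (and then locally caches) the data.
    val total = sc.parallelize(Seq("a", "b", "a"))
      .map(k => bLookup.value.getOrElse(k, 0))
      .sum()

    println(total)
    sc.stop()
  }
}
```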

How to tune spark job on EMR to write huge data quickly on S3

Submitted by 浪尽此生 on 2019-12-02 19:16:37
I have a Spark job where I am doing an outer join between two data frames. The first data frame is 260 GB, in text-file format split into 2200 files, and the second data frame is 2 GB. Writing the data frame output, which is about 260 GB, to S3 takes a very long time (more than 2 hours), after which I cancelled the job because I was being charged heavily on EMR. Here is my cluster info:
emr-5.9.0
Master: m3.2xlarge
Core: r4.16xlarge, 10 machines (each machine has 64 vCores, 488 GiB memory, EBS storage: 100 GiB)
This is my cluster config that I am setting: capacity-scheduler yarn
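As a reference point for the kind of tuning being asked about, a hedged spark-submit sketch sized roughly for r4.16xlarge core nodes; every number is illustrative rather than a recommendation, and the jar name is made up:

```sh
# Illustrative sizing only: a handful of executors per r4.16xlarge core node,
# with shuffle partitions roughly matching the 2200 input files.
spark-submit \
  --deploy-mode cluster \
  --num-executors 60 \
  --executor-cores 5 \
  --executor-memory 36g \
  --conf spark.sql.shuffle.partitions=2200 \
  my_join_job.jar
```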

hdfs command is deprecated in hadoop

Submitted by 半城伤御伤魂 on 2019-12-02 08:41:49
I am following the procedure below: http://www.codeproject.com/Articles/757934/Apache-Hadoop-for-Windows-Platform and https://www.youtube.com/watch?v=VhxWig96dME . While executing the command c:/hadoop-2.3.0/bin/hadoop namenode -format, I got the error message given below:
DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it.
Exception in thread "main" java.lang.NoClassDefFoundError
I am using jdk-6-windows-amd64.exe. How do I solve this issue? Use the command c:/hadoop-2.3.0/bin/hdfs in place of c:/hadoop-2.3.0/bin/hadoop. A lot of hdfs cmds are
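Concretely, the DEPRECATED warning is only pointing at the newer entry point; the same format step with the hdfs script, using the paths given in the question:

```
REM Deprecated entry point (triggers the warning above):
c:/hadoop-2.3.0/bin/hadoop namenode -format

REM Preferred entry point in Hadoop 2.x:
c:/hadoop-2.3.0/bin/hdfs namenode -format
```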

NULL column names in Hive query result

Submitted by 旧街凉风 on 2019-12-02 05:55:13
Question: I have downloaded the weather .txt files from NOAA, which look like:
WBAN,Date,Time,StationType,SkyCondition,SkyConditionFlag,Visibility,VisibilityFlag,WeatherType,WeatherTypeFlag,DryBulbFarenheit,DryBulbFarenheitFlag,DryBulbCelsius,DryBulbCelsiusFlag,WetBulbFarenheit,WetBulbFarenheitFlag,WetBulbCelsius,WetBulbCelsiusFlag,DewPointFarenheit,DewPointFarenheitFlag,DewPointCelsius,DewPointCelsiusFlag,RelativeHumidity,RelativeHumidityFlag,WindSpeed,WindSpeedFlag,WindDirection,WindDirectionFlag
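For reference, a hedged sketch of how a comma-delimited file like this is often mapped onto a Hive external table; the table name and location are made up, only the first few header columns are shown, and the header line is skipped so it does not show up as a data row:

```sql
-- Hypothetical table and HDFS location, shown only to illustrate the mapping.
CREATE EXTERNAL TABLE IF NOT EXISTS noaa_weather (
  wban         STRING,
  `date`       STRING,
  `time`       STRING,
  stationtype  STRING,
  skycondition STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/noaa_weather'
TBLPROPERTIES ('skip.header.line.count'='1');
```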

Yarn : yarn-site.xml changes not taking effect

Submitted by 三世轮回 on 2019-12-02 04:47:55
We have a Spark Streaming application running on HDFS 2.7.3 with YARN as the resource manager. While running the application, these two folders, /tmp/hadoop/data/nm-local-dir/filecache and /tmp/hadoop/data/nm-local-dir/filecache, are filling up and hence the disk. From my research I found that configuring these two properties in yarn-site.xml should help:
<property>
  <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
  <value>2000</value>
</property>
<property>
  <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
  <value>2048</value>
</property>
I have configured them on
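For what it's worth, NodeManagers read yarn-site.xml at startup, so a change like the one above normally only takes effect after the NodeManager is restarted; a sketch using the standard Hadoop 2.x daemon script (the path assumes a default install layout):

```sh
# Run on each NodeManager host; $HADOOP_HOME is assumed to point at the Hadoop install.
$HADOOP_HOME/sbin/yarn-daemon.sh stop nodemanager
$HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager
```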