MapReduce

Getting java.lang.ClassCastException: class java.lang.String when running a simple MapReduce program

跟風遠走 Submitted on 2019-12-07 19:07:44
Question: I am trying to execute a simple MapReduce program in which the map step takes the input and splits it into two parts (key => String, value => Integer) and the reducer sums up the values for each key. I am getting a ClassCastException every time and cannot work out what in the code is causing this error. My code:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache
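The excerpt is cut off before the mapper and reducer bodies, but this particular ClassCastException is most often caused by emitting plain java.lang.String / java.lang.Integer instead of Hadoop's Writable wrappers, or by declaring types in the driver that do not match what is actually emitted. A minimal sketch of a type-consistent mapper/reducer pair, assuming tab-separated "word<TAB>number" input lines (the class and field names are illustrative, not taken from the question):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical mapper: emits Writable types (Text / IntWritable),
    // never java.lang.String or java.lang.Integer.
    public class SumMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");   // assumed "word<TAB>number" layout
            context.write(new Text(parts[0]), new IntWritable(Integer.parseInt(parts[1])));
        }
    }

    // Hypothetical reducer: sums the IntWritable values for each key.
    class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

The driver then has to declare the same types, e.g. job.setMapOutputKeyClass(Text.class), job.setMapOutputValueClass(IntWritable.class), job.setOutputKeyClass(Text.class) and job.setOutputValueClass(IntWritable.class); a mismatch there is another common source of this exception.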

Problem with Hadoop Streaming -file option for Java class files

泄露秘密 Submitted on 2019-12-07 18:58:01
Question: I am struggling with a very basic issue with the "-file" option in Hadoop Streaming. First I tried the very basic streaming example:

    hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar \
        -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
        -reducer /bin/wc -inputformat KeyValueTextInputFormat \
        -input gutenberg/* -output gutenberg-outputtstchk22

which worked absolutely fine. Then I copied the IdentityMapper.java source code and compiled it. Then I

Which protocol does the Hadoop shuffle use?

假如想象 Submitted on 2019-12-07 18:52:08
Question: During the shuffle stage of a Hadoop job, the mapped data is transferred across the nodes of the cluster according to the partitions assigned to the reducers. What protocol does Hadoop use to shuffle data across nodes for the reduce stage? Answer 1: I really laughed the first time, but the whole shuffling and merging is done by an HttpServlet. You can see this in the TaskTracker source code, in the nested class MapOutputServlet. It gets an HTTP request with the IDs of the tasks and jobs and

Protobuf RPC not available on Hadoop 2.2.0 single node server?

戏子无情 Submitted on 2019-12-07 18:47:48
Question: I am trying to run a Hadoop 2.2.0 MapReduce job on my local single-node cluster, installed by following this tutorial: http://codesfusion.blogspot.co.at/2013/10/setup-hadoop-2x-220-on-ubuntu.html?m=1 However, on the server side the following exception is thrown:

    org.apache.hadoop.ipc.RpcNoSuchProtocolException: Unknown protocol: org.apache.hadoop.yarn.api.ApplicationClientProtocolPB
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.getProtocolImpl(ProtobufRpcEngine.java:527)

[Hadoop] 20. MapReduce: the InputFormat input-split mechanism

拈花ヽ惹草 Submitted on 2019-12-07 18:47:37
Overview: In this chapter you will learn about:
- the job submission flow
- the mechanism by which InputFormat splits the input data

1. Source-code analysis of the job submission flow

1) Detailed walk-through of the job submission source code:

    waitForCompletion()
    submit();
        // 1. Establish the connection
        connect();
            // 1) Create the proxy used to submit the job
            new Cluster(getConfiguration());
                // (1) Decide whether this is local or remote YARN
                initialize(jobTrackAddr, conf);
        // 2. Submit the job
        submitter.submitJobInternal(Job.this, cluster)
            // 1) Create the staging path used to submit data to the cluster
            Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf);
            // 2) Get the job ID and create the job path
            JobID jobId = submitClient.getNewJobID();
            // 3) Copy the jar to the cluster
            copyAndConfigureFiles(job, submitJobDir);
            rUploader.uploadFiles(job, jobSubmitDir);
            // 4) Compute the input splits and generate the split plan file
            writeSplits(job, submitJobDir);
            maps =
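The trace above starts at waitForCompletion(); for context, a minimal driver that triggers exactly this submission path might look like the following identity job (the class name and path arguments are illustrative, not part of the original article):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical identity job: its only purpose is to show where the traced flow begins.
    public class SubmitFlowDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "submit-flow-demo");
            job.setJarByClass(SubmitFlowDemo.class);
            job.setMapperClass(Mapper.class);        // identity mapper
            job.setReducerClass(Reducer.class);      // identity reducer
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // waitForCompletion() -> submit() -> connect() / submitJobInternal() -> writeSplits(), as traced above
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }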

Number of map and reduce tasks in pseudo-distributed mode

假装没事ソ Submitted on 2019-12-07 18:39:40
Question: I am a newbie to Hadoop. I have successfully configured a Hadoop setup in pseudo-distributed mode. Now I would like to know the logic behind choosing the number of map and reduce tasks. What do we refer to? Thanks. Answer 1: You cannot generalize how the number of mappers/reducers is to be set. Number of mappers: you cannot explicitly set the number of mappers to a certain number (there are parameters for this, but they do not take effect). It is decided by the number of input splits created by
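The answer is cut off, but the usual levers are: the number of reducers can be set directly on the job, while the number of mappers can only be influenced indirectly through the input split size. A minimal sketch, assuming the new mapreduce API and a FileInputFormat-based job (the concrete numbers are arbitrary examples):

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class TaskCountHints {
        public static void configure(Job job) {
            // Reducers: this setting is honoured directly by the framework.
            job.setNumReduceTasks(4);

            // Mappers: one map task runs per input split, so shrinking the maximum
            // split size tends to create more splits and therefore more mappers.
            FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
            FileInputFormat.setMinInputSplitSize(job, 16L * 1024 * 1024);   // 16 MB
        }
    }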

MongoDB C# driver 2.0: How to get the result from MapReduceAsync

浪子不回头ぞ Submitted on 2019-12-07 17:38:57
Question: I'm using MongoDB version 3 with the C# driver 2.0 and would like to get the result of the MapReduceAsync method. I have this collection "users":

    { "_id" : 1, "firstName" : "Rich", "age" : "18" }
    { "_id" : 2, "firstName" : "Rob", "age" : "25" }
    { "_id" : 3, "firstName" : "Sarah", "age" : "12" }

The code in Visual Studio:

    var map = new BsonJavaScript(@"
        var map = function() { emit(NumberInt(1), this.age); };");
    var reduce = new BsonJavaScript(@" var

How to normalize/reduce time data in MongoDB?

a 夏天 Submitted on 2019-12-07 16:31:04
Question: I'm storing per-minute performance data in MongoDB. Each collection is a type of performance report, and each document is the measurement at that point in time for a port on the array:

    {
        "DateTime" : ISODate("2012-09-28T15:51:03.671Z"),
        "array_serial" : "12345",
        "Port Name" : "CL1-A",
        "metric" : 104.2
    }

There can be up to 128 different "Port Name" entries per "array_serial". As the data ages I'd like to be able to average it out over increasing time spans:

    Up to 1 Week : minute
    1 Week to 1
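The excerpt ends before the actual question, but the rollup it describes (for example, per-hour averages of the per-minute documents) can also be expressed with MongoDB's aggregation framework rather than map-reduce. A sketch using the modern MongoDB Java driver, reusing the field names from the document above; the connection string, database and collection names are assumptions:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;
    import java.util.Arrays;

    // Groups the per-minute documents into hourly buckets per array/port and averages the metric.
    public class HourlyRollup {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> perf =
                        client.getDatabase("perfdb").getCollection("port_metrics");   // names assumed

                Document groupId = new Document("array_serial", "$array_serial")
                        .append("port", "$Port Name")
                        .append("year", new Document("$year", "$DateTime"))
                        .append("dayOfYear", new Document("$dayOfYear", "$DateTime"))
                        .append("hour", new Document("$hour", "$DateTime"));

                for (Document doc : perf.aggregate(Arrays.asList(
                        new Document("$group", new Document("_id", groupId)
                                .append("avgMetric", new Document("$avg", "$metric")))))) {
                    System.out.println(doc.toJson());
                }
            }
        }
    }

The hourly (or daily) averages produced this way can then be written to a separate rollup collection while the raw per-minute documents are expired or deleted.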

Saving JSON data in HDFS in Hadoop

一笑奈何 Submitted on 2019-12-07 16:19:23
Question: I have the following Reducer class:

    public static class TokenCounterReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            JSONObject jsn = new JSONObject();
            for (Text value : values) {
                String[] vals = value.toString().split("\t");
                String[] targetNodes = vals[0].toString().split(",", -1);
                jsn.put("source", vals[1]);
                jsn.put("target", targetNodes);
            }
            // context.write(key, new Text(sum));
        }
    }
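The excerpt stops before the question body, but the commented-out write hints at the missing step: to land the JSON in HDFS, the reducer has to serialize the JSONObject and emit it through the context. A minimal sketch, assuming the JSONObject shown above and that the key should be kept:

    // Inside reduce(), after the loop: emit the JSON string as the output value.
    // With the default TextOutputFormat this ends up as "key<TAB>json" lines in the
    // job's HDFS output directory. NullWritable could replace the key if only the
    // JSON payload is wanted (which would also require changing the reducer's output key type).
    context.write(key, new Text(jsn.toString()));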

Hadoop: how do map-reduce tasks know which part of a file to handle?

扶醉桌前 Submitted on 2019-12-07 15:50:12
Question: I've been starting to learn Hadoop, and currently I'm trying to process log files that are not too well structured, in that the value I normally use as the M/R key typically appears (once) at the top of the file. So basically my map function takes that value as the key and then scans the rest of the file to aggregate the values that need to be reduced. A [fake] log might look like this:

    ## log.1
    SOME-KEY
    2012-01-01 10:00:01 100
    2012-01-02 08:48:56 250
    2012-01-03 11:01:56 212
    .... many more
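The excerpt stops here, but the approach it describes (remember the header value, then emit every subsequent data line under that key) can be sketched as a mapper. Note that this only works as written when the whole file is handled by a single map task, e.g. a non-splittable input or one small file per split, which is presumably what the rest of the question is about; the class name and line layout below are assumptions based on the fake log:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper: the first non-comment line of the file becomes the key
    // for every following "date time value" line.
    public class HeaderKeyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private String currentKey = null;

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String text = line.toString().trim();
            if (text.isEmpty() || text.startsWith("##")) {
                return;                      // skip blank lines and comments such as "## log.1"
            }
            if (currentKey == null) {
                currentKey = text;           // e.g. "SOME-KEY"
                return;
            }
            String[] parts = text.split("\\s+");
            int value = Integer.parseInt(parts[parts.length - 1]);   // trailing numeric column
            context.write(new Text(currentKey), new IntWritable(value));
        }
    }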