MapReduce

ClassCastException with the new Hadoop API

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-13 00:35:21
问题 Question: I have been trying to put together some simple code using the MapReduce framework. Previously I had implemented it using the mapred package, where I was able to specify the input format class as KeyValueTextInputFormat. But in the new API using the mapreduce package this class is not present. I tried using TextInputFormat.class, but I still get the following exception - job_local_0001 java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text at com.hp.hpl
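The exception usually means the mapper's declared input key type does not match what the input format actually produces: TextInputFormat keys each line by its byte offset, a LongWritable, not a Text. A minimal sketch of the usual fix (the class name and output types here are illustrative, not from the original code):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class OffsetKeyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        // With TextInputFormat the input key is the line's byte offset (LongWritable);
        // declaring Mapper<Text, Text, ...> is what triggers the LongWritable -> Text cast error.
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(value.toString()), new IntWritable(1));
        }
    }

Alternatively, Hadoop 2.x ships org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat for the new API; with job.setInputFormatClass(KeyValueTextInputFormat.class) the mapper takes Text keys again, matching the old mapred behaviour.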

How do I get a list of MongoDB documents that are referenced inside another collection

我怕爱的太早我们不能终老 submitted on 2019-12-13 00:35:06
Question: I am trying to find a way to get a list of MongoDB documents that are referenced in a subdocument in another collection. I have a collection with user documents. In another collection I keep a list of businesses. Every business has a subdocument containing a list of references to users. The User collection: /* user-1 */ { "_id" : ObjectId("54e5e78680c7e191218b49b0"), "username" : "jachim@example.com", "password" : "$2y$13$21p6hx3sd200cko4o0w04u46jNv3tNl3qpVWVbnAyzZpDxsSVDDLS" } /* user-2 */ { "
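The question is cut off, so here is only a generic sketch of the usual two-step approach with the MongoDB Java driver: collect the ObjectIds referenced by the businesses, then fetch the matching users with $in. The collection and field names (businesses, users) are assumptions, not taken from the original schema:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import com.mongodb.client.model.Filters;
    import org.bson.Document;
    import org.bson.types.ObjectId;

    public class ReferencedUsers {
        public static void main(String[] args) {
            MongoDatabase db = MongoClients.create("mongodb://localhost:27017").getDatabase("test");
            MongoCollection<Document> businesses = db.getCollection("businesses");
            MongoCollection<Document> users = db.getCollection("users");

            // 1) Gather every user ObjectId referenced from the business documents.
            Set<ObjectId> referenced = new HashSet<>();
            for (Document business : businesses.find()) {
                @SuppressWarnings("unchecked")
                List<ObjectId> ids = (List<ObjectId>) business.get("users"); // assumed field name
                if (ids != null) referenced.addAll(ids);
            }

            // 2) Fetch the user documents whose _id appears in that set.
            for (Document user : users.find(Filters.in("_id", referenced))) {
                System.out.println(user.getString("username"));
            }
        }
    }

On MongoDB 3.2+ a single $lookup aggregation stage can perform the same join server-side instead of in the client.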

mongodb mapreduce scope - ReferenceError

空扰寡人 submitted on 2019-12-13 00:12:41
Question: I'm trying to use an external object inside MongoDB map/reduce functions. If the object has a variable that it should access, an error occurs. For example: var conn = new Mongo(); var db = conn.getDB("test"); var HelperClass = function() { var v = [1, 2, 3]; this.data = function() { return v; }; }; var helper = new HelperClass(); var map = function() { helper.data().forEach(function(value) { emit(value, 1); }); }; var reduce = function(key, values) { var count = 0; values.forEach(function
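The excerpt suggests the map function closes over helper, but mapReduce's map and reduce functions run inside the server's JavaScript engine, where client-side closures do not exist; only plain values passed through the scope option are visible there. A minimal sketch of passing the data (rather than the helper object) via scope, using the synchronous MongoDB Java driver to stay consistent with the other examples on this page (the mapReduce helper is deprecated in recent driver versions; the collection name test is assumed):

    import java.util.Arrays;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import org.bson.Document;

    public class ScopeExample {
        public static void main(String[] args) {
            MongoDatabase db = MongoClients.create("mongodb://localhost:27017").getDatabase("test");
            MongoCollection<Document> coll = db.getCollection("test");

            // v is injected into the server-side JS environment via scope; a HelperClass
            // instance (a closure) cannot be serialized across, hence the ReferenceError.
            String map = "function() { v.forEach(function(value) { emit(value, 1); }); }";
            String reduce = "function(key, values) { return Array.sum(values); }";

            for (Document d : coll.mapReduce(map, reduce)
                                  .scope(new Document("v", Arrays.asList(1, 2, 3)))) {
                System.out.println(d.toJson());
            }
        }
    }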

Group documents by their key in MongoDB MapReduce

北城以北 submitted on 2019-12-12 23:17:28
Question: I am trying a MapReduce program in MongoDB. I have a MongoDB collection with documents of the following shape: { "_id" : ObjectId("57aea85af405910cfcd2bfeb"), "friendList" : [ "Karma", " Tom", " Ram", " Bindu", " Shiva", " Kishna", " Bikash", " Bakshi", " Dinesh" ], "user" : "Hari" } { "_id" : ObjectId("57aea85bf405910cfcd2bfec"), "friendList" : [ "Karma", " Sita", " Bakshi", " Hanks", " Shyam", " Bikash" ], "user" : "Howard" } { "_id" : ObjectId("57aea85cf405910cfcd2bfed"), "friendList" : [ "Dinesh", " Ram",
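The question is truncated, so this is only one plausible reading: emit each friend name as the key so that mapReduce groups every user sharing that friend. Note the sample data has leading spaces in some names (" Tom"), hence the trim(). The collection name friends and the Java driver usage are assumptions:

    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoDatabase;
    import org.bson.Document;

    public class GroupByFriend {
        public static void main(String[] args) {
            MongoDatabase db = MongoClients.create("mongodb://localhost:27017").getDatabase("test");

            // emit()'s first argument is the grouping key: every emit with the same friend
            // name ends up in one reduce call, giving one output document per friend.
            String map = "function() { this.friendList.forEach(function(f) { emit(f.trim(), 1); }); }";
            String reduce = "function(key, values) { return Array.sum(values); }";

            for (Document doc : db.getCollection("friends").mapReduce(map, reduce)) {
                System.out.println(doc.toJson()); // e.g. { "_id" : "Karma", "value" : 2.0 }
            }
        }
    }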

Submitting a Hadoop job

杀马特。学长 韩版系。学妹 submitted on 2019-12-12 21:07:32
Question: I need to constantly get the mappers' and reducers' running times. I have submitted the job as follows. JobClient jobclient = new JobClient(conf); RunningJob runjob = jobclient.submitJob(conf); TaskReport [] maps = jobclient.getMapTaskReports(runjob.getID()); long mapDuration = 0; for(TaskReport rpt: maps){ mapDuration += rpt.getFinishTime() - rpt.getStartTime(); } However, when I run the program, it seems like the job is not submitted and the mapper never starts. How can I use JobClient.runJob
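One common pitfall with this pattern (it may or may not be the cause here) is that JobClient.submitJob returns immediately, so the task reports are read before any mapper has had a chance to run; JobClient.runJob, by contrast, blocks until the job finishes. A minimal polling sketch with the old mapred API (job configuration elided):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;
    import org.apache.hadoop.mapred.TaskReport;

    public class TimedSubmit {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf();                    // configure mapper, reducer, paths, etc.
            JobClient jobClient = new JobClient(conf);
            RunningJob running = jobClient.submitJob(conf);  // returns immediately; the job runs asynchronously
            while (!running.isComplete()) {                  // wait before reading the task reports
                Thread.sleep(5000);
            }
            long mapDuration = 0;
            for (TaskReport rpt : jobClient.getMapTaskReports(running.getID())) {
                mapDuration += rpt.getFinishTime() - rpt.getStartTime();
            }
            System.out.println("total map time (ms): " + mapDuration);
        }
    }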

RavenDB indexing errors

微笑、不失礼 submitted on 2019-12-12 19:19:16
Question: I'm just getting started with Raven and an index I've created keeps failing to index anything. I've found a lot of errors on the Raven server that look like this: { Index: "HomeBlurb/IncludeTotalCosts", Error: "Cannot implicitly convert type 'double' to 'int'. An explicit conversion exists (are you missing a cast?)", Timestamp: "2012-01-14T15:40:40.8943226Z", Document: null } The index I've created looks like this: public class HomeBlurb_IncludeTotalCosts : AbstractIndexCreationTask

Distributed Cache and performance in Hadoop

天涯浪子 submitted on 2019-12-12 19:08:28
Question: I want to make my understanding of the Hadoop distributed cache clear. I know that when we add files to the distributed cache, the files get loaded onto the disk of every node in the cluster. So how does the data of the files get transmitted to all the nodes in the cluster? Is it through the network? If so, will it not cause a strain on the network? I have the following thoughts; are they correct? If the files are large, won't there be network congestion? If the number of nodes is large, even though
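For reference, adding a file to the distributed cache with the new API looks like the sketch below (the HDFS path is illustrative). The file is shipped over the network from HDFS to each worker node's local disk once per job, not once per task, which is what keeps the network cost bounded:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class CacheSetup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "cache-demo");
            // Copied from HDFS to every worker node's local disk once per job;
            // each task on that node then reads the local copy.
            job.addCacheFile(new URI("hdfs:///user/hduser/lookup.txt")); // illustrative path
            // ... set mapper/reducer/input/output as usual, then job.waitForCompletion(true);
        }
    }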

Faster implementation for reduceByKey on Seq of pairs possible?

馋奶兔 submitted on 2019-12-12 18:42:02
Question: The code below contains various single-threaded implementations of reduceByKeyXXX methods and a few helper methods to create input sets and measure execution times. (Feel free to run the main method.) The main purpose of reduceByKey (as in Spark) is to reduce key-value pairs with the same key. Example: scala> val xs = Seq( "a" -> 2, "b" -> 3, "a" -> 5) xs: Seq[(String, Int)] = List((a,2), (b,3), (a,5)) scala> ReduceByKeyComparison.reduceByKey(xs, (x:Int, y:Int) ⇒ x+y ) res8: Seq[(String, Int)
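The original code is Scala; to stay consistent with the Java examples on this page, here is only the core single-pass idea, accumulating into a mutable hash map, which is typically the fastest of the straightforward single-threaded variants:

    import java.util.AbstractMap.SimpleEntry;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ReduceByKeyDemo {
        // Sums the values of pairs sharing a key in a single pass over the input.
        static List<Map.Entry<String, Integer>> reduceByKey(List<Map.Entry<String, Integer>> pairs) {
            Map<String, Integer> acc = new HashMap<>();
            for (Map.Entry<String, Integer> p : pairs) {
                acc.merge(p.getKey(), p.getValue(), Integer::sum);
            }
            return new ArrayList<>(acc.entrySet());
        }

        public static void main(String[] args) {
            List<Map.Entry<String, Integer>> xs = Arrays.<Map.Entry<String, Integer>>asList(
                    new SimpleEntry<>("a", 2), new SimpleEntry<>("b", 3), new SimpleEntry<>("a", 5));
            System.out.println(reduceByKey(xs)); // [a=7, b=3] (order not guaranteed)
        }
    }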

Debug MapReduce (of Hadoop 2.2 or higher) in Eclipse

烂漫一生 submitted on 2019-12-12 17:58:41
Question: I am able to debug MapReduce (on Hadoop 1.2.1) in Eclipse by following the steps in http://www.thecloudavenue.com/2012/10/debugging-hadoop-mapreduce-program-in.html. But how do I debug MapReduce (on Hadoop 2.2 or higher) in Eclipse? Answer 1: You can debug in the same way. You just run your MapReduce code in standalone mode and use Eclipse to debug the MR code like any Java code. Answer 2: Here are the steps I set up in Eclipse. Environment: Ubuntu 16.04.2, Eclipse Neon.3 Release (4.6.3RC2), jdk1.8.0_121. I did
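Expanding on Answer 1, standalone (local) mode can be forced from the driver itself, so the whole job runs inside the Eclipse JVM and ordinary breakpoints work in the mapper and reducer. A minimal sketch (the configuration keys are standard Hadoop 2.x ones; the job wiring is elided):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class LocalDebugDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "file:///");           // read/write the local file system
            conf.set("mapreduce.framework.name", "local");  // run map and reduce tasks in-process
            Job job = Job.getInstance(conf, "eclipse-debug");
            // job.setJarByClass(...), setMapperClass(...), input/output paths as usual
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }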

Copying files from HDFS to the local file system with Java

江枫思渺然 submitted on 2019-12-12 17:07:20
Question: I am trying to copy files from HDFS to the local filesystem for preprocessing. The code below should work according to the documentation. Although it doesn't give any error messages and the MapReduce job runs smoothly, I cannot see any output on my local hard drive. What do you think the problem is? Thanks. try { Path phdfs_input = new Path("hdfs://master:54310/user/hduser/conninput/"+value.toString()); Path plocal_input = new Path("/home/hduser/Desktop/"+value.toString()); FileSystem fs =
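The excerpt stops before the copy call, so this is only a sketch of the usual pattern: obtain the FileSystem from the HDFS URI explicitly and use copyToLocalFile. One possible explanation for the missing output, if the copy runs inside a map task on a cluster, is that the "local" destination is the worker node's disk rather than the machine being checked. Paths here are illustrative:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsToLocal {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Resolve the FileSystem from the HDFS URI explicitly, not from the default FS.
            FileSystem hdfs = FileSystem.get(new URI("hdfs://master:54310"), conf);
            Path src = new Path("/user/hduser/conninput/part-00000");  // illustrative file
            Path dst = new Path("/home/hduser/Desktop/part-00000");
            hdfs.copyToLocalFile(false, src, dst);   // false: keep the source in HDFS
            System.out.println("Copied " + src + " to " + dst);
        }
    }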