MapReduce

CouchDB map/reduce view: counting only the most recent items

[亡魂溺海] Submitted on 2019-12-08 04:07:43
Question: I have the following documents, time-stamped positions of keywords: { _id: willem-aap-1234, keyword: aap, position: 10, profile: { name: willem }, created_at: 1234 }, { _id: willem-aap-2345, keyword: aap, profile: { name: willem }, created_at: 2345 }, { _id: oliver-aap-1235, keyword: aap, profile: { name: oliver }, created_at: 1235 }, { _id: oliver-aap-2346, keyword: aap, profile: { name: oliver }, created_at: 2346 }. Finding the most recent keywords per profile.name can be done with: map: function
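The question is cut off at the map function. A minimal sketch of one way such a view could look, assuming the goal is the most recent created_at per (keyword, profile.name); the field handling is illustrative, not the asker's actual view:

```javascript
// Map: key each document by [keyword, profile name], value is its timestamp.
function (doc) {
  if (doc.keyword && doc.profile && doc.created_at) {
    emit([doc.keyword, doc.profile.name], doc.created_at);
  }
}

// Reduce: keep only the largest (most recent) timestamp per key.
function (keys, values, rereduce) {
  return Math.max.apply(null, values);
}
```

Querying the view with group=true then yields one row per [keyword, name] key holding its latest created_at.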

MapReduce using MongoDB Java Driver fails with wrong type for BSONElement assertion

 ̄綄美尐妖づ Submitted on 2019-12-08 04:00:35
Question: I'm pretty new to MongoDB and MapReduce. I need to do some MapReduce on a collection in my DB. The MAP and REDUCE_MAX functions work, since I was able to accomplish what I need in the Mongo interactive shell (v1.8.2). However, I get an error when trying to perform the same thing using the Mongo Java Driver (v2.6.3). My MAP and REDUCE_MAX functions look like this: String MAP = "function(){" + "if(this.type != \"checkin\"){return;}" + "if(!this.venue && !this.venue.id){return;}" + "emit({userId:this
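The snippet is truncated, so this is not the asker's exact code; below is a hedged sketch of how a map/reduce call is typically issued through the 2.x Java driver using MapReduceCommand, with a plain counting reduce standing in for the asker's REDUCE_MAX and with database, collection, and output names invented for illustration:

```java
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MapReduceCommand;
import com.mongodb.MapReduceOutput;
import com.mongodb.Mongo;

public class CheckinMapReduce {
    public static void main(String[] args) throws Exception {
        Mongo mongo = new Mongo("localhost", 27017);
        DB db = mongo.getDB("test");                      // hypothetical database
        DBCollection coll = db.getCollection("checkins"); // hypothetical collection

        // The map/reduce functions are plain JavaScript passed as complete strings.
        String map = "function() {"
                   + "  if (this.type != 'checkin') { return; }"
                   + "  if (!this.venue || !this.venue.id) { return; }"
                   + "  emit({ userId: this.userId, venueId: this.venue.id }, 1);"
                   + "}";
        String reduce = "function(key, values) {"
                      + "  var total = 0;"
                      + "  for (var i = 0; i < values.length; i++) { total += values[i]; }"
                      + "  return total;"
                      + "}";

        MapReduceCommand cmd = new MapReduceCommand(
                coll, map, reduce, "checkin_counts",       // hypothetical output collection
                MapReduceCommand.OutputType.REPLACE, null);

        MapReduceOutput out = coll.mapReduce(cmd);
        for (DBObject result : out.results()) {
            System.out.println(result);
        }
    }
}
```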

How to make your mapper write to the local file system in Hadoop

一笑奈何 Submitted on 2019-12-08 04:00:03
Question: I wish to write a file and create a directory in my local file system through my MapReduce code. Also, if I create a directory in the working directory during job execution, how can I move it to my local file system before the cleanup? Answer 1: As your mapper runs on some/any machine in your cluster, you can of course use basic Java file operations to write files. You can use org.apache.hadoop.hdfs.DFSClient to access any files on HDFS and copy them to a local file (I'd suggest you copy inside
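A rough sketch of what the answer describes, assuming the new-API Mapper and invented paths: plain java.io calls write to the local disk of whichever node runs the task, and FileSystem.copyToLocalFile (used here instead of the DFSClient route the answer mentions) pulls an HDFS file down to that node before the task ends:

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LocalWritingMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Writes land on the local filesystem of the node running this task,
        // not on the machine that submitted the job.
        File dir = new File("/tmp/mapper-output");               // hypothetical local path
        if (!dir.exists()) {
            dir.mkdirs();
        }
        try (BufferedWriter out = new BufferedWriter(
                new FileWriter(new File(dir, "lines.txt"), true))) {
            out.write(value.toString());
            out.newLine();
        }
        context.write(new Text("written"), value);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Copy a file from HDFS to the task node's local filesystem before the task finishes.
        FileSystem hdfs = FileSystem.get(context.getConfiguration());
        hdfs.copyToLocalFile(new Path("/user/example/some-hdfs-file"),   // hypothetical
                             new Path("/tmp/mapper-output/copied-file"));
    }
}
```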

YARN: parsing job logs stored in HDFS

我只是一个虾纸丫 Submitted on 2019-12-08 03:38:13
Question: Is there any parser which I can use to parse the JSON present in YARN job logs (.jhist files) stored in HDFS, to extract information from them? Answer 1: The second line in the .jhist file is the Avro schema for the other JSON records in the file, meaning that you can create Avro data out of the jhist file. For this you could use avro-tools-1.7.7.jar:
# the schema is the second line
sed -n '2p;3q' file.jhist > schema.avsc
# remove the first two lines
sed '1,2d' file.jhist > pfile.jhist
# finally
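The answer breaks off after "# finally". A plausible final step, assuming the intent is to turn the stripped JSON records into an Avro data file using the extracted schema (file names follow the answer's examples):

```sh
# convert the JSON records into an Avro data file using the extracted schema
java -jar avro-tools-1.7.7.jar fromjson --schema-file schema.avsc pfile.jhist > file.avro

# the Avro file can then be inspected or fed to any Avro-aware tool
java -jar avro-tools-1.7.7.jar tojson file.avro | head
```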

Load Native Shared Libraries in an HBase MapReduce task

和自甴很熟 Submitted on 2019-12-08 02:56:06
Question: Recently I have been trying to implement my algorithm in JNI code (using C++). I did that and generated a shared library. Here is my JNI class: public class VideoFeature{ // JNI Code Begin public native static float Match(byte[] testFileBytes, byte[] tempFileBytes); static { System.loadLibrary("JVideoFeatureMatch"); } // JNI Code End } In the main function, I write: // MapReduce Configuration conf = HBaseConfiguration.create(); // DistributedCache shared library DistributedCache.createSymlink(conf);
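As a sketch of the usual pattern around those two lines (the HDFS path is invented and this is not necessarily the asker's full setup): the library is shipped through the DistributedCache with a symlink into each task's working directory, and the task JVM is told to look there for native libraries so System.loadLibrary can resolve it:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class NativeLibJobSetup {

    public static Configuration buildConf() throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Ship the shared library to every task; the fragment after '#'
        // becomes the symlink name in the task's working directory.
        DistributedCache.createSymlink(conf);
        DistributedCache.addCacheFile(
                new URI("hdfs:///libs/libJVideoFeatureMatch.so#libJVideoFeatureMatch.so"), // hypothetical path
                conf);

        // Point the child JVM's java.library.path at its working directory
        // so System.loadLibrary("JVideoFeatureMatch") finds the symlinked .so.
        conf.set("mapred.child.java.opts", "-Djava.library.path=.");
        return conf;
    }
}
```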

MRjob: Can a reducer perform 2 operations?

血红的双手。 Submitted on 2019-12-08 02:53:53
Question: I am trying to yield the probability of each key/value pair generated by the mapper. So, let's say the mapper yields: a, (r, 5) a, (e, 6) a, (w, 7) I need to add 5+6+7 = 18 and then find the probabilities 5/18, 6/18, 7/18, so that the final output from the reducer would look like: a, [[r, 5, 0.278], [e, 6, 0.33], [w, 7, 0.389]] So far, I can only get the reducer to sum all the integers from the values. How can I make it go back and divide each instance by the total sum? Thanks! Answer 1: Pai's solution is
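The accepted answer is cut off, but one straightforward way to do this in MRjob is to buffer each key's values in the reducer so the total is known before any probability is emitted; a minimal sketch, assuming a made-up "key letter count" input format:

```python
from mrjob.job import MRJob


class MRKeyProbabilities(MRJob):
    """Per key, emit each value together with its share of the key's total."""

    def mapper(self, _, line):
        # Hypothetical input: one "key letter count" triple per line.
        key, letter, count = line.split()
        yield key, (letter, int(count))

    def reducer(self, key, values):
        # Buffer all pairs for this key so the total is known up front.
        pairs = list(values)
        total = sum(count for _, count in pairs)
        yield key, [[letter, count, round(count / float(total), 3)]
                    for letter, count in pairs]


if __name__ == '__main__':
    MRKeyProbabilities.run()
```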

Hadoop removes MapReduce history when it is restarted

混江龙づ霸主 Submitted on 2019-12-08 02:49:35
Question: I am carrying out several Hadoop tests using the TestDFSIO and TeraSort benchmark tools. I am basically testing with different numbers of datanodes in order to assess the linearity of the processing capacity and datanode scalability. During the above-mentioned process, I have obviously had to restart the whole Hadoop environment several times. Every time I restarted Hadoop, all MapReduce jobs were removed and the job counter started again from "job_2013*_0001". For comparison reasons, it is very important
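No answer is included in this excerpt. Purely as a hedged pointer, Hadoop 1.x had JobTracker settings in mapred-site.xml for persisting completed-job status to DFS so it survives a restart; the property names below are from memory for that generation and should be checked against your release's mapred-default.xml:

```xml
<!-- mapred-site.xml: persist completed-job status so it survives a JobTracker restart -->
<property>
  <name>mapred.job.tracker.persist.jobstatus.active</name>
  <value>true</value>
</property>
<property>
  <name>mapred.job.tracker.persist.jobstatus.hours</name>
  <value>24</value> <!-- how long the persisted status is retained -->
</property>
<property>
  <name>mapred.job.tracker.persist.jobstatus.dir</name>
  <value>/jobtracker/jobsInfo</value> <!-- DFS directory holding the persisted status -->
</property>
```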

Reducers for Hive data

感情迁移 Submitted on 2019-12-07 23:31:23
Question: I'm a novice. I'm curious to know how reducers are set for different Hive data sets. Is it based on the size of the data processed, or is there a default number of reducers for everything? For example, how many reducers does 5 GB of data require? Will the same number of reducers be set for a smaller data set? Thanks in advance! Cheers! Answer 1: In open-source Hive (and likely in EMR): # reducers = (# bytes of input to mappers) / (hive.exec.reducers.bytes.per.reducer). The default hive.exec.reducers.bytes.per.reducer is 1G. Number of
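Applying that formula to the 5 GB example from the question, and assuming the default of roughly 1 GB per reducer, gives about five reducers; the relevant knobs can also be set per session (the values below are illustrative):

```sql
-- ~5 GB of mapper input / ~1 GB per reducer  =>  about 5 reducers
SET hive.exec.reducers.bytes.per.reducer=1000000000;  -- bytes handled by each reducer
SET hive.exec.reducers.max=999;                       -- upper bound on the reducer count
SET mapred.reduce.tasks=5;                            -- or force an explicit number
```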

Logging the output of a MapReduce job to a text file

独自空忆成欢 Submitted on 2019-12-07 23:25:47
Question: I've been using the JobClient.monitorAndPrintJob() method to print the output of a MapReduce job to the console. My usage is something like this: job_client.monitorAndPrintJob(job_conf, job_client.getJob(j.getAssignedJobID())) The output of which is as follows (printed on the console):
13/03/04 07:20:00 INFO mapred.JobClient: Running job: job_201302211725_10139
13/03/04 07:20:01 INFO mapred.JobClient: map 0% reduce 0%
13/03/04 07:20:08 INFO mapred.JobClient: map 100% reduce 0%
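One way to capture those lines in a text file, since monitorAndPrintJob reports progress through the JobClient log4j logger: attach a FileAppender to that logger around the call. A hedged sketch with an invented output path:

```java
import java.io.IOException;

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.log4j.FileAppender;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;

public class JobProgressToFile {

    public static void monitorToFile(JobClient jobClient, JobConf jobConf, RunningJob job)
            throws IOException, InterruptedException {
        // monitorAndPrintJob logs "Running job" / "map x% reduce y%" through this logger,
        // so a FileAppender on it captures the same lines that appear on the console.
        Logger jobClientLog = Logger.getLogger(JobClient.class);
        FileAppender appender = new FileAppender(
                new PatternLayout("%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n"),
                "/tmp/job-progress.log",   // hypothetical output file
                true);                     // append
        jobClientLog.addAppender(appender);
        try {
            jobClient.monitorAndPrintJob(jobConf, job);
        } finally {
            jobClientLog.removeAppender(appender);
            appender.close();
        }
    }
}
```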

Number of MapReduce tasks

孤街醉人 Submitted on 2019-12-07 21:40:16
Question: I need some help with how to get the correct number of map and reduce tasks in my application. Is there any way to discover this number? Thanks. Answer 1: It is not possible to get the actual number of map and reduce tasks for an application before its execution, since factors such as task failures followed by re-attempts and speculative execution attempts cannot be accurately determined prior to execution; only an approximate number of tasks can be derived. The total number of Map tasks for
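The answer is truncated; as a rough illustration of how the actual counts can be read back once a job has run (assuming the Hadoop 2.x mapreduce API; the identity job and paths are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TaskCountExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "task-count-demo");  // identity map/reduce is enough here
        job.setJarByClass(TaskCountExample.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);

        // After completion the framework's counters hold the real numbers,
        // including extra attempts launched for failed or speculative tasks.
        Counters counters = job.getCounters();
        long maps = counters.findCounter(JobCounter.TOTAL_LAUNCHED_MAPS).getValue();
        long reduces = counters.findCounter(JobCounter.TOTAL_LAUNCHED_REDUCES).getValue();
        System.out.println("Launched map tasks:    " + maps);
        System.out.println("Launched reduce tasks: " + reduces);
    }
}
```

Before execution only estimates are available: the map count roughly follows the number of input splits, and the reduce count is whatever the job requests via setNumReduceTasks().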