MapReduce

Gridgain failover of master (sender) node

若如初见. Submitted 2019-12-11 07:05:49
Question: I am working on a batch-processing problem. The solution needs to handle failing hardware. There is a master node (which initiates task execution) and worker nodes which execute the jobs. I know how failover of worker nodes works, but I could not find any information about failover of master nodes. Whenever the master node that started a task fails, the whole task is canceled. Is there any way to finish processing the task in that case? Could you suggest the best way of implementing failover of the master node? Kind regards …

Mongodb Mapreduce giving an error

坚强是说给别人听的谎言 Submitted 2019-12-11 07:02:01
Question: I have a data set I am running map-reduce over: 1000000 records of randomly collected form data. The data structure is as follows: { "_id" : ObjectId("4d9c8318cbb7813ef940d9e6"), "clientid" : 5, "FormData" : { "emailadress" : "SWV" } } { "_id" : ObjectId("4d9c8318cbb7813efb40d9e6"), "clientid" : 4, "FormData" : { "key1" : "VCYU", "key" : "PJO" } } { "_id" : ObjectId("4d9c8318cbb7813efc40d9e6"), "clientid" : 4, "FormData" : { "key1" : "NJ", "key" : "BZ" } } { "_id" : ObjectId("4d9c8318cbb7813efd40d9e6"), …
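
The excerpt cuts off before the map/reduce functions and the actual error message, so nothing below is from the original post. Purely as a point of reference, a minimal pymongo sketch of running the mapReduce command over documents shaped like these, counting submissions per clientid; the connection string, database name, and collection name formdata are assumptions.

    from pymongo import MongoClient
    from bson.code import Code

    client = MongoClient("mongodb://localhost:27017")  # assumed connection string
    db = client["test"]                                # assumed database name

    # Map: emit one count per document, keyed by clientid.
    map_fn = Code("function () { emit(this.clientid, 1); }")

    # Reduce: sum the counts emitted for each clientid.
    reduce_fn = Code("function (key, values) { return Array.sum(values); }")

    # Run the mapReduce database command (deprecated in recent MongoDB
    # releases, but it is the API this question is about).
    result = db.command(
        "mapReduce",
        "formdata",          # source collection name is an assumption
        map=map_fn,
        reduce=reduce_fn,
        out={"inline": 1},   # return results inline rather than to a collection
    )

    for doc in result["results"]:
        print(doc["_id"], doc["value"])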

How to left out join two big tables effectively

冷暖自知 Submitted 2019-12-11 07:01:05
Question: I have two tables, table_a and table_b. table_a contains 216646500 rows (7155998163 bytes); table_b contains 1462775 rows (2096277141 bytes). table_a's schema is: c_1, c_2, c_3, c_4; table_b's schema is: c_2, c_5, c_6, ... (about 10 columns). I want to do a left outer join of the two tables on the same key, col_2, but it has been running for 16 hours and hasn't finished yet... The PySpark code is as follows: combine_table = table_a.join(table_b, table_a.col_2 == table_b.col_2, 'left_outer').collect() Is there …
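
The excerpt ends before any answer, so the following is not from the original thread, only a sketch of the usual first fixes for a join like this: avoid collect(), which pulls all 200M+ joined rows onto the driver, and hint a broadcast join if the smaller table fits in executor memory. The paths and session setup are illustrative.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("left-outer-join").getOrCreate()

    # Illustrative sources; the original tables are loaded elsewhere.
    table_a = spark.read.parquet("/path/to/table_a")
    table_b = spark.read.parquet("/path/to/table_b")

    # Broadcasting only helps if table_b (~2 GB raw) actually fits in executor
    # memory; otherwise drop the hint and let Spark pick a sort-merge join.
    combined = table_a.join(
        broadcast(table_b),
        table_a.col_2 == table_b.col_2,
        "left_outer",
    )

    # Write the result to storage instead of collect()-ing it on the driver.
    combined.write.mode("overwrite").parquet("/path/to/output")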

ConnectException: Connection refused when run mapreduce in Hadoop

泪湿孤枕 Submitted 2019-12-11 06:59:14
Question: I set up Hadoop (2.6.0) in multi-machine mode: 1 namenode + 3 datanodes. When I ran the command start-all.sh, they (namenode, datanode, resource manager, node manager) all started OK. I checked with the jps command and the results on each node were as below. NameNode: 7300 ResourceManager, 6942 NameNode, 7154 SecondaryNameNode. DataNodes: 3840 DataNode, 3924 NodeManager. I also uploaded a sample text file to HDFS at /user/hadoop/data/sample.txt. There was absolutely no error at that moment. But when I tried to run …
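
The excerpt stops right before the failing command and the stack trace. A "Connection refused" at job-submission time usually means the client is pointing at an address/port where nothing is listening (commonly fs.defaultFS or the ResourceManager address resolving to the wrong host). Purely as a diagnostic sketch, not from the original thread, the relevant endpoints can be probed from the submitting machine; the host names and ports below are assumptions that should be taken from core-site.xml and yarn-site.xml.

    import socket

    # Replace with the values from core-site.xml (fs.defaultFS) and
    # yarn-site.xml (yarn.resourcemanager.address); these are assumptions.
    endpoints = {
        "HDFS NameNode": ("namenode-host", 9000),
        "YARN ResourceManager": ("namenode-host", 8032),
    }

    for name, (host, port) in endpoints.items():
        try:
            with socket.create_connection((host, port), timeout=5):
                print(f"{name} {host}:{port} is reachable")
        except OSError as exc:
            # "Connection refused" here reproduces the MapReduce error outside Hadoop.
            print(f"{name} {host}:{port} NOT reachable: {exc}")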

Overriding TableMapper splits

我的未来我决定 Submitted 2019-12-11 06:58:30
Question: I am using the following code to read from a table whose row keys have the format "epoch_meter", where epoch is the long representation of the date-time in seconds and meter is a meter number. Job jobCalcDFT = Job.getInstance(confCalcIndDeviation); jobCalcDFT.setJarByClass(CalculateIndividualDeviation.class); Scan scan = new Scan(Bytes.toBytes(String.valueOf(startSeconds) + "_"), Bytes.toBytes(String.valueOf(endSeconds + 1) + "_")); scan.setCaching(500); scan.setCacheBlocks(false); …
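
The excerpt cuts off before the part about the splits themselves. Only to make the "epoch_meter" start/stop-row range concrete, here is the same scan bound expressed with the happybase Python client; the Thrift gateway host, table name, and epoch values are made up, and this illustrates just the key range, not the Java TableMapper setup from the question.

    import happybase

    # Assumed HBase Thrift gateway and table name.
    connection = happybase.Connection("thrift-gateway-host", port=9090)
    table = connection.table("meter_readings")

    start_seconds = 1388534400   # illustrative epoch bounds
    end_seconds = 1391212800

    # Row keys look like "<epoch>_<meter>"; the range [start_, (end+1)_) is
    # compared as bytes, so it only behaves numerically while all epochs
    # have the same number of digits.
    for key, data in table.scan(
            row_start=("%d_" % start_seconds).encode(),
            row_stop=("%d_" % (end_seconds + 1)).encode(),
            batch_size=500):
        print(key, data)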

Writing to multiple HBASE Tables, how do I use context.write(hkey, put)?

て烟熏妆下的殇ゞ Submitted 2019-12-11 06:46:41
Question: I am new to Hadoop MapReduce. I would like to perform writes to multiple tables from my reducer function, something like: if anything is written to Table1, then I want the same content in Table2 as well. I have gone through posts like "Write to multiple tables in HBASE" and checked MultiTableOutputFormat. But what I don't understand there is that, according to the post, in the reducer function I should just use context.write(new ImmutableBytesWritable(Bytes.toBytes( …

How to remove duplicate values using MapReduce

橙三吉。 Submitted 2019-12-11 06:34:19
Question: I have a data set as below:

Key    Value
k1     a1,b1,c1,d1
k2     a2,b1,c2,d2
k3     a3,b1,c3,d3
k4     a4,b1,c4,d4
k5     a5,b1,c5,d5

In the above data set the keys are distinct, and among the comma-separated values one value, b1, is common to every record. My requirement is: when that value is the same, only one of those records should be selected as the output record. In short, I want to remove duplicate values even though the keys are distinct. Can anybody tell me how to approach this? I have the implementation below - a. …
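
The excerpt ends just as the asker's own implementation starts, so the following is not their code. As one possible shape of a solution, a Hadoop Streaming style sketch in Python: the mapper re-keys each record by the shared value (b1 in the sample), and the reducer keeps only the first record per shared value. The tab-separated input format and the position of the shared field are assumptions.

    # mapper.py - input lines look like "k1<TAB>a1,b1,c1,d1"
    import sys

    for line in sys.stdin:
        key, values = line.rstrip("\n").split("\t", 1)
        shared = values.split(",")[1]          # assume the shared value is the second field
        # Re-key by the shared value so duplicates meet in the same reduce group.
        print("%s\t%s\t%s" % (shared, key, values))

    # reducer.py - streaming delivers lines sorted by the new key
    import sys

    previous = None
    for line in sys.stdin:
        shared, key, values = line.rstrip("\n").split("\t", 2)
        if shared != previous:
            # The first record seen for each shared value wins; the rest are dropped.
            print("%s\t%s" % (key, values))
            previous = shared

These two scripts would be wired together with the Hadoop Streaming jar (-mapper mapper.py -reducer reducer.py), with input and output paths supplied as usual.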

App Engine - Task Queue Retry Count with Mapper API

99封情书 Submitted 2019-12-11 06:28:57
Question: Here is what I'm trying to do: I set up a MapReduce job with the new Mapper API. This basically works fine. The problem is that the Task Queue retries all tasks that have failed, but I don't actually want it to do that. Is there a way to delete a task from the queue, or to tell it that the task completed successfully? Perhaps by passing a 200 status code? I know that I can fetch X-AppEngine-TaskRetryCount, but that doesn't really help since I don't know how to stop the task. I tried using a …
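
The excerpt stops mid-solution. For reference (not taken from the original post): push-queue semantics are that any 2xx response from the task handler marks the task as done, anything else triggers a retry, and the current retry number arrives in the X-AppEngine-TaskRetryCount header. A minimal webapp2 sketch of a handler that gives up after a few retries; the URL, the retry limit, and do_work() are placeholders.

    import logging
    import webapp2

    def do_work():
        """Placeholder for the real per-task work."""
        pass

    class MapTaskHandler(webapp2.RequestHandler):
        def post(self):
            retries = int(self.request.headers.get("X-AppEngine-TaskRetryCount", 0))
            if retries >= 3:   # illustrative cutoff
                logging.warning("giving up after %d retries", retries)
                # Any 2xx tells the push queue the task succeeded,
                # so it will not be retried again.
                self.response.set_status(200)
                return
            try:
                do_work()
            except Exception:
                logging.exception("task failed, letting the queue retry it")
                # Non-2xx (or an unhandled exception) causes a retry.
                self.response.set_status(500)
                return
            self.response.set_status(200)

    app = webapp2.WSGIApplication([("/tasks/map", MapTaskHandler)])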

How to collect Hadoop Cluster Size/Number of Cores Information

牧云@^-^@ Submitted 2019-12-11 06:28:29
Question: I am running my Hadoop jobs on a cluster consisting of multiple machines whose sizes are not known (main memory, number of cores, disk size, etc. per machine). Without using any OS-specific library (*.so files, I mean), is there any class or tool in Hadoop itself, or in some additional library, with which I could collect information like the following while the Hadoop MR jobs are executing: total number of cores / number of cores used by the job; total available main memory / allocated main memory …
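
The excerpt cuts off mid-list. One option that avoids OS-specific libraries, assuming the cluster runs YARN, is the ResourceManager's REST API, which reports cluster-wide core and memory totals. A small sketch; the ResourceManager host name is an assumption, and the field names come from the /ws/v1/cluster/metrics endpoint.

    import requests

    # ResourceManager web address is an assumption; 8088 is the default port.
    rm = "http://resourcemanager-host:8088"

    metrics = requests.get(rm + "/ws/v1/cluster/metrics").json()["clusterMetrics"]

    print("total vcores:     ", metrics["totalVirtualCores"])
    print("allocated vcores: ", metrics["allocatedVirtualCores"])
    print("total memory MB:  ", metrics["totalMB"])
    print("allocated MB:     ", metrics["allocatedMB"])
    print("active nodes:     ", metrics["activeNodes"])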

Consolidate MapReduce logs

江枫思渺然 Submitted 2019-12-11 06:24:53
Question: Debugging Hadoop map-reduce jobs is a pain. I can print to stdout, but these logs show up on all of the different machines on which the MR job ran. I can go to the jobtracker, find my job, and click on each individual mapper to get to its task log, but this is extremely cumbersome when you have 20+ mappers/reducers. I was thinking that I might have to write a script that would scrape the job tracker to figure out what machine each of the mappers/reducers ran on and then scp the …
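
The excerpt ends before the scripted scp approach is spelled out. On YARN-based clusters (not the JobTracker setup the question describes) there is a built-in alternative: with log aggregation enabled, the yarn logs command gathers every mapper's and reducer's container logs into one stream. A small sketch that shells out to it; the application id is a placeholder passed on the command line.

    import subprocess
    import sys

    app_id = sys.argv[1]   # e.g. application_1418359767884_0011 (placeholder)

    # Requires yarn.log-aggregation-enable=true; prints all container logs
    # for the job as one consolidated stream.
    subprocess.run(["yarn", "logs", "-applicationId", app_id], check=True)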