MapReduce

How does the MapReduce sort algorithm work?

Submitted by 。_饼干妹妹 on 2019-12-17 14:59:50
问题 Question: One of the main examples used to demonstrate the power of MapReduce is the Terasort benchmark. I'm having trouble understanding the basics of the sorting algorithm used in the MapReduce environment. To me, sorting simply involves determining the relative position of an element in relation to all other elements, so sorting involves comparing "everything" with "everything". Your average sorting algorithm (quicksort, bubble sort, ...) simply does this in a smart way. In my mind, splitting the …
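
The short version of the usual answer: MapReduce never compares "everything with everything". The map side partitions keys into disjoint ranges, the framework sort-merges the keys delivered to each reducer, and because reducer i only receives keys smaller than those of reducer i+1, concatenating part-00000, part-00001, ... already gives a globally sorted result. Terasort's TotalOrderPartitioner samples the input to pick those range boundaries. Below is a minimal hand-rolled sketch of the idea in Java; the bucket-by-first-letter rule and the Text/Text types are assumptions for illustration, not what Terasort actually does.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes keys to reducers by range: reducer 0 gets the "smallest" keys,
    // the last reducer the "largest". Each reducer sorts only its own range,
    // so the concatenated reducer outputs are globally ordered.
    public class RangePartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            if (numPartitions == 1 || key.getLength() == 0) {
                return 0;
            }
            int firstChar = Character.toLowerCase(key.charAt(0));
            int bucket = (firstChar - 'a') * numPartitions / 26; // crude range rule
            return Math.max(0, Math.min(numPartitions - 1, bucket));
        }
    }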

Running a job using hadoop streaming and mrjob: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

Submitted by 廉价感情. on 2019-12-17 11:54:05
问题 Question: Hey, I'm fairly new to the world of Big Data. I came across this tutorial at http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/ It describes in detail how to run a MapReduce job using mrjob, both locally and on Elastic MapReduce. Well, I'm trying to run this on my own Hadoop cluster. I ran the job using the following command:
python density.py tiny.dat -r hadoop --hadoop-bin /usr/bin/hadoop > outputmusic
And this is what I get:
HADOOP: Running job: job…

Writing to HDFS could only be replicated to 0 nodes instead of minReplication (=1)

Submitted by a 夏天 on 2019-12-17 10:37:09
问题 Question: I have 3 data nodes running. While running a job I am getting the error given below:
java.io.IOException: File /user/ashsshar/olhcache/loaderMap9b663bd9 could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and 3 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1325)
This error mainly occurs when our DataNode instances have run out of space or if…
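
Since the snippet points at DataNodes running out of space (or being excluded), a quick programmatic check of overall HDFS capacity can confirm the disk-space theory; this is roughly the same information hdfs dfsadmin -report gives you. A small sketch, assuming your core-site.xml/hdfs-site.xml are on the classpath:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FsStatus;

    // Prints cluster-wide HDFS capacity and usage; if "Remaining" is near zero,
    // the "could only be replicated to 0 nodes" error is very likely disk space.
    public class HdfsSpaceCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // picks up the cluster config from the classpath
            FileSystem fs = FileSystem.get(conf);
            FsStatus status = fs.getStatus();
            System.out.println("Capacity : " + status.getCapacity());
            System.out.println("Used     : " + status.getUsed());
            System.out.println("Remaining: " + status.getRemaining());
            fs.close();
        }
    }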

TypeError: list indices must be integers, not str Python

Submitted by 南楼画角 on 2019-12-17 10:01:09
问题 Question: list[s] is a string. Why doesn't this work? The following error appears: TypeError: list indices must be integers, not str
list = ['abc', 'def']
map_list = []
for s in list:
    t = (list[s], 1)
    map_list.append(t)
回答1 Answer 1:
list1 = ['abc', 'def']
list2 = []
for t in list1:
    for h in t:
        list2.append(h)
map_list = []
for x, y in enumerate(list2):
    map_list.append(x)
print(map_list)
Output:
>>> [0, 1, 2, 3, 4, 5]
>>>
This is exactly what you want. If you don't want to reach each element, then:
list1 = ['abc',…

No such method exception Hadoop <init>

Submitted by 泄露秘密 on 2019-12-17 09:20:18
问题 Question: When I run a Hadoop .jar file from the command prompt, it throws an exception saying there is no such method on my StockKey class. StockKey is my custom class defined for my own type of key. Here is the exception:
12/07/12 00:18:47 INFO mapred.JobClient: Task Id : attempt_201207082224_0007_m_000000_1, Status : FAILED
java.lang.RuntimeException: java.lang.NoSuchMethodException: SecondarySort$StockKey.<init>()
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
at org…
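
A NoSuchMethodException on <init>() almost always means Hadoop's ReflectionUtils cannot instantiate the key class: custom WritableComparable keys need a public no-argument constructor, and because StockKey is nested inside SecondarySort it must also be declared static (a non-static inner class has no true no-arg constructor). A minimal sketch of such a key; the single int field is a placeholder for illustration:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class SecondarySort {

        // Must be static: Hadoop creates keys reflectively via ReflectionUtils.newInstance().
        public static class StockKey implements WritableComparable<StockKey> {

            private int symbolId; // illustrative field; your real key will differ

            // The public no-arg constructor that reflection looks for.
            public StockKey() {
            }

            public StockKey(int symbolId) {
                this.symbolId = symbolId;
            }

            @Override
            public void write(DataOutput out) throws IOException {
                out.writeInt(symbolId);
            }

            @Override
            public void readFields(DataInput in) throws IOException {
                symbolId = in.readInt();
            }

            @Override
            public int compareTo(StockKey other) {
                return Integer.compare(symbolId, other.symbolId);
            }
        }
    }

A key used for grouping should normally also override hashCode() and equals() consistently with compareTo().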

Split size vs Block size in Hadoop

Submitted by 懵懂的女人 on 2019-12-17 08:52:33
问题 Question: What is the relationship between split size and block size in Hadoop? As I read in this post, the split size must be n times the block size (n is an integer and n > 0); is this correct? Is there any required relationship between split size and block size?
回答1 Answer 1: In the HDFS architecture there is a concept of blocks. A typical block size used by HDFS is 64 MB. When we place a large file into HDFS, it is chopped up into 64 MB chunks (based on the default block configuration). Suppose you have a file of 1 GB and you want…
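
To make the distinction concrete: the block size is a physical HDFS storage setting, while the split size is a logical, per-job setting that the input format computes as max(minSplitSize, min(maxSplitSize, blockSize)), so it does not have to be a multiple of the block size. A small sketch with the new MapReduce API showing how the split size can be steered independently of the block size; the concrete sizes are arbitrary examples:

    import java.io.IOException;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeExample {
        public static void main(String[] args) throws IOException {
            // Splits are computed as max(minSplitSize, min(maxSplitSize, blockSize)),
            // so capping maxSplitSize below the block size produces more, smaller
            // splits (and more map tasks) without changing the HDFS block size.
            Job job = Job.getInstance();
            FileInputFormat.setMinInputSplitSize(job, 32L * 1024 * 1024);   // 32 MB
            FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB
            // ... the rest of the job configuration (mapper, reducer, paths) goes here
        }
    }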

MongoDB Stored Procedure Equivalent

Submitted by 折月煮酒 on 2019-12-17 07:15:08
问题 Question: I have a large CSV file containing a list of stores, in which one of the fields is ZipCode. I have a separate MongoDB database called ZipCodes, which stores the latitude and longitude for any given zip code. In SQL Server, I would execute a stored procedure called InsertStore which would look up the corresponding latitude and longitude in the ZipCodes table and insert the data into the Stores table. Is there anything similar to the concept of stored procedures in MongoDB for this?
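
MongoDB has no direct equivalent of SQL Server stored procedures; the usual pattern is to do the zip-code lookup and the enriched insert in application code (or, for set-based work, with the aggregation pipeline). A minimal sketch with the MongoDB Java driver, treating ZipCodes as a collection in the same database for brevity; the database, collection, and field names (mydb, zipcodes, stores, zip, lat, lng) are assumptions for illustration:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import com.mongodb.client.model.Filters;
    import org.bson.Document;

    public class InsertStore {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoDatabase db = client.getDatabase("mydb");          // assumed database name
                MongoCollection<Document> zipCodes = db.getCollection("zipcodes");
                MongoCollection<Document> stores = db.getCollection("stores");

                String name = "Example Store";   // values that would come from a CSV row
                String zip = "90210";

                // Look up latitude/longitude for the store's zip code.
                Document zipDoc = zipCodes.find(Filters.eq("zip", zip)).first();
                if (zipDoc == null) {
                    throw new IllegalStateException("Unknown zip code: " + zip);
                }

                // Insert the store enriched with the coordinates.
                stores.insertOne(new Document("name", name)
                        .append("zip", zip)
                        .append("lat", zipDoc.get("lat"))
                        .append("lng", zipDoc.get("lng")));
            }
        }
    }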

Is gzip format supported in Spark?

Submitted by ▼魔方 西西 on 2019-12-17 07:14:39
问题 Question: For a Big Data project, I'm planning to use Spark, which has some nice features like in-memory computation for repeated workloads. It can run on local files or on top of HDFS. However, in the official documentation, I can't find any hint as to how to process gzipped files. In practice, it can be quite efficient to process .gz files instead of unzipped files. Is there a way to manually implement reading of gzipped files, or is unzipping already done automatically when reading a .gz file? 回答1 Answer 1: …
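
Spark reads gzipped text files transparently through Hadoop's compression codecs, so pointing textFile at a .gz path just works; the caveat is that gzip is not a splittable format, so each .gz file becomes a single partition unless you repartition afterwards. A small sketch with Spark's Java API; the path, app name, and partition count are placeholders:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class GzipCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("GzipCount").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Decompression happens automatically based on the .gz extension.
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input.gz");

            // gzip files are not splittable: repartition if downstream stages need parallelism.
            System.out.println("Line count: " + lines.repartition(8).count());

            sc.stop();
        }
    }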

Remove Duplicates from MongoDB

Submitted by ﹥>﹥吖頭↗ on 2019-12-17 06:14:40
问题 Question: Hi, I have ~5 million documents in MongoDB (with replication), each document with 43 fields. How do I remove duplicate documents? I tried
db.testkdd.ensureIndex({ duration : 1 , protocol_type : 1 , service : 1 , flag : 1 , src_bytes : 1 , dst_bytes : 1 , land : 1 , wrong_fragment : 1 , urgent : 1 , hot : 1 , num_failed_logins : 1 , logged_in : 1 , num_compromised : 1 , root_shell : 1 , su_attempted : 1 , num_root : 1 , num_file_creations : 1 , num_shells : 1 , num_access_files : 1 , num_outbound_cmds : 1…
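
If that ensureIndex call was heading toward a unique index with dropDups, note that the option was removed in MongoDB 3.0, so deduplication is now usually done by grouping on the fields that define a duplicate, keeping one _id per group, and deleting the rest. A sketch with the MongoDB Java driver; for brevity it groups on only three of the 43 fields (duration, protocol_type, service), and the database name is an assumption:

    import java.util.Arrays;
    import java.util.List;
    import org.bson.Document;
    import org.bson.conversions.Bson;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Accumulators;
    import com.mongodb.client.model.Aggregates;
    import com.mongodb.client.model.Filters;

    public class RemoveDuplicates {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> coll =
                        client.getDatabase("mydb").getCollection("testkdd"); // database name assumed

                // Group documents by the fields that define a "duplicate" and collect
                // their _ids; any group with count > 1 contains duplicates.
                List<Bson> pipeline = Arrays.asList(
                        Aggregates.group(
                                new Document("duration", "$duration")
                                        .append("protocol_type", "$protocol_type")
                                        .append("service", "$service"),
                                Accumulators.push("ids", "$_id"),
                                Accumulators.sum("count", 1)),
                        Aggregates.match(Filters.gt("count", 1)));

                for (Document group : coll.aggregate(pipeline).allowDiskUse(true)) {
                    @SuppressWarnings("unchecked")
                    List<Object> ids = (List<Object>) group.get("ids");
                    // Keep the first document of each group, delete the others.
                    coll.deleteMany(Filters.in("_id", ids.subList(1, ids.size())));
                }
            }
        }
    }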

Hadoop DistributedCache is deprecated - what is the preferred API?

Submitted by 丶灬走出姿态 on 2019-12-17 05:41:03
问题 Question: My map tasks need some configuration data, which I would like to distribute via the distributed cache. The Hadoop MapReduce Tutorial shows the usage of the DistributedCache class, roughly as follows:
// In the driver
JobConf conf = new JobConf(getConf(), WordCount.class);
...
DistributedCache.addCacheFile(new Path(filename).toUri(), conf);
// In the mapper
Path[] myCacheFiles = DistributedCache.getLocalCacheFiles(job);
...
However, DistributedCache is marked as deprecated in Hadoop 2.2.0.
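
The commonly recommended replacement in the new org.apache.hadoop.mapreduce API is Job.addCacheFile() in the driver plus context.getCacheFiles() in the mapper's setup(); the older code keeps working, but this avoids the deprecated class. A sketch; the file path and the mapper's key/value types are placeholders:

    import java.io.IOException;
    import java.net.URI;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CacheExample {

        // In the driver: register the file with the Job instead of DistributedCache.
        public static void configure(Job job) {
            job.addCacheFile(new Path("/config/settings.properties").toUri()); // placeholder path
        }

        public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void setup(Context context) throws IOException, InterruptedException {
                // URIs of everything registered via Job.addCacheFile in the driver.
                URI[] cacheFiles = context.getCacheFiles();
                if (cacheFiles != null) {
                    for (URI uri : cacheFiles) {
                        // Open and parse the configuration file here, e.g. with
                        // FileSystem.get(context.getConfiguration()).open(new Path(uri)).
                    }
                }
            }
        }
    }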