MapReduce

How does the MapReduce sort algorithm work?

Submitted by 。_饼干妹妹 on 2019-12-17 14:59:50
问题 Question: One of the main examples used to demonstrate the power of MapReduce is the Terasort benchmark. I'm having trouble understanding the basics of the sorting algorithm used in the MapReduce environment. To me, sorting simply involves determining the relative position of an element in relation to all other elements, so sorting involves comparing "everything" with "everything". Your average sorting algorithm (quicksort, bubble sort, ...) simply does this in a smart way. In my mind, splitting the …
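
The short version of the usual answer: MapReduce never compares "everything with everything". The map side partitions keys into disjoint ranges, the framework sort-merges the keys delivered to each reducer, and because reducer i only receives keys smaller than those of reducer i+1, concatenating part-00000, part-00001, ... already gives a globally sorted result. Terasort's TotalOrderPartitioner samples the input to pick those range boundaries. Below is a minimal hand-rolled sketch of the idea in Java; the bucket-by-first-letter rule and the Text/Text types are assumptions for illustration, not what Terasort actually does.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes keys to reducers by range: reducer 0 gets the "smallest" keys,
    // the last reducer the "largest". Each reducer sorts only its own range,
    // so the concatenated reducer outputs are globally ordered.
    public class RangePartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            if (numPartitions == 1 || key.getLength() == 0) {
                return 0;
            }
            int firstChar = Character.toLowerCase(key.charAt(0));
            int bucket = (firstChar - 'a') * numPartitions / 26; // crude range rule
            return Math.max(0, Math.min(numPartitions - 1, bucket));
        }
    }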

Running a job using hadoop streaming and mrjob: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

Submitted by 廉价感情. on 2019-12-17 11:54:05
问题 Question: Hey, I'm fairly new to the world of Big Data. I came across this tutorial at http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/ It describes in detail how to run a MapReduce job using mrjob, both locally and on Elastic MapReduce. Well, I'm trying to run this on my own Hadoop cluster. I ran the job using the following command:
python density.py tiny.dat -r hadoop --hadoop-bin /usr/bin/hadoop > outputmusic
And this is what I get:
HADOOP: Running job: job…

Writing to HDFS could only be replicated to 0 nodes instead of minReplication (=1)

Submitted by a 夏天 on 2019-12-17 10:37:09
问题 Question: I have 3 data nodes running. While running a job I am getting the error given below:
java.io.IOException: File /user/ashsshar/olhcache/loaderMap9b663bd9 could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and 3 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1325)
This error mainly occurs when our DataNode instances have run out of space or if…
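
Since the snippet points at DataNodes running out of space (or being excluded), a quick programmatic check of overall HDFS capacity can confirm the disk-space theory; this is roughly the same information hdfs dfsadmin -report gives you. A small sketch, assuming your core-site.xml/hdfs-site.xml are on the classpath:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FsStatus;

    // Prints cluster-wide HDFS capacity and usage; if "Remaining" is near zero,
    // the "could only be replicated to 0 nodes" error is very likely disk space.
    public class HdfsSpaceCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // picks up the cluster config from the classpath
            FileSystem fs = FileSystem.get(conf);
            FsStatus status = fs.getStatus();
            System.out.println("Capacity : " + status.getCapacity());
            System.out.println("Used     : " + status.getUsed());
            System.out.println("Remaining: " + status.getRemaining());
            fs.close();
        }
    }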

TypeError: list indices must be integers, not str Python

Submitted by 南楼画角 on 2019-12-17 10:01:09
问题 Question: list[s] is a string. Why doesn't this work? The following error appears: TypeError: list indices must be integers, not str
list = ['abc', 'def']
map_list = []
for s in list:
    t = (list[s], 1)
    map_list.append(t)
回答1 Answer 1:
list1 = ['abc', 'def']
list2 = []
for t in list1:
    for h in t:
        list2.append(h)
map_list = []
for x, y in enumerate(list2):
    map_list.append(x)
print(map_list)
Output:
>>> [0, 1, 2, 3, 4, 5]
>>>
This is exactly what you want. If you don't want to reach each element, then:
list1 = ['abc',…

No such method exception Hadoop <init>

Submitted by 泄露秘密 on 2019-12-17 09:20:18
问题 Question: When I run a Hadoop .jar file from the command prompt, it throws an exception saying there is no such method on my StockKey class. StockKey is my custom class defined for my own type of key. Here is the exception:
12/07/12 00:18:47 INFO mapred.JobClient: Task Id : attempt_201207082224_0007_m_000000_1, Status : FAILED
java.lang.RuntimeException: java.lang.NoSuchMethodException: SecondarySort$StockKey.<init>()
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
at org…
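
A NoSuchMethodException on <init>() almost always means Hadoop's ReflectionUtils cannot instantiate the key class: custom WritableComparable keys need a public no-argument constructor, and because StockKey is nested inside SecondarySort it must also be declared static (a non-static inner class has no true no-arg constructor). A minimal sketch of such a key; the single int field is a placeholder for illustration:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class SecondarySort {

        // Must be static: Hadoop creates keys reflectively via ReflectionUtils.newInstance().
        public static class StockKey implements WritableComparable<StockKey> {

            private int symbolId; // illustrative field; your real key will differ

            // The public no-arg constructor that reflection looks for.
            public StockKey() {
            }

            public StockKey(int symbolId) {
                this.symbolId = symbolId;
            }

            @Override
            public void write(DataOutput out) throws IOException {
                out.writeInt(symbolId);
            }

            @Override
            public void readFields(DataInput in) throws IOException {
                symbolId = in.readInt();
            }

            @Override
            public int compareTo(StockKey other) {
                return Integer.compare(symbolId, other.symbolId);
            }
        }
    }

A key used for grouping should normally also override hashCode() and equals() consistently with compareTo().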

Split size vs Block size in Hadoop

Submitted by 懵懂的女人 on 2019-12-17 08:52:33
问题 Question: What is the relationship between split size and block size in Hadoop? As I read in this post, the split size must be n times the block size (n is an integer and n > 0); is this correct? Is there any required relationship between split size and block size?
回答1 Answer 1: In the HDFS architecture there is a concept of blocks. A typical block size used by HDFS is 64 MB. When we place a large file into HDFS, it is chopped up into 64 MB chunks (based on the default block configuration). Suppose you have a file of 1 GB and you want…
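
To make the distinction concrete: the block size is a physical HDFS storage setting, while the split size is a logical, per-job setting that the input format computes as max(minSplitSize, min(maxSplitSize, blockSize)), so it does not have to be a multiple of the block size. A small sketch with the new MapReduce API showing how the split size can be steered independently of the block size; the concrete sizes are arbitrary examples:

    import java.io.IOException;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeExample {
        public static void main(String[] args) throws IOException {
            // Splits are computed as max(minSplitSize, min(maxSplitSize, blockSize)),
            // so capping maxSplitSize below the block size produces more, smaller
            // splits (and more map tasks) without changing the HDFS block size.
            Job job = Job.getInstance();
            FileInputFormat.setMinInputSplitSize(job, 32L * 1024 * 1024);   // 32 MB
            FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB
            // ... the rest of the job configuration (mapper, reducer, paths) goes here
        }
    }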

MongoDB Stored Procedure Equivalent

Submitted by 折月煮酒 on 2019-12-17 07:15:08
问题 Question: I have a large CSV file containing a list of stores, in which one of the fields is ZipCode. I have a separate MongoDB database called ZipCodes, which stores the latitude and longitude for any given zip code. In SQL Server, I would execute a stored procedure called InsertStore which would look up the corresponding latitude and longitude in the ZipCodes table and insert the data into the Stores table. Is there anything similar to the concept of stored procedures in MongoDB for this?
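
MongoDB has no direct equivalent of SQL Server stored procedures; the usual pattern is to do the zip-code lookup and the enriched insert in application code (or, for set-based work, with the aggregation pipeline). A minimal sketch with the MongoDB Java driver, treating ZipCodes as a collection in the same database for brevity; the database, collection, and field names (mydb, zipcodes, stores, zip, lat, lng) are assumptions for illustration:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import com.mongodb.client.model.Filters;
    import org.bson.Document;

    public class InsertStore {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoDatabase db = client.getDatabase("mydb");          // assumed database name
                MongoCollection<Document> zipCodes = db.getCollection("zipcodes");
                MongoCollection<Document> stores = db.getCollection("stores");

                String name = "Example Store";   // values that would come from a CSV row
                String zip = "90210";

                // Look up latitude/longitude for the store's zip code.
                Document zipDoc = zipCodes.find(Filters.eq("zip", zip)).first();
                if (zipDoc == null) {
                    throw new IllegalStateException("Unknown zip code: " + zip);
                }

                // Insert the store enriched with the coordinates.
                stores.insertOne(new Document("name", name)
                        .append("zip", zip)
                        .append("lat", zipDoc.get("lat"))
                        .append("lng", zipDoc.get("lng")));
            }
        }
    }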

Is gzip format supported in Spark?

Submitted by ▼魔方 西西 on 2019-12-17 07:14:39
问题 Question: For a Big Data project, I'm planning to use Spark, which has some nice features like in-memory computation for repeated workloads. It can run on local files or on top of HDFS. However, in the official documentation, I can't find any hint as to how to process gzipped files. In practice, it can be quite efficient to process .gz files instead of unzipped files. Is there a way to manually implement reading of gzipped files, or is unzipping already done automatically when reading a .gz file? 回答1 Answer 1: …
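
Spark reads gzipped text files transparently through Hadoop's compression codecs, so pointing textFile at a .gz path just works; the caveat is that gzip is not a splittable format, so each .gz file becomes a single partition unless you repartition afterwards. A small sketch with Spark's Java API; the path, app name, and partition count are placeholders:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class GzipCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("GzipCount").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Decompression happens automatically based on the .gz extension.
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input.gz");

            // gzip files are not splittable: repartition if downstream stages need parallelism.
            System.out.println("Line count: " + lines.repartition(8).count());

            sc.stop();
        }
    }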

Remove Duplicates from MongoDB

Submitted by ﹥>﹥吖頭↗ on 2019-12-17 06:14:40
问题 Question: Hi, I have ~5 million documents in MongoDB (with replication), each document with 43 fields. How do I remove duplicate documents? I tried
db.testkdd.ensureIndex({ duration : 1 , protocol_type : 1 , service : 1 , flag : 1 , src_bytes : 1 , dst_bytes : 1 , land : 1 , wrong_fragment : 1 , urgent : 1 , hot : 1 , num_failed_logins : 1 , logged_in : 1 , num_compromised : 1 , root_shell : 1 , su_attempted : 1 , num_root : 1 , num_file_creations : 1 , num_shells : 1 , num_access_files : 1 , num_outbound_cmds : 1…
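
If that ensureIndex call was heading toward a unique index with dropDups, note that the option was removed in MongoDB 3.0, so deduplication is now usually done by grouping on the fields that define a duplicate, keeping one _id per group, and deleting the rest. A sketch with the MongoDB Java driver; for brevity it groups on only three of the 43 fields (duration, protocol_type, service), and the database name is an assumption:

    import java.util.Arrays;
    import java.util.List;
    import org.bson.Document;
    import org.bson.conversions.Bson;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Accumulators;
    import com.mongodb.client.model.Aggregates;
    import com.mongodb.client.model.Filters;

    public class RemoveDuplicates {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> coll =
                        client.getDatabase("mydb").getCollection("testkdd"); // database name assumed

                // Group documents by the fields that define a "duplicate" and collect
                // their _ids; any group with count > 1 contains duplicates.
                List<Bson> pipeline = Arrays.asList(
                        Aggregates.group(
                                new Document("duration", "$duration")
                                        .append("protocol_type", "$protocol_type")
                                        .append("service", "$service"),
                                Accumulators.push("ids", "$_id"),
                                Accumulators.sum("count", 1)),
                        Aggregates.match(Filters.gt("count", 1)));

                for (Document group : coll.aggregate(pipeline).allowDiskUse(true)) {
                    @SuppressWarnings("unchecked")
                    List<Object> ids = (List<Object>) group.get("ids");
                    // Keep the first document of each group, delete the others.
                    coll.deleteMany(Filters.in("_id", ids.subList(1, ids.size())));
                }
            }
        }
    }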

Hadoop DistributedCache is deprecated - what is the preferred API?

Submitted by 丶灬走出姿态 on 2019-12-17 05:41:03
问题 Question: My map tasks need some configuration data, which I would like to distribute via the distributed cache. The Hadoop MapReduce Tutorial shows the usage of the DistributedCache class, roughly as follows:
// In the driver
JobConf conf = new JobConf(getConf(), WordCount.class);
...
DistributedCache.addCacheFile(new Path(filename).toUri(), conf);
// In the mapper
Path[] myCacheFiles = DistributedCache.getLocalCacheFiles(job);
...
However, DistributedCache is marked as deprecated in Hadoop 2.2.0.
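
The commonly recommended replacement in the new org.apache.hadoop.mapreduce API is Job.addCacheFile() in the driver plus context.getCacheFiles() in the mapper's setup(); the older code keeps working, but this avoids the deprecated class. A sketch; the file path and the mapper's key/value types are placeholders:

    import java.io.IOException;
    import java.net.URI;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CacheExample {

        // In the driver: register the file with the Job instead of DistributedCache.
        public static void configure(Job job) {
            job.addCacheFile(new Path("/config/settings.properties").toUri()); // placeholder path
        }

        public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void setup(Context context) throws IOException, InterruptedException {
                // URIs of everything registered via Job.addCacheFile in the driver.
                URI[] cacheFiles = context.getCacheFiles();
                if (cacheFiles != null) {
                    for (URI uri : cacheFiles) {
                        // Open and parse the configuration file here, e.g. with
                        // FileSystem.get(context.getConfiguration()).open(new Path(uri)).
                    }
                }
            }
        }
    }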