MapReduce

Running Spark app on EMR is slow

Question: I am new to Spark and MapReduce and I have a problem running Spark on an Elastic MapReduce (EMR) AWS cluster. The problem is that running on EMR takes a lot of time. For example, I have a few million records in a .csv file that I read and converted into a JavaRDD. With Spark, it took 104.99 seconds to compute simple mapToDouble() and sum() functions on this dataset, whereas when I did the same calculation without Spark, using Java 8 and converting the .csv file to a List, it took only 0.5 seconds.
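
For context, here is a minimal sketch of the two code paths being compared (the class name, file layout and column index are assumptions, not taken from the question). For a dataset of only a few million rows, most of the Spark time is typically fixed overhead such as job scheduling, task serialization and reading from S3/HDFS, rather than the computation itself.

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaDoubleRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SumComparison {
        public static void main(String[] args) throws Exception {
            String path = args[0];  // hypothetical .csv with the numeric value in column 0

            // Spark path: distributed read, parse one column, sum it.
            SparkConf conf = new SparkConf().setAppName("sum-comparison");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> lines = sc.textFile(path);
                JavaDoubleRDD values = lines.mapToDouble(line -> Double.parseDouble(line.split(",")[0]));
                System.out.println("Spark sum: " + values.sum());  // sum() triggers the distributed job
            }

            // Plain Java 8 path: read everything into memory and sum with a stream.
            List<String> all = Files.readAllLines(Paths.get(path));
            double localSum = all.stream()
                    .mapToDouble(line -> Double.parseDouble(line.split(",")[0]))
                    .sum();
            System.out.println("Local sum: " + localSum);
        }
    }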

pass a command line argument to jvm(java) mapper task

Question: I want to debug some parts of my mapper, for which I need to pass some command line arguments to the JVM (Java) process that starts the mapper. What are the different ways to do this? I figured out one way by changing MapTaskRunner.java, but I want to avoid compiling the whole Hadoop package. There should be some simple way, using a configuration file, to pass extra command line arguments to the JVM mapper process. Answer 1: I guess you are looking for the following configuration in mapred-config.xml:
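
The answer is cut off at this point. The property normally used for this (an assumption, not quoted from the original answer) is mapred.child.java.opts, which can live in mapred-site.xml or be set from the driver, as in this sketch; newer Hadoop versions split it into mapreduce.map.java.opts and mapreduce.reduce.java.opts.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapred.JobConf;

    public class DebugChildOpts {
        public static JobConf withDebugOpts(Configuration base) {
            JobConf conf = new JobConf(base);
            // Extra JVM flags passed to every child (map/reduce) task JVM.
            // The remote-debugging port below is an arbitrary example value.
            conf.set("mapred.child.java.opts",
                     "-Xmx512m -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8000");
            return conf;
        }
    }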

How to have lzo compression in hadoop mapreduce?

Question: I want to use LZO to compress map output, but I can't get it to run. The version of Hadoop I use is 0.20.2. I set:

    conf.set("mapred.compress.map.output", "true");
    conf.set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.LzoCodec");

When I run the jar file in Hadoop it throws an exception saying it cannot write map output. Do I have to install LZO? What do I have to do to use LZO? Answer 1: LZO's licence (GPL) is incompatible with that of Hadoop (Apache), and therefore it cannot be bundled with Hadoop itself; the LZO libraries have to be installed separately.
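
As a sanity check before installing LZO, here is a minimal sketch of enabling map-output compression with a codec that does ship with Hadoop (GzipCodec), using the same 0.20-era property names as in the question. Once an LZO codec (e.g. from the hadoop-lzo project) is installed on every node, the codec class in the second property is swapped for the one that installation provides.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;

    public class MapOutputCompression {
        public static void enable(Configuration conf) {
            // Works out of the box: GzipCodec is bundled with Hadoop.
            conf.setBoolean("mapred.compress.map.output", true);
            conf.setClass("mapred.map.output.compression.codec",
                          GzipCodec.class, CompressionCodec.class);
        }
    }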

NoServerForRegionException while running Hadoop MapReduce job on HBase

Question: I am executing a simple Hadoop MapReduce program with HBase as the input and output. I am getting the error: java.lang.RuntimeException: org.apache.hadoop.hbase.client.NoServerForRegionException: Unable to find region for OutPut,,99999999999999 after 10 tries. Answer 1: This exception appeared for us when there was a difference in HBase versions. Our code was built and running with 0.94.x HBase jars, whereas the HBase server was running 0.90.3. When we changed our pom file to the right HBase version, the problem was resolved.
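
A quick, hedged way to see which HBase version the client jars on the classpath actually are, to compare manually against the version reported by the HBase master (e.g. on its web UI):

    import org.apache.hadoop.hbase.util.VersionInfo;

    public class ClientHBaseVersion {
        public static void main(String[] args) {
            // Version of the HBase jars the MapReduce job was built and run with.
            System.out.println("HBase client jar version: " + VersionInfo.getVersion());
        }
    }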

Sort and shuffle optimization in Hadoop MapReduce

Question: I'm looking for a research/implementation-based project on Hadoop, and I came across the list posted on the wiki page: http://wiki.apache.org/hadoop/ProjectSuggestions. But this page was last updated in September 2009, so I'm not sure whether some of these ideas have already been implemented. I was particularly interested in "Sort and Shuffle optimization in the MR framework", which talks about "combining the results of several maps on rack or node before the shuffle" to reduce seeks.
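
The closest built-in mechanism is a Combiner, which aggregates map output within each task before it is shuffled (per map task rather than per node or rack, so the wiki suggestion goes further than this). A minimal word-count-style sketch:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            // The partially aggregated counts are what actually travel across the network.
            context.write(key, new IntWritable(sum));
        }
    }

In the driver it is registered with job.setCombinerClass(SumCombiner.class).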

MapReduce: Log file locations for stdout and std err

Question: If I output some messages through stdout (System.out in Java) and stderr (System.err in Java) in the Mapper and Reducer, where can I see them on the task tracker node? I guess the directory location is configurable through some parameter as well? Answer 1: This might depend on which distribution you are using, but with our CDH3 setup we can find them under /usr/lib/hadoop-0.20/logs/userlogs// on the node where the task ran. For example, stderr would be under: /usr/lib/hadoop-0.20/logs/userlogs/job_201207010432
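
For orientation, a small sketch of which file each kind of output typically ends up in under that per-attempt userlogs directory (stdout, stderr and syslog respectively); the mapper itself is hypothetical:

    import java.io.IOException;
    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LoggingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final Log LOG = LogFactory.getLog(LoggingMapper.class);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            System.out.println("written to the task attempt's stdout file");
            System.err.println("written to the task attempt's stderr file");
            LOG.info("written to the task attempt's syslog file via log4j");
            context.write(value, key);  // pass-through output, just for illustration
        }
    }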

How to read in a RCFile

Question: I am trying to read a small RCFile (~200 rows of data) into a HashMap to do a map-side join, but I am having a lot of trouble getting the data in the file into a usable state. Here is what I have so far, most of which is lifted from this example:

    public void configure(JobConf job) {
        try {
            FileSystem fs = FileSystem.get(job);
            RCFile.Reader rcFileReader = new RCFile.Reader(fs, new Path("/path/to/file"), job);
            int counter = 1;
            while (rcFileReader.next(new LongWritable(counter))) {
                System.out
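
The snippet is cut off above. For what it's worth, here is a hedged sketch of one way to finish the loop, assuming the Hive-era RCFile reader API (getCurrentRow() filling a BytesRefArrayWritable) and a two-column file with the join key in column 0 and the value in column 1; the column positions and class name are assumptions.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.io.RCFile;
    import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
    import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;

    public class RcFileLookup {
        public static Map<String, String> load(JobConf job, Path path) throws IOException {
            Map<String, String> lookup = new HashMap<String, String>();
            FileSystem fs = FileSystem.get(job);
            RCFile.Reader reader = new RCFile.Reader(fs, path, job);
            try {
                LongWritable rowId = new LongWritable();
                BytesRefArrayWritable row = new BytesRefArrayWritable();
                while (reader.next(rowId)) {
                    reader.getCurrentRow(row);  // column references for the current row
                    BytesRefWritable keyRef = row.get(0);
                    BytesRefWritable valRef = row.get(1);
                    String key = Text.decode(keyRef.getData(), keyRef.getStart(), keyRef.getLength());
                    String val = Text.decode(valRef.getData(), valRef.getStart(), valRef.getLength());
                    lookup.put(key, val);
                }
            } finally {
                reader.close();
            }
            return lookup;
        }
    }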

Output file contains Mapper Output instead of Reducer output

Question: Hi, I am trying to find the average of a few numbers using the MapReduce technique in standalone mode. I have two input files containing the values, file1: 25 25 25 25 25 and file2: 15 15 15 15 15. My program runs fine, but the output file contains the output of the mapper instead of the reducer output. Here is my code:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import
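
The code is truncated above, but the usual cause of "mapper output in the output file" is a reduce() method whose signature does not match what the new-API Reducer calls, so the default identity reducer runs instead (another common cause is the driver never calling job.setReducerClass). A hedged sketch of an averaging reducer with a signature the framework will actually invoke; the class and type choices are assumptions, not taken from the truncated code:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class AverageReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
        // @Override makes the compiler complain if this signature drifts from the one
        // the framework calls; with a mismatched signature the identity reduce runs
        // and the mapper's key/value pairs land in the output file unchanged.
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            int count = 0;
            for (IntWritable v : values) {
                sum += v.get();
                count++;
            }
            context.write(key, new DoubleWritable((double) sum / count));
        }
    }

The driver also needs job.setOutputKeyClass and job.setOutputValueClass to match the reducer's output types (Text and DoubleWritable here).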

MapReduce - how do I calculate relative values (average, top k and so)?

Question: I'm looking for a way to calculate "global" or "relative" values during a MapReduce process: an average, a sum, a top k and so on. Say I have a list of workers, with their IDs associated with their salaries (and a bunch of other stuff). At some stage of the processing, I'd like to know which workers earn the top 10% of salaries. For that I need some "global" view of the values, which I can't figure out how to get. If I have all values sent to a single reducer, it has that global view, but then I lose
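
One common pattern for the top-N part (a sketch of a standard technique, not taken from any answer here): each mapper keeps only its local top N in memory and emits them from cleanup(), and a job configured with a single reducer merges those few candidates, so the lone reducer sees roughly the number of mappers times N records instead of the whole dataset. For a percentile such as "top 10%" the total record count is also needed, e.g. from a counter or a first pass. The record layout and N below are assumptions.

    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits only this mapper's local top N records; a job with a single reducer
    // then merges the per-mapper candidates into the global top N.
    public class TopNSalaryMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        private static final int N = 100;  // assumed size of the "top" set
        private final TreeMap<Double, String> topN = new TreeMap<Double, String>();

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            String[] parts = value.toString().split(",");  // assumed layout: workerId,salary
            double salary = Double.parseDouble(parts[1]);
            topN.put(salary, value.toString());            // note: equal salaries overwrite each other in this sketch
            if (topN.size() > N) {
                topN.remove(topN.firstKey());              // drop the smallest salary kept so far
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            for (Map.Entry<Double, String> entry : topN.entrySet()) {
                context.write(NullWritable.get(), new Text(entry.getValue()));
            }
        }
    }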

STDIN or file as mapper input in Hadoop environment?

Question: As we need to read a bunch of files into the mapper, in a non-Hadoop environment I use os.walk(dir) and file=open(path, mode) to read each file. However, in a Hadoop environment, since I have read that Hadoop Streaming converts the file input into the mapper's stdin and converts the reducer's stdout into file output, I have a few questions about how to feed in the files: Do we have to read input from STDIN in mapper.py and let Hadoop Streaming convert the files in the HDFS input directory to STDIN? If I want to read in each file separately