MapReduce

Custom Binary Input - Hadoop

若如初见, submitted on 2019-12-13 05:49:12
Question: I am developing a demo application in Hadoop and my input is .mrc image files. I want to load them into Hadoop and do some image processing on them. These are binary files that contain a large header with metadata, followed by the data for a set of images. How to read the images is also described in the header (e.g. number_of_images, number_of_pixels_x, number_of_pixels_y, bytes_per_pixel), so after the header bytes, the first [number_of_pixels_x*number_of_pixels_y*bytes_per…
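A common pattern for this kind of binary input (not taken from the question; the class name, key/value choices, and the assumption that a whole .mrc file fits in a task's memory are mine) is a non-splittable FileInputFormat that hands each file to one mapper, which then parses the header itself. A minimal sketch:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeMrcFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // The header describes the whole file, so never split it across mappers.
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new RecordReader<NullWritable, BytesWritable>() {
            private FileSplit fileSplit;
            private TaskAttemptContext ctx;
            private final BytesWritable value = new BytesWritable();
            private boolean consumed = false;

            @Override
            public void initialize(InputSplit s, TaskAttemptContext c) {
                fileSplit = (FileSplit) s;
                ctx = c;
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (consumed) {
                    return false;
                }
                // Read the entire file (header + image data) as one record; the
                // mapper then decodes number_of_images, pixel counts, and so on.
                Path path = fileSplit.getPath();
                FileSystem fs = path.getFileSystem(ctx.getConfiguration());
                byte[] contents = new byte[(int) fileSplit.getLength()];
                try (FSDataInputStream in = fs.open(path)) {
                    in.readFully(0, contents);
                }
                value.set(contents, 0, contents.length);
                consumed = true;
                return true;
            }

            @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
            @Override public BytesWritable getCurrentValue() { return value; }
            @Override public float getProgress() { return consumed ? 1.0f : 0.0f; }
            @Override public void close() {}
        };
    }
}

For files too large to buffer, a real implementation would instead stream past the header and emit one image per record rather than loading everything into a single byte array.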

Can't use CompositeInputFormat with Hadoop, throwing exception Expression is null

柔情痞子, submitted on 2019-12-13 05:37:41
Question: I'm using MRv1 from CDH4 (4.5) and facing a problem with CompositeInputFormat. It doesn't matter how many inputs I try to join. For the sake of simplicity, here's an example with just one input:

Configuration conf = new Configuration();
Job job = new Job(conf, "Blah");
job.setJarByClass(Blah.class);
job.setMapperClass(Blah.BlahMapper.class);
job.setReducerClass(Blah.BlahReducer.class);
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(BlahElement.class);
job…
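"Expression is null" from CompositeInputFormat usually means the join expression was never set in the configuration the job actually runs with. A hedged sketch of the usual wiring with the old org.apache.hadoop.mapred join API (the paths, the "inner" operation, and the input format are placeholders, not taken from the question):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class JoinDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(JoinDriver.class);
        conf.setJobName("composite-join");
        conf.setInputFormat(CompositeInputFormat.class);
        // Without this property CompositeInputFormat has no expression to parse,
        // which surfaces as an "Expression is null"-style failure at runtime.
        conf.set("mapred.join.expr", CompositeInputFormat.compose(
                "inner", KeyValueTextInputFormat.class,
                new Path("/data/left"), new Path("/data/right")));
        FileOutputFormat.setOutputPath(conf, new Path("/data/joined"));
        // Mapper/reducer and output key/value classes omitted for brevity.
        JobClient.runJob(conf);
    }
}

With the new-API Job shown in the question the same idea applies: the expression has to end up in job.getConfiguration() before submission, and changes made to the original Configuration after new Job(conf) are not seen by the job.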

Loading datetime format files using PIG

天涯浪子, submitted on 2019-12-13 05:34:17
Question: I have a dataset of the following form:

ravi,savings,avinash,2,char,33,F,22,44,12,13,33,44,22,11,10,22,2006-01-23
avinash,current,sandeep,3,char,44,M,33,11,10,12,33,22,39,12,23,19,2001-02-12
supreeth,savings,prabhash,4,char,55,F,22,12,23,12,44,56,7,88,34,23,1995-03-11
lavi,current,nirmesh,5,char,33,M,11,10,33,34,56,78,54,23,445,66,1999-06-15
Venkat,savings,bunny,6,char,11,F,99,12,34,55,33,23,45,66,23,23,2016-05-18

The last column (for example, 2006-01-23) is a date. I am trying to load the above data…
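One hedged way to handle the date column (not the asker's code; the field names and the 'accounts.txt' path are made up) is to load every field as a plain type first and convert the last column with Pig's ToDate built-in, shown here through the Java PigServer API so it can be run from a driver program:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class LoadWithDates {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Load the comma-separated records; the last field is kept as chararray
        // so it can be converted explicitly with a known pattern.
        pig.registerQuery("raw = LOAD 'accounts.txt' USING PigStorage(',') AS ("
                + "name:chararray, acct_type:chararray, nick:chararray, id:int, tag:chararray, "
                + "age:int, sex:chararray, c1:int, c2:int, c3:int, c4:int, c5:int, c6:int, "
                + "c7:int, c8:int, c9:int, c10:int, opened:chararray);");
        // ToDate(chararray, format) turns the string into a Pig datetime value.
        pig.registerQuery("typed = FOREACH raw GENERATE name..c10, "
                + "ToDate(opened, 'yyyy-MM-dd') AS opened;");
        pig.store("typed", "accounts_with_dates");
    }
}

The same two Pig Latin statements can of course be typed directly into the grunt shell instead of going through PigServer.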

PDF input format for Mapreduce Hadoop

余生长醉, submitted on 2019-12-13 05:30:14
Question: Hi, I am using the PDFBox external library for parsing the PDF input file in MapReduce, but I am getting the following error:

Error: java.lang.ClassNotFoundException: org.apache.pdfbox.pdmodel.PDDocument
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)…
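A ClassNotFoundException for org.apache.pdfbox.pdmodel.PDDocument inside a task almost always means the PDFBox jar is visible to the client but not on the task classpath. One hedged way to ship it (the HDFS path and version below are placeholders) is to copy the jar to HDFS and add it to the distributed cache from the driver:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class PdfJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "pdf-parsing");
        job.setJarByClass(PdfJobDriver.class);
        // Ship the PDFBox jar (already copied to HDFS) to every task and put it
        // on the task classpath via the distributed cache.
        job.addFileToClassPath(new Path("/libs/pdfbox-1.8.10.jar"));
        // ... mapper/reducer, input/output paths, then job.waitForCompletion(true)
    }
}

Passing -libjars pdfbox.jar on the command line achieves the same thing when the driver parses generic options through Tool/ToolRunner (see the -libjars entry at the end of this list).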

Hive job causes MapReduce error: Call From hmaster/127.0.0.1 to localhost:44849 failed on connection exception

我们两清, submitted on 2019-12-13 05:19:38
Question: When I run the following in the Hive command line:

hive> select count(*) from alogs;

the terminal shows:

Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job =…
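The error in the title usually comes down to hostname resolution on the node: "hmaster/127.0.0.1" in the message suggests the hostname hmaster resolves to 127.0.0.1, so the client is told to contact services on localhost and the connection to the ephemeral port (44849 here) fails. A small diagnostic sketch; the actual fix (correcting /etc/hosts or the cluster addresses) is outside this snippet and is an assumption about the setup:

import java.net.InetAddress;

public class ResolveCheck {
    public static void main(String[] args) throws Exception {
        // On a multi-node cluster this should print the node's real address;
        // if it prints 127.0.0.1, other daemons and tasks will be directed to
        // "localhost" and fail with connection-refused errors like the one above.
        InetAddress addr = InetAddress.getByName("hmaster");
        System.out.println("hmaster resolves to " + addr.getHostAddress());
    }
}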

Error during benchmarking Sort in Hadoop2 - Partitions do not match

空扰寡人, submitted on 2019-12-13 04:56:50
Question: I am trying to benchmark the Hadoop 2 MapReduce framework. It is NOT TeraSort, but testmapredsort.

Step 1, create random data:
hadoop jar hadoop/ randomwriter -Dtest.randomwrite.bytes_per_map=100 -Dtest.randomwriter.maps_per_host=10 /data/unsorted-data

Step 2, sort the random data created in step 1:
hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar sort /data/unsorted-data /data/sorted-data

Step 3, check whether the sorting by MR works:
hadoop jar hadoop/share/hadoop/mapreduce…

Why can't we calculate job execution time in Hadoop?

五迷三道, submitted on 2019-12-13 04:53:51
Question: My question is related to the straggler problem. Sort is an algorithm, so we know its complexity and can calculate its running time on a fixed data set. Why can't we obtain job execution time in Hadoop? If we could obtain the job or task execution time, we could identify straggler tasks quickly, without needing algorithms to decide which task is a straggler. Answer 1: You should not estimate how much time a job will take before running that job. After running your…
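What can be measured is the elapsed time once the job has run. A hedged sketch using the standard Job/JobStatus API (the job name is made up and the mapper/reducer wiring is omitted; the timestamps come from the framework's own bookkeeping):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobStatus;

public class TimedDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "timed-job");
        // ... configure mapper, reducer, input and output paths here ...
        boolean ok = job.waitForCompletion(true);
        // Elapsed wall-clock time as recorded by the framework for this job.
        JobStatus status = job.getStatus();
        long elapsedMs = status.getFinishTime() - status.getStartTime();
        System.out.println("succeeded=" + ok + ", elapsed ms=" + elapsedMs);
    }
}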

java.io.IOException: error=2, No such file or directory error in Hadoop streaming

ぃ、小莉子, submitted on 2019-12-13 04:43:45
Question: Please help with the "-file" option issue of Hadoop streaming (mentioned in the link below). Just to update: I know that the jar is already there. I am trying this after Hadoop streaming failed for a different class file, to identify whether something is wrong with the class file itself or with the way I am using it. If you need the stderr file, please let me know. Problem with Hadoop Streaming -file option for Java class files. Answer 1: You can't really use -file to send over…

Optimal Block Size for a Hadoop Cluster

为君一笑, submitted on 2019-12-13 04:43:02
Question: I am working on a four-node Hadoop cluster. I have run a series of experiments with the following block sizes and measured the run times, all on a 20 GB input file:

64 MB - 32 min
128 MB - 19 min
256 MB - 15 min
1 GB - 12.5 min

Should I go further and try a 2 GB block size? Also, kindly explain what an optimal block size would be if similar operations are performed on a 90 GB file. Thanks! Answer 1: You should test with 2 GB and compare the results. Just consider the following:…
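For experiments like this, the block size can be set per file or per job without reconfiguring the cluster. A hedged sketch (the 256 MB value and the paths are only examples) that writes a file with an explicit block size through the HDFS client API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.blocksize is the Hadoop 2 property name; it only affects files
        // written while it is set, and existing files keep their old block size.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/data/input-256m.dat"))) {
            out.writeBytes("payload would be copied here\n");
        }
    }
}

The same property can be supplied on the shell when copying the test input, e.g. hdfs dfs -D dfs.blocksize=268435456 -put, so each experiment can re-upload the 20 GB file with a different block size.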

Hadoop distributed cache: using -libjars: How to use external jars in your code

非 Y 不嫁゛, submitted on 2019-12-13 04:35:38
Question: Okay, I am able to add external jars to my job using the -libjars path. Now, how do I use those external jars in my code? Say I have a function defined in that jar which operates on a String; how do I use it? Using context.getArchiveClassPaths() I can get a path to it, but I don't know how to instantiate that object. Here is the sample class from the jar that I am importing:

package replace;

public class ReplacingAcronyms {
    public static String Replace(String abc) {
        String n;
        n = "This is trial";
        return n;
    }
}
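The short version is that -libjars only handles shipping and classpath setup; in the code you simply import and call the class as usual. A hedged sketch (the mapper's key/value types and the job wiring are assumptions, reduced to the relevant lines):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import replace.ReplacingAcronyms;

public class AcronymMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Call the static helper from the external jar like any other class;
        // -libjars makes it available on the task classpath at runtime.
        String replaced = ReplacingAcronyms.Replace(value.toString());
        context.write(key, new Text(replaced));
    }
}

For -libjars to be honored, the driver has to go through GenericOptionsParser (typically by implementing Tool and launching with ToolRunner.run), and the jar also needs to be on the local classpath (for example via HADOOP_CLASSPATH) when compiling and launching the driver.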