MapReduce

Custom Binary Input - Hadoop

若如初见, submitted on 2019-12-13 05:49:12
Question: I am developing a demo application in Hadoop and my input is .mrc image files. I want to load them into Hadoop and do some image processing on them. These are binary files that contain a large header with metadata, followed by the data for a set of images. How to read the images is also described in the header (e.g. number_of_images, number_of_pixels_x, number_of_pixels_y, bytes_per_pixel), so after the header bytes, the first [number_of_pixels_x*number_of_pixels_y*bytes_per…
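A common pattern for this kind of binary input (not taken from the question; the class name, key/value choices, and the assumption that a whole .mrc file fits in a task's memory are mine) is a non-splittable FileInputFormat that hands each file to one mapper, which then parses the header itself. A minimal sketch:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeMrcFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // The header describes the whole file, so never split it across mappers.
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new RecordReader<NullWritable, BytesWritable>() {
            private FileSplit fileSplit;
            private TaskAttemptContext ctx;
            private final BytesWritable value = new BytesWritable();
            private boolean consumed = false;

            @Override
            public void initialize(InputSplit s, TaskAttemptContext c) {
                fileSplit = (FileSplit) s;
                ctx = c;
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (consumed) {
                    return false;
                }
                // Read the entire file (header + image data) as one record; the
                // mapper then decodes number_of_images, pixel counts, and so on.
                Path path = fileSplit.getPath();
                FileSystem fs = path.getFileSystem(ctx.getConfiguration());
                byte[] contents = new byte[(int) fileSplit.getLength()];
                try (FSDataInputStream in = fs.open(path)) {
                    in.readFully(0, contents);
                }
                value.set(contents, 0, contents.length);
                consumed = true;
                return true;
            }

            @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
            @Override public BytesWritable getCurrentValue() { return value; }
            @Override public float getProgress() { return consumed ? 1.0f : 0.0f; }
            @Override public void close() {}
        };
    }
}

For files too large to buffer, a real implementation would instead stream past the header and emit one image per record rather than loading everything into a single byte array.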

Can't use CompositeInputFormat with Hadoop, throwing exception Expression is null

柔情痞子, submitted on 2019-12-13 05:37:41
Question: I'm using MRv1 from CDH4 (4.5) and facing a problem with CompositeInputFormat. It doesn't matter how many inputs I try to join. For the sake of simplicity, here's an example with just one input:

Configuration conf = new Configuration();
Job job = new Job(conf, "Blah");
job.setJarByClass(Blah.class);
job.setMapperClass(Blah.BlahMapper.class);
job.setReducerClass(Blah.BlahReducer.class);
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(BlahElement.class);
job…
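"Expression is null" from CompositeInputFormat usually means the join expression was never set in the configuration the job actually runs with. A hedged sketch of the usual wiring with the old org.apache.hadoop.mapred join API (the paths, the "inner" operation, and the input format are placeholders, not taken from the question):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class JoinDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(JoinDriver.class);
        conf.setJobName("composite-join");
        conf.setInputFormat(CompositeInputFormat.class);
        // Without this property CompositeInputFormat has no expression to parse,
        // which surfaces as an "Expression is null"-style failure at runtime.
        conf.set("mapred.join.expr", CompositeInputFormat.compose(
                "inner", KeyValueTextInputFormat.class,
                new Path("/data/left"), new Path("/data/right")));
        FileOutputFormat.setOutputPath(conf, new Path("/data/joined"));
        // Mapper/reducer and output key/value classes omitted for brevity.
        JobClient.runJob(conf);
    }
}

With the new-API Job shown in the question the same idea applies: the expression has to end up in job.getConfiguration() before submission, and changes made to the original Configuration after new Job(conf) are not seen by the job.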

Loading datetime format files using PIG

天涯浪子, submitted on 2019-12-13 05:34:17
Question: I have a dataset of the following form:

ravi,savings,avinash,2,char,33,F,22,44,12,13,33,44,22,11,10,22,2006-01-23
avinash,current,sandeep,3,char,44,M,33,11,10,12,33,22,39,12,23,19,2001-02-12
supreeth,savings,prabhash,4,char,55,F,22,12,23,12,44,56,7,88,34,23,1995-03-11
lavi,current,nirmesh,5,char,33,M,11,10,33,34,56,78,54,23,445,66,1999-06-15
Venkat,savings,bunny,6,char,11,F,99,12,34,55,33,23,45,66,23,23,2016-05-18

The last column (for example, 2006-01-23) is a date. I am trying to load the above data…
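One hedged way to handle the date column (not the asker's code; the field names and the 'accounts.txt' path are made up) is to load every field as a plain type first and convert the last column with Pig's ToDate built-in, shown here through the Java PigServer API so it can be run from a driver program:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class LoadWithDates {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Load the comma-separated records; the last field is kept as chararray
        // so it can be converted explicitly with a known pattern.
        pig.registerQuery("raw = LOAD 'accounts.txt' USING PigStorage(',') AS ("
                + "name:chararray, acct_type:chararray, nick:chararray, id:int, tag:chararray, "
                + "age:int, sex:chararray, c1:int, c2:int, c3:int, c4:int, c5:int, c6:int, "
                + "c7:int, c8:int, c9:int, c10:int, opened:chararray);");
        // ToDate(chararray, format) turns the string into a Pig datetime value.
        pig.registerQuery("typed = FOREACH raw GENERATE name..c10, "
                + "ToDate(opened, 'yyyy-MM-dd') AS opened;");
        pig.store("typed", "accounts_with_dates");
    }
}

The same two Pig Latin statements can of course be typed directly into the grunt shell instead of going through PigServer.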

PDF input format for Mapreduce Hadoop

余生长醉, submitted on 2019-12-13 05:30:14
Question: Hi, I am using the PDFBox external library for parsing the PDF input file in MapReduce, but I am getting the following error:

Error: java.lang.ClassNotFoundException: org.apache.pdfbox.pdmodel.PDDocument
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)…
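A ClassNotFoundException for org.apache.pdfbox.pdmodel.PDDocument inside a task almost always means the PDFBox jar is visible to the client but not on the task classpath. One hedged way to ship it (the HDFS path and version below are placeholders) is to copy the jar to HDFS and add it to the distributed cache from the driver:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class PdfJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "pdf-parsing");
        job.setJarByClass(PdfJobDriver.class);
        // Ship the PDFBox jar (already copied to HDFS) to every task and put it
        // on the task classpath via the distributed cache.
        job.addFileToClassPath(new Path("/libs/pdfbox-1.8.10.jar"));
        // ... mapper/reducer, input/output paths, then job.waitForCompletion(true)
    }
}

Passing -libjars pdfbox.jar on the command line achieves the same thing when the driver parses generic options through Tool/ToolRunner (see the -libjars entry at the end of this list).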

Hive job causes MapReduce error: Call From hmaster/127.0.0.1 to localhost:44849 failed on connection exception

我们两清, submitted on 2019-12-13 05:19:38
Question: When I run the following in the Hive command line:

hive> select count(*) from alogs;

the terminal shows:

Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job =…
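The error in the title usually comes down to hostname resolution on the node: "hmaster/127.0.0.1" in the message suggests the hostname hmaster resolves to 127.0.0.1, so the client is told to contact services on localhost and the connection to the ephemeral port (44849 here) fails. A small diagnostic sketch; the actual fix (correcting /etc/hosts or the cluster addresses) is outside this snippet and is an assumption about the setup:

import java.net.InetAddress;

public class ResolveCheck {
    public static void main(String[] args) throws Exception {
        // On a multi-node cluster this should print the node's real address;
        // if it prints 127.0.0.1, other daemons and tasks will be directed to
        // "localhost" and fail with connection-refused errors like the one above.
        InetAddress addr = InetAddress.getByName("hmaster");
        System.out.println("hmaster resolves to " + addr.getHostAddress());
    }
}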

Error during benchmarking Sort in Hadoop2 - Partitions do not match

空扰寡人, submitted on 2019-12-13 04:56:50
Question: I am trying to benchmark the Hadoop 2 MapReduce framework. It is NOT TeraSort, but testmapredsort.

Step 1, create random data:
hadoop jar hadoop/ randomwriter -Dtest.randomwrite.bytes_per_map=100 -Dtest.randomwriter.maps_per_host=10 /data/unsorted-data

Step 2, sort the random data created in step 1:
hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar sort /data/unsorted-data /data/sorted-data

Step 3, check whether the sorting by MR works:
hadoop jar hadoop/share/hadoop/mapreduce…

Why can't we calculate job execution time in Hadoop?

五迷三道, submitted on 2019-12-13 04:53:51
Question: My question is related to the straggler problem. Sort is an algorithm, so we know its complexity and can calculate its running time on a fixed data set. Why can't we obtain job execution time in Hadoop? If we could obtain the job or task execution time, we could identify straggler tasks quickly, without needing algorithms to decide which task is a straggler. Answer 1: You should not estimate how much time a job will take before running that job. After running your…
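What can be measured is the elapsed time once the job has run. A hedged sketch using the standard Job/JobStatus API (the job name is made up and the mapper/reducer wiring is omitted; the timestamps come from the framework's own bookkeeping):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobStatus;

public class TimedDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "timed-job");
        // ... configure mapper, reducer, input and output paths here ...
        boolean ok = job.waitForCompletion(true);
        // Elapsed wall-clock time as recorded by the framework for this job.
        JobStatus status = job.getStatus();
        long elapsedMs = status.getFinishTime() - status.getStartTime();
        System.out.println("succeeded=" + ok + ", elapsed ms=" + elapsedMs);
    }
}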

java.io.IOException: error=2, No such file or directory error in Hadoop streaming

ぃ、小莉子, submitted on 2019-12-13 04:43:45
Question: Please help with the "-file" option issue of Hadoop streaming (mentioned in the link below). Just to update: I know that the jar is already there. I am trying this after Hadoop streaming failed for a different class file, to identify whether something is wrong with the class file itself or with the way I am using it. If you need the stderr file, please let me know. Problem with Hadoop Streaming -file option for Java class files. Answer 1: You can't really use -file to send over…

Optimal Block Size for a Hadoop Cluster

为君一笑, submitted on 2019-12-13 04:43:02
Question: I am working on a four-node Hadoop cluster. I have run a series of experiments with the following block sizes and measured the run times, all on a 20 GB input file:

64 MB - 32 min
128 MB - 19 min
256 MB - 15 min
1 GB - 12.5 min

Should I go further and try a 2 GB block size? Also, kindly explain what an optimal block size would be if similar operations are performed on a 90 GB file. Thanks! Answer 1: You should test with 2 GB and compare the results. Just consider the following:…
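For experiments like this, the block size can be set per file or per job without reconfiguring the cluster. A hedged sketch (the 256 MB value and the paths are only examples) that writes a file with an explicit block size through the HDFS client API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.blocksize is the Hadoop 2 property name; it only affects files
        // written while it is set, and existing files keep their old block size.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/data/input-256m.dat"))) {
            out.writeBytes("payload would be copied here\n");
        }
    }
}

The same property can be supplied on the shell when copying the test input, e.g. hdfs dfs -D dfs.blocksize=268435456 -put, so each experiment can re-upload the 20 GB file with a different block size.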

Hadoop distributed cache: using -libjars: How to use external jars in your code

非 Y 不嫁゛, submitted on 2019-12-13 04:35:38
Question: Okay, I am able to add external jars to my job using the -libjars path. Now, how do I use those external jars in my code? Say I have a function defined in that jar which operates on a String; how do I use it? Using context.getArchiveClassPaths() I can get a path to it, but I don't know how to instantiate that object. Here is the sample class from the jar that I am importing:

package replace;

public class ReplacingAcronyms {
    public static String Replace(String abc) {
        String n;
        n = "This is trial";
        return n;
    }
}
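The short version is that -libjars only handles shipping and classpath setup; in the code you simply import and call the class as usual. A hedged sketch (the mapper's key/value types and the job wiring are assumptions, reduced to the relevant lines):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import replace.ReplacingAcronyms;

public class AcronymMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Call the static helper from the external jar like any other class;
        // -libjars makes it available on the task classpath at runtime.
        String replaced = ReplacingAcronyms.Replace(value.toString());
        context.write(key, new Text(replaced));
    }
}

For -libjars to be honored, the driver has to go through GenericOptionsParser (typically by implementing Tool and launching with ToolRunner.run), and the jar also needs to be on the local classpath (for example via HADOOP_CLASSPATH) when compiling and launching the driver.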