Question
My MapReduce app counts usage of field values in a Hive table. I managed to build and run it from Eclipse after including all jars from the /usr/lib/hadoop, /usr/lib/hive and /usr/lib/hcatalog directories. It works.
After much frustration I have also managed to compile and run it as a Maven project:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.bigdata.hadoop</groupId>
  <artifactId>FieldCounts</artifactId>
  <packaging>jar</packaging>
  <name>FieldCounts</name>
  <version>0.0.1-SNAPSHOT</version>
  <url>http://maven.apache.org</url>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.3.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hcatalog</groupId>
      <artifactId>hcatalog-core</artifactId>
      <version>0.11.0</version>
    </dependency>
  </dependencies>
</project>
To run the job from the command line, the following script is used:
#!/bin/sh
export LIBJARS=/usr/lib/hcatalog/share/hcatalog/hcatalog-core.jar,/usr/lib/hive/lib/hive-exec-0.12.0.2.0.6.1-101.jar,/usr/lib/hive/lib/hive-metastore-0.12.0.2.0.6.1-101.jar,/usr/lib/hive/lib/libfb303-0.9.0.jar,/usr/lib/hive/lib/jdo-api-3.0.1.jar,/usr/lib/hive/lib/antlr-runtime-3.4.jar,/usr/lib/hive/lib/datanucleus-api-jdo-3.2.1.jar,/usr/lib/hive/lib/datanucleus-core-3.2.2.jar
export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:.:/usr/lib/hcatalog/share/hcatalog/hcatalog-core.jar:/usr/lib/hive/lib/hive-exec-0.12.0.2.0.6.1-101.jar:/usr/lib/hive/lib/hive-metastore-0.12.0.2.0.6.1-101.jar:/usr/lib/hive/lib/libfb303-0.9.0.jar:/usr/lib/hive/lib/jdo-api-3.0.1.jar:/usr/lib/hive/lib/antlr-runtime-3.4.jar:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.1.jar:/usr/lib/hive/lib/datanucleus-core-3.2.2.jar
hadoop jar FieldCounts-0.0.1-SNAPSHOT.jar com.bigdata.hadoop.FieldCounts -libjars ${LIBJARS} simple simpout
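(As an aside: instead of passing every dependency with -libjars, jars can also be attached to the job classpath programmatically from the driver. A minimal sketch, assuming the dependency jars have already been copied to some HDFS directory; the helper class name and the path argument are illustrative only, not part of the original code:)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class LibJarsHelper {
    /** Add every jar found in the given HDFS directory to the task classpath. */
    public static void addLibJars(Job job, String hdfsLibDir) throws Exception {
        Configuration conf = job.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path(hdfsLibDir))) {
            if (status.getPath().getName().endsWith(".jar")) {
                // Similar in effect to -libjars, for jars that already live on HDFS.
                job.addFileToClassPath(status.getPath());
            }
        }
    }
}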
Hadoop now creates and starts the job, which then fails because Hadoop cannot find the Map class:
14/03/26 16:25:58 INFO mapreduce.Job: Running job: job_1395407010870_0007
14/03/26 16:26:07 INFO mapreduce.Job: Job job_1395407010870_0007 running in uber mode : false
14/03/26 16:26:07 INFO mapreduce.Job: map 0% reduce 0%
14/03/26 16:26:13 INFO mapreduce.Job: Task Id : attempt_1395407010870_0007_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.bigdata.hadoop.FieldCounts$Map not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1720)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:186)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:721)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.lang.ClassNotFoundException: Class com.bigdata.hadoop.FieldCounts$Map not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1626)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1718)
... 8 more
Why does this happen? The job jar contains all the classes, including Map:
jar tvf FieldCounts-0.0.1-SNAPSHOT.jar
0 Wed Mar 26 15:51:06 MSK 2014 META-INF/
121 Wed Mar 26 15:51:04 MSK 2014 META-INF/MANIFEST.MF
0 Wed Mar 26 14:29:58 MSK 2014 com/
0 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/
0 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/
3992 Fri Mar 21 17:29:22 MSK 2014 hive-site.xml
4093 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/FieldCounts.class
2961 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/FieldCounts$Reduce.class
1621 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/TableFieldValueKey.class
4186 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/FieldCounts$Map.class
0 Wed Mar 26 15:51:06 MSK 2014 META-INF/maven/
0 Wed Mar 26 15:51:06 MSK 2014 META-INF/maven/com.bigdata.hadoop/
0 Wed Mar 26 15:51:06 MSK 2014 META-INF/maven/com.bigdata.hadoop/FieldCounts/
1030 Wed Mar 26 14:28:22 MSK 2014 META-INF/maven/com.bigdata.hadoop/FieldCounts/pom.xml
123 Wed Mar 26 14:30:02 MSK 2014 META-INF/maven/com.bigdata.hadoop/FieldCounts/pom.properties
What is wrong? Should I put the Map and Reduce classes in separate files?
MapReduce code:
package com.bigdata.hadoop;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.util.*;
import org.apache.hcatalog.mapreduce.*;
import org.apache.hcatalog.data.*;
import org.apache.hcatalog.data.schema.*;
import org.apache.log4j.Logger;

public class FieldCounts extends Configured implements Tool {

    public static class Map extends Mapper<WritableComparable, HCatRecord, TableFieldValueKey, IntWritable> {

        static Logger logger = Logger.getLogger("com.foo.Bar");
        static boolean firstMapRun = true;
        static List<String> fieldNameList = new LinkedList<String>();

        /**
         * Return a list of field names not containing `id` field name
         * @param schema
         * @return
         */
        static List<String> getFieldNames(HCatSchema schema) {
            // Filter out `id` name just once
            if (firstMapRun) {
                firstMapRun = false;
                List<String> fieldNames = schema.getFieldNames();
                for (String fieldName : fieldNames) {
                    if (!fieldName.equals("id")) {
                        fieldNameList.add(fieldName);
                    }
                }
            } // if (firstMapRun)
            return fieldNameList;
        }

        @Override
        protected void map( WritableComparable key,
                            HCatRecord hcatRecord,
                            //org.apache.hadoop.mapreduce.Mapper
                            //<WritableComparable, HCatRecord, Text, IntWritable>.Context context)
                            Context context)
            throws IOException, InterruptedException {

            HCatSchema schema = HCatBaseInputFormat.getTableSchema(context.getConfiguration());
            //String schemaTypeStr = schema.getSchemaAsTypeString();
            //logger.info("******** schemaTypeStr ********** : "+schemaTypeStr);
            //List<String> fieldNames = schema.getFieldNames();
            List<String> fieldNames = getFieldNames(schema);
            for (String fieldName : fieldNames) {
                Object value = hcatRecord.get(fieldName, schema);
                String fieldValue = null;
                if (null == value) {
                    fieldValue = "<NULL>";
                } else {
                    fieldValue = value.toString();
                }
                //String fieldNameValue = fieldName+"."+fieldValue;
                //context.write(new Text(fieldNameValue), new IntWritable(1));
                TableFieldValueKey fieldKey = new TableFieldValueKey();
                fieldKey.fieldName = fieldName;
                fieldKey.fieldValue = fieldValue;
                context.write(fieldKey, new IntWritable(1));
            }
        }
    }

    public static class Reduce extends Reducer<TableFieldValueKey, IntWritable,
                                               WritableComparable, HCatRecord> {

        protected void reduce( TableFieldValueKey key,
                               java.lang.Iterable<IntWritable> values,
                               Context context)
            //org.apache.hadoop.mapreduce.Reducer<Text, IntWritable,
            //WritableComparable, HCatRecord>.Context context)
            throws IOException, InterruptedException {

            Iterator<IntWritable> iter = values.iterator();
            int sum = 0;
            // Sum up occurrences of the given key
            while (iter.hasNext()) {
                IntWritable iw = iter.next();
                sum = sum + iw.get();
            }
            HCatRecord record = new DefaultHCatRecord(3);
            record.set(0, key.fieldName);
            record.set(1, key.fieldValue);
            record.set(2, sum);
            context.write(null, record);
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args).getRemainingArgs();

        // To fix Hadoop "META-INFO" (http://stackoverflow.com/questions/17265002/hadoop-no-filesystem-for-scheme-file)
        conf.set("fs.hdfs.impl",
            org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
        conf.set("fs.file.impl",
            org.apache.hadoop.fs.LocalFileSystem.class.getName());

        // Get the input and output table names as arguments
        String inputTableName = args[0];
        String outputTableName = args[1];
        // Assume the default database
        String dbName = null;

        Job job = new Job(conf, "FieldCounts");

        HCatInputFormat.setInput(job,
            InputJobInfo.create(dbName, inputTableName, null));
        job.setJarByClass(FieldCounts.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        // An HCatalog record as input
        job.setInputFormatClass(HCatInputFormat.class);
        // Mapper emits TableFieldValueKey as key and an integer as value
        job.setMapOutputKeyClass(TableFieldValueKey.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Ignore the key for the reducer output; emitting an HCatalog record as value
        job.setOutputKeyClass(WritableComparable.class);
        job.setOutputValueClass(DefaultHCatRecord.class);
        job.setOutputFormatClass(HCatOutputFormat.class);

        HCatOutputFormat.setOutput(job,
            OutputJobInfo.create(dbName, outputTableName, null));
        HCatSchema s = HCatOutputFormat.getTableSchema(job);
        System.err.println("INFO: output schema explicitly set for writing:" + s);
        HCatOutputFormat.setSchema(job, s);

        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        String classpath = System.getProperty("java.class.path");
        System.out.println("*** CLASSPATH: " + classpath);
        int exitCode = ToolRunner.run(new FieldCounts(), args);
        System.exit(exitCode);
    }
}
Answer 1:
As it turned out, the problem was in the permissions of the directory where the MapReduce jar was located. The jar was built in the home directory of a regular (non-hdfs) user. Since this MapReduce job writes its results directly into a Hive table, it has to be run as the hdfs user; when it is run as a regular user it has no permission to write data into the Hive table.
On the other hand, the home directory of a regular user on CentOS has 700 permissions. So when the hadoop jar ... command is run as a user other than the owner of that home directory, access to the MapReduce jar is denied somewhere while Hadoop is loading classes. That is why, under the hdfs user, the job fails with java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.bigdata.hadoop.MyMap not found.
Recursively changing the permissions of the home directory where the jar was built from 700 to 755 solves this problem.
A more important question remains, though: how do you run the job as a regular user so that it has permission to write data into a Hive table?
Answer 2:
I've found the following:
Set hadoop system user for client embedded in Java webapp
It allowed me to connect as the expected Hadoop user, but the jar is still not uploaded or executed... the ClassNotFound error remains.
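(For reference, the approach from that linked question boils down to wrapping the job submission in UserGroupInformation.doAs(). A minimal sketch, assuming a non-secured, non-Kerberos cluster; the RunAsUser wrapper class and the hard-coded "hdfs" user name are illustrative, not part of the original code:)
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.util.ToolRunner;

public class RunAsUser {
    public static void main(final String[] args) throws Exception {
        // Impersonate the "hdfs" user on a non-secured cluster; with Kerberos,
        // proxy-user configuration would be needed instead.
        UserGroupInformation ugi = UserGroupInformation.createRemoteUser("hdfs");
        int exitCode = ugi.doAs(new PrivilegedExceptionAction<Integer>() {
            public Integer run() throws Exception {
                // Run the existing FieldCounts driver under that identity.
                return ToolRunner.run(new com.bigdata.hadoop.FieldCounts(), args);
            }
        });
        System.exit(exitCode);
    }
}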
Source: https://stackoverflow.com/questions/22661978/hadoop-mapreduce-job-starts-but-can-not-find-map-class