MapReduce Hadoop on Linux - Change Reduce Key

Submitted by 无人久伴 on 2021-02-11 12:55:22

Question


I've been searching online for a proper tutorial about how to use map and reduce, but almost every WordCount example sucks and doesn't really explain how to use each function. I've seen everything about the theory, the keys, the map etc., but there is no CODE doing, for example, something different from WordCount.

I am using Ubuntu 20.10 on VirtualBox and Hadoop version 3.2.1 (if you need any more info, leave a comment).

My task is to manage a file that contains data for athletes who took part in the Olympics.

You will see that it contains a variety of info, like name, sex, age, weight, height etc.

I will show an example here (hope you understand it):

ID  Name       Sex  Age  Height  Weight  Team   NOC  Games  Year         Season  City       Sport       Event                        Medal
1   A Dijiang  M    24   180     80      China  CHN  1992   Summer 1992  Summer  Barcelona  Basketball  Basketball Men's Basketball  NA

Until now, I had to deal with fields that are identical across all of an athlete's records, like the name or the ID. (Imagine having one participant appear more than once, at different periods of time; that is my problem, because reduce can't recognise those records as the same.) If I could change the key the reduce function groups on to, for example, the participant's name, then I should get my correct result. In this code I search for players that won at least one medal.
My main is:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewWordCount {

        public static void main(String[] args) throws Exception {
            
            if(args.length != 3) {
                System.err.println("Give the correct arguments.");
                System.exit(3);
            }
    
            // Job 1.
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "count");
            job.setJarByClass(NewWordCount.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setMapperClass(NewWordMapper.class);
            job.setCombinerClass(NewWordReducer.class);
            job.setReducerClass(NewWordReducer.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job,new Path(args[1]));
            job.waitForCompletion(true);
       }
}

My Mapper is:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class NewWordMapper extends Mapper <LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    private String name = new String();
    private String sex = new String();
    private String age = new String();
    private String team = new String();
    private String sport = new String();
    private String games = new String();
    private String sum = new String();

    private String gold = "Gold";
    private String silver = "Silver";
    private String bronze = "Bronze";

    public void map(LongWritable key, Text value, Context context) throws IOException,InterruptedException {
    
        if(((LongWritable)key).get() == 0) {
            return;
        }
    
        String line = value.toString();
        String[] arrOfStr = line.split(",");
        int counter = 0;
    
        for(String a : arrOfStr) {
            if(counter == 14) {             
                // setting the type of medal each player has won.
                word.set(a);
            
                // checking if the medal is gold, silver, or bronze.
                if(a.compareTo(gold) == 0 || a.compareTo(silver) == 0 || a.compareTo(bronze) == 0) {
                    String[] goldenStr = line.split(",");
                    name = goldenStr[1];
                    sex = goldenStr[2];
                    age = goldenStr[3];
                    team = goldenStr[6];
                    sport = goldenStr[12];
                    games = goldenStr[8];
                    sum = name + "," + sex + "," + age + "," + team + "," + sport + "," + games;
                    word.set(sum);
                    context.write(word, one);
                }
            }
            counter++;
        }
    }
}

My Reducer is:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class NewWordReducer extends Reducer <Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    
        int count = 0;
        for(IntWritable val : values) {
        
            String line  = val.toString();
            String[] arrOfStr = line.split(",");
            String name = arrOfStr[0];
        
            count += val.get();
        }
        context.write(key, new IntWritable(count));
    }
}

Answer 1:


The core idea of MapReduce jobs is that the Map function is used to extract valuable information from the input and "transform" it into key-value pairs, on which the Reduce function is then executed separately for each key. Your code seems to show a misunderstanding about how the latter is executed, but that's no biggie, since your code stays very close to the WordCount example.

Let's say we have a file with stats of Olympic athletes and their medal performance, like the one you showed, under a directory named /olympic_stats in HDFS, as shown below (note that I included several records for the same athlete, since this example needs them to work):

1,A Dijiang,M,24,180,80,China,CHN,1992,Summer 1992,Summer,Barcelona,Basketball,Men's Basketball,NA
2,T Kekempourdas,M,33,189,85,Greece,GRE,2004,Summer 2004,Summer,Athens,Judo,Men's Judo,Gold
3,T Kekempourdas,M,33,189,85,Greece,GRE,2000,Summer 2000,Summer,Sydney,Judo,Men's Judo,Bronze
4,K Stefanidi,F,29,183,76,Greece,GRE,2016,Summer 2016,Summer,Rio,Pole Vault, Women's Pole Vault,Silver
5,A Jones,F,26,160,56,Canada,CAN,2012,Summer 2012,Summer,London,Acrobatics,Women's Acrobatics,Gold
5,A Jones,F,26,160,56,Canada,CAN,2016,Summer 2012,Summer,Rio,Acrobatics,Women's Acrobatics,Gold
6,C Glover,M,33,175,80,USA,USA,2008,Summer 2008,Summer,Beijing,Archery,Men's Archery,Gold
7,C Glover,M,33,175,80,USA,USA,2012,Summer 2012,Summer,London,Archery,Men's Archery,Gold
8,C Glover,M,33,175,80,USA,USA,2016,Summer 2016,Summer,Rio,Archery,Men's Archery,Gold

For the Map function, we need to find the column that is good to use as a key in order to calculate how many gold medals each athlete has. As we can easily see above, every athlete can have one or more records, and they all have his/her name in the second column, so we can safely use the name as the key of the key-value pairs. As for the value: since we want to calculate how many gold medals an athlete has, we have to check the last (15th, index 14) column, which indicates whether and which medal this athlete won. If this column equals the String Gold, then we can be sure that this athlete has at least 1 gold medal in his/her career so far. So here, as the value, we can just put 1.
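
To make this concrete, here are the key-value pairs the map phase would emit for the sample input above (only records with Gold in the medal column pass the filter):

(T Kekempourdas, 1)
(A Jones, 1)
(A Jones, 1)
(C Glover, 1)
(C Glover, 1)
(C Glover, 1)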

Now for the Reduce function, as it is executed separately for each different key, we can understand that the input values it gets from the mappers are going to be for the same exact athlete. Since the key-value pairs that were generated from the mappers had just 1 at their values for each gold medal for the given athlete, we could just add all these 1's up and get the total number of gold medals for each one of them.
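
For the sample input, these pairs are grouped by key before reaching the reducers, so each reduce call just sums up the 1's of a single athlete:

(A Jones, [1, 1])        ->  (A Jones, 2)
(C Glover, [1, 1, 1])    ->  (C Glover, 3)
(T Kekempourdas, [1])    ->  (T Kekempourdas, 1)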

So the code for this is like the one below (I'm putting the mapper, reducer, and driver in the same file for the sake of simplicity):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class GoldMedals 
{
    /* input:  <byte_offset, line_of_dataset>
     * output: <Athlete's Name, 1>
     */
    public static class Map extends Mapper<Object, Text, Text, IntWritable> 
    {
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException 
        {
            String record = value.toString();
            String[] columns = record.split(",");

            // extract the athlete's name and his/her medal indication
            String athlete_name = columns[1];
            String medal = columns[14];

            // only hold the gold medal athletes, with their name as the key
            // and 1 as the least number of gold medals they have so far
            if(medal.equals("Gold")) 
                context.write(new Text(athlete_name), new IntWritable(1));
        }
    }

    /* input:  <Athlete's Name, 1>
     * output: <Athlete's Name, Athlete's Total Gold Medals>
     */
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable>
    {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException 
        {
            int sum = 0;
            
            // for a single athlete, add all of the gold medals they had so far...
            for(IntWritable value : values)
                    sum += value.get();

            // and write the result as the value on the output key-value pairs
            context.write(key, new IntWritable(sum));
        }
    }


    public static void main(String[] args) throws Exception
    {
        // set the paths of the input and output directories in the HDFS
        Path input_dir = new Path("olympic_stats");
        Path output_dir = new Path("gold_medals");

        // in case the output directory already exists, delete it
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        if(fs.exists(output_dir))
            fs.delete(output_dir, true);

        // configure the MapReduce job
        Job goldmedals_job = Job.getInstance(conf, "Gold Medals Counter");
        goldmedals_job.setJarByClass(GoldMedals.class);
        goldmedals_job.setMapperClass(Map.class);
        goldmedals_job.setCombinerClass(Reduce.class);
        goldmedals_job.setReducerClass(Reduce.class);    
        goldmedals_job.setMapOutputKeyClass(Text.class);
        goldmedals_job.setMapOutputValueClass(IntWritable.class);
        goldmedals_job.setOutputKeyClass(Text.class);
        goldmedals_job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(goldmedals_job, input_dir);
        FileOutputFormat.setOutputPath(goldmedals_job, output_dir);
        goldmedals_job.waitForCompletion(true);
    }
}

The output of the program above is stored inside the /gold_medals directory in HDFS (the output_dir set in the driver above). For the sample input it contains the following, which confirms that the MapReduce job was designed correctly:
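
A Jones         2
C Glover        3
T Kekempourdas  1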



Source: https://stackoverflow.com/questions/65084063/mapreduce-hadoop-on-linux-change-reduce-key
