Hadoop Cluster Big Data Solutions: MapReduce in Practice, Advanced (Custom Partition, Sort & Group) (Part 6)

Posted by 半世苍凉 on 2020-02-04 02:48:17

Preparation

  The previous post walked through a simple word count example, mainly to illustrate the MapReduce flow. When writing MapReduce programs, however, the programmer controls far more than map and reduce: partition, sort, group and combiner can all be customized. This post uses one example to demonstrate the first three (the example is not a good fit for a combiner, which will be covered another time). The goals are:
   1. a custom sort;
   2. a custom partition;
   3. a custom group.

Requirements

   1 ) Sample data (date-time and temperature, separated by a Tab):

1949-05-01 14:21:01	38℃
1949-06-03 13:45:01	40℃
1950-01-23 14:21:01	38℃
1950-10-23 08:21:01	12℃
1951-12-18 14:21:01	40℃
1950-10-24 13:21:01	42℃
1950-10-26 14:21:01	45℃
1951-08-01 14:21:01	40℃
1951-08-02 14:21:01	48℃
1953-07-01 14:21:01	45℃
1953-08-06 14:21:01	48℃
1954-06-02 14:21:01	36℃
1952-08-02 14:21:01	45℃
1955-06-02 14:21:01	42℃
1952-04-02 14:21:01	43℃
1953-05-02 14:21:01	34℃
1949-09-02 14:21:01	29℃
1953-10-02 14:21:01	47℃
1952-11-02 14:21:01	45℃
1953-04-02 14:21:01	40℃
1954-05-02 14:21:01	45℃
1955-07-02 14:21:01	28℃
1954-05-09 14:21:01	50℃
1955-09-02 14:21:01	49℃
1953-09-02 14:21:01	32℃

   2 ) Tasks:

1. For each year from 1949 to 1955, find the time of the highest temperature;
2. For each year from 1949 to 1955, find the top 3 hottest records;

   3 ) Approach:

  1. Sort by year ascending and, within each year, by temperature descending;
  2. Group by year, so each year gets its own reduce task;
  3. From the reduce output, take the top 1 and the top 3.
  
  Core idea: the mapper's output key is a composite object ordered by (year ascending, temperature descending), which calls for a custom data type; a small illustration of the pairs it emits follows.
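
  To make this concrete, for the four 1950 lines the mapper will emit pairs whose key is the composite (year, temperature) and whose value is the original line, roughly like this (illustration only):

(1950, 38)  ->  1950-01-23 14:21:01	38℃
(1950, 12)  ->  1950-10-23 08:21:01	12℃
(1950, 42)  ->  1950-10-24 13:21:01	42℃
(1950, 45)  ->  1950-10-26 14:21:01	45℃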

Implementation

   1 ) A custom data type for the key ordered by year ascending and temperature descending

   No existing Writable type orders by year ascending and temperature descending, so we need to write one ourselves. The code is as follows:


package temperture;

import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Objects;

//The custom type MyKeyYT cannot be used as a Hadoop key out of the box; it must implement WritableComparable, parameterized with itself
//Implementing WritableComparable means overriding readFields, write and compareTo
public class MyKeyYT implements WritableComparable<MyKeyYT>
{
  private int year;
  private int hot;

  public void setYear(int year) {
      this.year = year;
  }

  public int getYear() {
      return year;
  }

  public void setHot(int hot) {
      this.hot = hot;
  }

  public int getHot() {
      return hot;
  }

  //Hadoop moves data around as binary streams (RPC); readFields deserializes a stream back into this object
  public void readFields(DataInput dataInput) throws IOException
  {
      this.year=dataInput.readInt();
      this.hot=dataInput.readInt();

  }

  //Serialization: write year and hot into the binary stream
  public void write(DataOutput dataOutput) throws IOException
  {
      dataOutput.writeInt(year);
      dataOutput.writeInt(hot);
  }

  //Compare the incoming MyKeyYT with this object to determine key ordering
  public int compareTo(MyKeyYT o)
  {
      int myresult = Integer.compare(year,o.getYear());
      if(myresult!=0)
          return myresult;
      return Integer.compare(hot,o.getHot());
  }

  //Override toString so the key prints as year<TAB>hot
  @Override
  public String toString() {
      return year+"\t"+hot;
  }

  //Override hashCode; with the custom Partitioner below it is rarely consulted,
  //but it should still be stable for a given (year, hot) pair
  @Override
  public int hashCode() {
      return Integer.hashCode(year + hot);
  }
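
  //(Addition, not in the original post) Since hashCode is overridden, an equals
  //consistent with it is good practice; MapReduce itself compares keys via compareTo,
  //so this override is optional here.
  @Override
  public boolean equals(Object obj) {
      if (this == obj) return true;
      if (!(obj instanceof MyKeyYT)) return false;
      MyKeyYT other = (MyKeyYT) obj;
      return year == other.year && hot == other.hot;
  }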
}

   2 ) Custom partition

   We want each year to go to its own reducer. The default hash partition cannot guarantee that, so we subclass the Partitioner class and give each year its own partition. The code is as follows:


package temperture;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

//Extend Partitioner and override the partition function
public class MyPartition  extends Partitioner<MyKeyYT, Text> {

   //myKeyYT is the map output key, text the map output value, i the number of reduce tasks
   @Override
   public int getPartition(MyKeyYT myKeyYT, Text text, int i)
   {
       //partition by year; *200 just spreads the year values before taking the modulo
       return (myKeyYT.getYear()*200)%i;
   }
}
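
   The *200 factor happens to send the seven years 1949–1955 to seven distinct partitions (you can check the remainders by hand), but that is not obvious at a glance. A more transparent variant, a sketch of the same idea rather than the author's code (the class name MyYearOffsetPartition is mine), maps years onto partitions directly:

package temperture;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

//Alternative sketch: years 1949..1955 map straight to partitions 0..6
public class MyYearOffsetPartition extends Partitioner<MyKeyYT, Text> {
    @Override
    public int getPartition(MyKeyYT myKeyYT, Text text, int numPartitions) {
        //Math.abs guards against unexpected years outside 1949..1955
        return Math.abs(myKeyYT.getYear() - 1949) % numPartitions;
    }
}

   If you use it instead, register it with myjob.setPartitionerClass(MyYearOffsetPartition.class).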

   3 ) Custom sort

   The default sort is lexicographic, which cannot produce a numeric "year ascending, temperature descending" order, so we write a custom sort comparator. The code is as follows:

package temperture;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

//Custom sort comparator: year ascending, temperature descending
public class MySortTemp extends WritableComparator
{
    //constructor registers the key type
    public  MySortTemp()
    {
        //true: create MyKeyYT instances so map output keys can be deserialized and compared
        super(MyKeyYT.class,true);
    }

    //the core compare method
    @Override
    public int compare(WritableComparable a, WritableComparable b)
    {
        MyKeyYT myKeyYT1=(MyKeyYT) a;
        MyKeyYT myKeyYT2=(MyKeyYT) b;
        int myresult = Integer.compare(myKeyYT1.getYear(),myKeyYT2.getYear()); //year ascending
        if(myresult!=0)
            return myresult;
        return -Integer.compare(myKeyYT1.getHot(),myKeyYT2.getHot()); //the minus sign makes temperature descending
    }
}
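
   A quick throwaway check, run locally from any main method (my snippet, not part of the job), makes the ordering concrete:

//Throwaway local check of MySortTemp: year ascending, temperature descending
MyKeyYT a = new MyKeyYT(); a.setYear(1950); a.setHot(45);
MyKeyYT b = new MyKeyYT(); b.setYear(1950); b.setHot(42);
MyKeyYT c = new MyKeyYT(); c.setYear(1951); c.setHot(48);
MySortTemp sorter = new MySortTemp();
System.out.println(sorter.compare(a, b)); //negative: (1950, 45) sorts before (1950, 42)
System.out.println(sorter.compare(c, a)); //positive: 1951 comes after 1950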

   4 ) Custom group

   Before reduce, records with identical keys are grouped together by default. Our key is the composite (year ascending, temperature descending), so the default grouping would not give us the one-group-per-year input we want for the reduce computation. We therefore write a custom grouping comparator. The code is as follows:




package temperture;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

//Grouping works the same way as sorting: records that compare as equal here end up in the same group
public class MyGroup  extends WritableComparator {
    //constructor registers the key type, as in the sort comparator
   public MyGroup()
   {
       super(MyKeyYT.class,true);
   }

   //copied from the sort comparator, but only the year is compared, so all records of a year form one group
    @Override
    public int compare(WritableComparable a, WritableComparable b)
    {
        MyKeyYT myKeyYT1=(MyKeyYT) a;
        MyKeyYT myKeyYT2=(MyKeyYT) b;
        return Integer.compare(myKeyYT1.getYear(),myKeyYT2.getYear()); //same year -> same group

    }
}
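
   One detail worth spelling out: with this grouping comparator every record of a year lands in a single reduce() call, and the key object seen inside the value loop is re-filled for each record. Illustrated for the 1950 group (my illustration, based on the sample data):

//Inside the single reduce() call for 1950, iteration proceeds roughly as:
//  key = (1950, 45), value = "1950-10-26 14:21:01	45℃"
//  key = (1950, 42), value = "1950-10-24 13:21:01	42℃"
//  key = (1950, 38), value = "1950-01-23 14:21:01	38℃"
//  key = (1950, 12), value = "1950-10-23 08:21:01	12℃"
//Because MySortTemp already put the hottest record first, the first element of
//every group is that year's maximum temperature.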

   5 ) The driver (main) class

  To keep things simple, the map and reduce classes are placed in the same class as the driver. Note that every customized component must be registered when the job is configured; otherwise the defaults are loaded, our own classes are silently ignored, and the program may even fail. The custom part looks like this:


            myjob.setMapOutputKeyClass(MyKeyYT.class);//map output key type
            myjob.setMapOutputValueClass(Text.class);//map output value type
            myjob.setNumReduceTasks(7);//number of reducers: 7, one per year
            myjob.setPartitionerClass(MyPartition.class); //custom partitioner
            myjob.setSortComparatorClass(MySortTemp.class); //custom sort comparator
            myjob.setGroupingComparatorClass(MyGroup.class); //custom grouping comparator

 

  The full driver class:

package temperture;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

public class MyTemperatureRunJob  {
    public static SimpleDateFormat sdf=new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    static class MyTemperatureMapper extends Mapper<LongWritable, Text,MyKeyYT,Text>
    {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String []ss =line.split("\t");

            if(ss.length==2)
            {
                try
                {
                    Date mydate=sdf.parse(ss[0]); //parse the date-time field

                    //extract the year
                    Calendar myCalendar=Calendar.getInstance();
                    myCalendar.setTime(mydate);
                    int year =myCalendar.get(Calendar.YEAR);

                    String myhot = ss[1].substring(0,ss[1].indexOf("℃")); //strip the ℃ suffix

                    //build the composite key (year, temperature)
                    MyKeyYT myKeyYT=new MyKeyYT();
                    myKeyYT.setYear(year);
                    myKeyYT.setHot(Integer.parseInt(myhot));

                    context.write(myKeyYT,value);

                } catch (Exception e) {
                    e.printStackTrace();
                }

            }
        }
    }

    //The reducer just echoes every record of its group; with MyGroup each group is one
    //year, the records arrive hottest-first, and the key is refreshed for each value
    static class MyTemperatureReducer extends Reducer<MyKeyYT,Text,MyKeyYT,Text>
    {
        @Override
        protected void reduce(MyKeyYT key, Iterable<Text> values, Context context) throws IOException, InterruptedException
        {
            for(Text v:values)
            {
                context.write(key,v);
            }
        }
    }

    public static void main(String[] args)
    {
        //load the Hadoop configuration
        Configuration conf =new Configuration();

        //mapreduce.job.tracker can be set to match mapred-site.xml on the cluster,
        //but when the job runs on the cluster it is picked up automatically,
        //so the line below can simply be omitted.
        // conf.set("mapreduce.job.tracker","dw-cluster-master:9001");

        try
        {
            //MapReduce creates the output folder automatically,
            //but the job fails if the target folder already exists.
            //Deleting it first lets the program be rerun.
            Path outputPath= new Path(args[2]);
            FileSystem fileSystem =FileSystem.get(conf);
            if(fileSystem.exists(outputPath)){
                fileSystem.delete(outputPath,true);
                System.out.println("outputPath is exist,but has deleted!");
            }

            Job myjob= Job.getInstance(conf);
            myjob.setJarByClass(MyTemperatureRunJob.class);//the class whose jar is shipped to the cluster and run
            myjob.setMapperClass(MyTemperatureMapper.class);//map class
            myjob.setReducerClass(MyTemperatureReducer.class);//reduce class
            myjob.setMapOutputKeyClass(MyKeyYT.class);//map output key type
            myjob.setMapOutputValueClass(Text.class);//map output value type

            myjob.setNumReduceTasks(7);//number of reducers: 7, one per year
            myjob.setPartitionerClass(MyPartition.class); //custom partitioner
            myjob.setSortComparatorClass(MySortTemp.class); //custom sort comparator
            myjob.setGroupingComparatorClass(MyGroup.class); //custom grouping comparator


            //args[1] is used because args[0] holds the main class name given on the command line
            FileInputFormat.addInputPath(myjob,new Path(args[1]));//job input path: the second argument after the jar
            //FileInputFormat.addInputPath(myjob,new Path("/tmp/wcinput/wordcount.xt"));
            //job output path: args[2] is the third argument after the jar
            FileOutputFormat.setOutputPath(myjob,new Path(args[2]));
            //FileOutputFormat.setOutputPath(myjob,new Path("/tmp/wcoutput"));
            System.exit(myjob.waitForCompletion(true)?0:1);//wait for the job to finish; exit 0 on success
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }



    }

}

Deployment and Invocation

  For packaging and deployment, see the packaging section of Hadoop Cluster Big Data Solutions: MapReduce in Practice with Maven in the IDE (Part 5).

  Upload the jar to the cluster and invoke it with a command like:

 hadoop jar hadoop_mr_temperature.jar MyTemperatureRunJob /tmp/input/data1.txt /tmp/output/

  The full run looks like this:

[liuxiaowei@dw-cluster-master temperature]$ hadoop jar hadoop_mr_temperature.jar MyTemperatureRunJob /tmp/input/data1.txt /tmp/output/
outputPath is exist,but has deleted!
20/02/03 15:40:20 INFO client.RMProxy: Connecting to ResourceManager at dw-cluster-master/10.216.10.141:8032
20/02/03 15:40:21 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
20/02/03 15:40:22 INFO input.FileInputFormat: Total input files to process : 1
20/02/03 15:40:22 INFO mapreduce.JobSubmitter: number of splits:1
20/02/03 15:40:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1578394893972_0056
20/02/03 15:40:22 INFO impl.YarnClientImpl: Submitted application application_1578394893972_0056
20/02/03 15:40:22 INFO mapreduce.Job: The url to track the job: http://dw-cluster-master:8088/proxy/application_1578394893972_0056/
20/02/03 15:40:22 INFO mapreduce.Job: Running job: job_1578394893972_0056
20/02/03 15:40:28 INFO mapreduce.Job: Job job_1578394893972_0056 running in uber mode : false
20/02/03 15:40:28 INFO mapreduce.Job:  map 0% reduce 0%
20/02/03 15:40:33 INFO mapreduce.Job:  map 100% reduce 0%
20/02/03 15:40:38 INFO mapreduce.Job:  map 100% reduce 29%
20/02/03 15:40:39 INFO mapreduce.Job:  map 100% reduce 100%
20/02/03 15:40:40 INFO mapreduce.Job: Job job_1578394893972_0056 completed successfully
20/02/03 15:40:41 INFO mapreduce.Job: Counters: 50
        File System Counters
                FILE: Number of bytes read=942
                FILE: Number of bytes written=1291003
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=761
                HDFS: Number of bytes written=850
                HDFS: Number of read operations=24
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=14
        Job Counters
                Killed reduce tasks=1
                Launched map tasks=1
                Launched reduce tasks=7
                Rack-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=3331
                Total time spent by all reduces in occupied slots (ms)=18830
                Total time spent by all map tasks (ms)=3331
                Total time spent by all reduce tasks (ms)=18830
                Total vcore-milliseconds taken by all map tasks=3331
                Total vcore-milliseconds taken by all reduce tasks=18830
                Total megabyte-milliseconds taken by all map tasks=3410944
                Total megabyte-milliseconds taken by all reduce tasks=19281920
        Map-Reduce Framework
                Map input records=25
                Map output records=25
                Map output bytes=850
                Map output materialized bytes=942
                Input split bytes=111
                Combine input records=0
                Combine output records=0
                Reduce input groups=7
                Reduce shuffle bytes=942
                Reduce input records=25
                Reduce output records=25
                Spilled Records=50
                Shuffled Maps =7
                Failed Shuffles=0
                Merged Map outputs=7
                GC time elapsed (ms)=481
                CPU time spent (ms)=6580
                Physical memory (bytes) snapshot=2787332096
                Virtual memory (bytes) snapshot=51036344320
                Total committed heap usage (bytes)=2913992704
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=650
        File Output Format Counters
                Bytes Written=850

  The output is shown in Figure 1: there are 7 years and one reducer per year, so there are 7 output files:
[Figure 1: overall output - 7 part files, one per year]

  Each year's highest temperature is simply the first line of its file, and the top 3 are its first three lines. The full output for the test data is:

[liuxiaowei@dw-cluster-master temperature]$ hadoop fs -cat /tmp/output/part-*
1953    48      1953-08-06 14:21:01     48℃
1953    47      1953-10-02 14:21:01     47℃
1953    45      1953-07-01 14:21:01     45℃
1953    40      1953-04-02 14:21:01     40℃
1953    34      1953-05-02 14:21:01     34℃
1953    32      1953-09-02 14:21:01     32℃
1955    49      1955-09-02 14:21:01     49℃
1955    42      1955-06-02 14:21:01     42℃
1955    28      1955-07-02 14:21:01     28℃
1950    45      1950-10-26 14:21:01     45℃
1950    42      1950-10-24 13:21:01     42℃
1950    38      1950-01-23 14:21:01     38℃
1950    12      1950-10-23 08:21:01     12℃
1952    45      1952-08-02 14:21:01     45℃
1952    45      1952-11-02 14:21:01     45℃
1952    43      1952-04-02 14:21:01     43℃
1954    50      1954-05-09 14:21:01     50℃
1954    45      1954-05-02 14:21:01     45℃
1954    36      1954-06-02 14:21:01     36℃
1949    40      1949-06-03 13:45:01     40℃
1949    38      1949-05-01 14:21:01     38℃
1949    29      1949-09-02 14:21:01     29℃
1951    48      1951-08-02 14:21:01     48℃
1951    40      1951-08-01 14:21:01     40℃
1951    40      1951-12-18 14:21:01     40℃

  Taking the top 3 looks like this; taking the top 1 is left as an exercise.

[liuxiaowei@dw-cluster-master temperature]$ hadoop fs -cat /tmp/output/part-r-00000 | head  -n3
1953    48      1953-08-06 14:21:01     48℃
1953    47      1953-10-02 14:21:01     47℃
1953    45      1953-07-01 14:21:01     45℃
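
  If you prefer the job to emit the top N per year directly instead of post-processing with head, a small reducer variant is enough. This is my sketch, not the original code; it relies on MySortTemp having already ordered each group hottest-first:

    //Reducer variant that writes only the N hottest records of each year
    static class MyTopNReducer extends Reducer<MyKeyYT, Text, MyKeyYT, Text> {
        private static final int TOP_N = 3; //set to 1 for just the hottest record

        @Override
        protected void reduce(MyKeyYT key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int written = 0;
            for (Text v : values) {
                if (written++ >= TOP_N) break; //values are already sorted hot-descending
                context.write(key, v);
            }
        }
    }

  Registering it with myjob.setReducerClass(MyTopNReducer.class) in place of MyTemperatureReducer leaves at most three lines in each part file.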

Full Project on GitHub

  Link: hadoop_mr_temperature
