Hadoop Cluster Big Data Solutions: MapReduce in Practice, Advanced (Custom Partition, Sort & Group) (Part 6)

Posted by 半世苍凉 on 2020-02-04 02:48:17

Preparation

  The previous post walked through a simple word count example, mainly to illustrate the MapReduce flow. When writing MapReduce programs, however, the programmer controls far more than map and reduce: partition, sort, group and combiner can all be customized. This post uses one example to demonstrate the first three (the example is not a good fit for a combiner, which will be covered another time). The goals are:
   1. a custom sort;
   2. a custom partition;
   3. a custom group.

Requirements

   1 ) Sample data (date-time and temperature, separated by a Tab):

1949-05-01 14:21:01	38℃
1949-06-03 13:45:01	40℃
1950-01-23 14:21:01	38℃
1950-10-23 08:21:01	12℃
1951-12-18 14:21:01	40℃
1950-10-24 13:21:01	42℃
1950-10-26 14:21:01	45℃
1951-08-01 14:21:01	40℃
1951-08-02 14:21:01	48℃
1953-07-01 14:21:01	45℃
1953-08-06 14:21:01	48℃
1954-06-02 14:21:01	36℃
1952-08-02 14:21:01	45℃
1955-06-02 14:21:01	42℃
1952-04-02 14:21:01	43℃
1953-05-02 14:21:01	34℃
1949-09-02 14:21:01	29℃
1953-10-02 14:21:01	47℃
1952-11-02 14:21:01	45℃
1953-04-02 14:21:01	40℃
1954-05-02 14:21:01	45℃
1955-07-02 14:21:01	28℃
1954-05-09 14:21:01	50℃
1955-09-02 14:21:01	49℃
1953-09-02 14:21:01	32℃

   2 ) Tasks:

1. For each year from 1949 to 1955, find the time of the highest temperature;
2. For each year from 1949 to 1955, find the top 3 hottest records;

   3 ) Approach:

  1. Sort by year ascending and, within each year, by temperature descending;
  2. Group by year, so each year gets its own reduce task;
  3. From the reduce output, take the top 1 and the top 3.
  
  Core idea: the mapper's output key is a composite object ordered by (year ascending, temperature descending), which calls for a custom data type; a small illustration of the pairs it emits follows.
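
  To make this concrete, for the four 1950 lines the mapper will emit pairs whose key is the composite (year, temperature) and whose value is the original line, roughly like this (illustration only):

(1950, 38)  ->  1950-01-23 14:21:01	38℃
(1950, 12)  ->  1950-10-23 08:21:01	12℃
(1950, 42)  ->  1950-10-24 13:21:01	42℃
(1950, 45)  ->  1950-10-26 14:21:01	45℃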

Implementation

   1 ) A custom data type for the key ordered by year ascending and temperature descending

   No existing Writable type orders by year ascending and temperature descending, so we need to write one ourselves. The code is as follows:


package temperture;

import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Objects;

//The custom type MyKeyYT cannot be used as a Hadoop key out of the box; it must implement WritableComparable, parameterized with itself
//Implementing WritableComparable means overriding readFields, write and compareTo
public class MyKeyYT implements WritableComparable<MyKeyYT>
{
  private int year;
  private int hot;

  public void setYear(int year) {
      this.year = year;
  }

  public int getYear() {
      return year;
  }

  public void setHot(int hot) {
      this.hot = hot;
  }

  public int getHot() {
      return hot;
  }

  //Hadoop moves data around as binary streams (RPC); readFields deserializes a stream back into this object
  public void readFields(DataInput dataInput) throws IOException
  {
      this.year=dataInput.readInt();
      this.hot=dataInput.readInt();

  }

  //Serialization: write year and hot into the binary stream
  public void write(DataOutput dataOutput) throws IOException
  {
      dataOutput.writeInt(year);
      dataOutput.writeInt(hot);
  }

  //Compare the incoming MyKeyYT with this object to determine key ordering
  public int compareTo(MyKeyYT o)
  {
      int myresult = Integer.compare(year,o.getYear());
      if(myresult!=0)
          return myresult;
      return Integer.compare(hot,o.getHot());
  }

  //Override toString so the key prints as year<TAB>hot
  @Override
  public String toString() {
      return year+"\t"+hot;
  }

  //Override hashCode; with the custom Partitioner below it is rarely consulted,
  //but it should still be stable for a given (year, hot) pair
  @Override
  public int hashCode() {
      return Integer.hashCode(year + hot);
  }
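
  //(Addition, not in the original post) Since hashCode is overridden, an equals
  //consistent with it is good practice; MapReduce itself compares keys via compareTo,
  //so this override is optional here.
  @Override
  public boolean equals(Object obj) {
      if (this == obj) return true;
      if (!(obj instanceof MyKeyYT)) return false;
      MyKeyYT other = (MyKeyYT) obj;
      return year == other.year && hot == other.hot;
  }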
}

   2 ) Custom partition

   We want each year to go to its own reducer. The default hash partition cannot guarantee that, so we subclass the Partitioner class and give each year its own partition. The code is as follows:


package temperture;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

//Extend Partitioner and override the partition function
public class MyPartition  extends Partitioner<MyKeyYT, Text> {

   //myKeyYT is the map output key, text the map output value, i the number of reduce tasks
   @Override
   public int getPartition(MyKeyYT myKeyYT, Text text, int i)
   {
       //partition by year; *200 just spreads the year values before taking the modulo
       return (myKeyYT.getYear()*200)%i;
   }
}
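
   The *200 factor happens to send the seven years 1949–1955 to seven distinct partitions (you can check the remainders by hand), but that is not obvious at a glance. A more transparent variant, a sketch of the same idea rather than the author's code (the class name MyYearOffsetPartition is mine), maps years onto partitions directly:

package temperture;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

//Alternative sketch: years 1949..1955 map straight to partitions 0..6
public class MyYearOffsetPartition extends Partitioner<MyKeyYT, Text> {
    @Override
    public int getPartition(MyKeyYT myKeyYT, Text text, int numPartitions) {
        //Math.abs guards against unexpected years outside 1949..1955
        return Math.abs(myKeyYT.getYear() - 1949) % numPartitions;
    }
}

   If you use it instead, register it with myjob.setPartitionerClass(MyYearOffsetPartition.class).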

   3 ) Custom sort

   The default sort is lexicographic, which cannot produce a numeric "year ascending, temperature descending" order, so we write a custom sort comparator. The code is as follows:

package temperture;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

//Custom sort comparator: year ascending, temperature descending
public class MySortTemp extends WritableComparator
{
    //constructor registers the key type
    public  MySortTemp()
    {
        //true: create MyKeyYT instances so map output keys can be deserialized and compared
        super(MyKeyYT.class,true);
    }

    //the core compare method
    @Override
    public int compare(WritableComparable a, WritableComparable b)
    {
        MyKeyYT myKeyYT1=(MyKeyYT) a;
        MyKeyYT myKeyYT2=(MyKeyYT) b;
        int myresult = Integer.compare(myKeyYT1.getYear(),myKeyYT2.getYear()); //year ascending
        if(myresult!=0)
            return myresult;
        return -Integer.compare(myKeyYT1.getHot(),myKeyYT2.getHot()); //the minus sign makes temperature descending
    }
}
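
   A quick throwaway check, run locally from any main method (my snippet, not part of the job), makes the ordering concrete:

//Throwaway local check of MySortTemp: year ascending, temperature descending
MyKeyYT a = new MyKeyYT(); a.setYear(1950); a.setHot(45);
MyKeyYT b = new MyKeyYT(); b.setYear(1950); b.setHot(42);
MyKeyYT c = new MyKeyYT(); c.setYear(1951); c.setHot(48);
MySortTemp sorter = new MySortTemp();
System.out.println(sorter.compare(a, b)); //negative: (1950, 45) sorts before (1950, 42)
System.out.println(sorter.compare(c, a)); //positive: 1951 comes after 1950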

   4 ) Custom group

   Before reduce, records with identical keys are grouped together by default. Our key is the composite (year ascending, temperature descending), so the default grouping would not give us the one-group-per-year input we want for the reduce computation. We therefore write a custom grouping comparator. The code is as follows:




package temperture;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

//Grouping works the same way as sorting: records that compare as equal here end up in the same group
public class MyGroup  extends WritableComparator {
    //constructor registers the key type, as in the sort comparator
   public MyGroup()
   {
       super(MyKeyYT.class,true);
   }

   //copied from the sort comparator, but only the year is compared, so all records of a year form one group
    @Override
    public int compare(WritableComparable a, WritableComparable b)
    {
        MyKeyYT myKeyYT1=(MyKeyYT) a;
        MyKeyYT myKeyYT2=(MyKeyYT) b;
        return Integer.compare(myKeyYT1.getYear(),myKeyYT2.getYear()); //same year -> same group

    }
}
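
   One detail worth spelling out: with this grouping comparator every record of a year lands in a single reduce() call, and the key object seen inside the value loop is re-filled for each record. Illustrated for the 1950 group (my illustration, based on the sample data):

//Inside the single reduce() call for 1950, iteration proceeds roughly as:
//  key = (1950, 45), value = "1950-10-26 14:21:01	45℃"
//  key = (1950, 42), value = "1950-10-24 13:21:01	42℃"
//  key = (1950, 38), value = "1950-01-23 14:21:01	38℃"
//  key = (1950, 12), value = "1950-10-23 08:21:01	12℃"
//Because MySortTemp already put the hottest record first, the first element of
//every group is that year's maximum temperature.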

   5 ) The driver (main) class

  To keep things simple, the map and reduce classes are placed in the same class as the driver. Note that every customized component must be registered when the job is configured; otherwise the defaults are loaded, our own classes are silently ignored, and the program may even fail. The custom part looks like this:


            myjob.setMapOutputKeyClass(MyKeyYT.class);//map output key type
            myjob.setMapOutputValueClass(Text.class);//map output value type
            myjob.setNumReduceTasks(7);//number of reducers: 7, one per year
            myjob.setPartitionerClass(MyPartition.class); //custom partitioner
            myjob.setSortComparatorClass(MySortTemp.class); //custom sort comparator
            myjob.setGroupingComparatorClass(MyGroup.class); //custom grouping comparator

 

  The full driver class:

package temperture;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

public class MyTemperatureRunJob  {
    public static SimpleDateFormat sdf=new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    static class MyTemperatureMapper extends Mapper<LongWritable, Text,MyKeyYT,Text>
    {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String []ss =line.split("\t");

            if(ss.length==2)
            {
                try
                {
                    Date mydate=sdf.parse(ss[0]); //parse the date-time field

                    //extract the year
                    Calendar myCalendar=Calendar.getInstance();
                    myCalendar.setTime(mydate);
                    int year =myCalendar.get(Calendar.YEAR);

                    String myhot = ss[1].substring(0,ss[1].indexOf("℃")); //strip the ℃ suffix

                    //build the composite key (year, temperature)
                    MyKeyYT myKeyYT=new MyKeyYT();
                    myKeyYT.setYear(year);
                    myKeyYT.setHot(Integer.parseInt(myhot));

                    context.write(myKeyYT,value);

                } catch (Exception e) {
                    e.printStackTrace();
                }

            }
        }
    }

    //The reducer just echoes every record of its group; with MyGroup each group is one
    //year, the records arrive hottest-first, and the key is refreshed for each value
    static class MyTemperatureReducer extends Reducer<MyKeyYT,Text,MyKeyYT,Text>
    {
        @Override
        protected void reduce(MyKeyYT key, Iterable<Text> values, Context context) throws IOException, InterruptedException
        {
            for(Text v:values)
            {
                context.write(key,v);
            }
        }
    }

    public static void main(String[] args)
    {
        //load the Hadoop configuration
        Configuration conf =new Configuration();

        //mapreduce.job.tracker can be set to match mapred-site.xml on the cluster,
        //but when the job runs on the cluster it is picked up automatically,
        //so the line below can simply be omitted.
        // conf.set("mapreduce.job.tracker","dw-cluster-master:9001");

        try
        {
            //MapReduce creates the output folder automatically,
            //but the job fails if the target folder already exists.
            //Deleting it first lets the program be rerun.
            Path outputPath= new Path(args[2]);
            FileSystem fileSystem =FileSystem.get(conf);
            if(fileSystem.exists(outputPath)){
                fileSystem.delete(outputPath,true);
                System.out.println("outputPath is exist,but has deleted!");
            }

            Job myjob= Job.getInstance(conf);
            myjob.setJarByClass(MyTemperatureRunJob.class);//the class whose jar is shipped to the cluster and run
            myjob.setMapperClass(MyTemperatureMapper.class);//map class
            myjob.setReducerClass(MyTemperatureReducer.class);//reduce class
            myjob.setMapOutputKeyClass(MyKeyYT.class);//map output key type
            myjob.setMapOutputValueClass(Text.class);//map output value type

            myjob.setNumReduceTasks(7);//number of reducers: 7, one per year
            myjob.setPartitionerClass(MyPartition.class); //custom partitioner
            myjob.setSortComparatorClass(MySortTemp.class); //custom sort comparator
            myjob.setGroupingComparatorClass(MyGroup.class); //custom grouping comparator


            //args[1] is used because args[0] holds the main class name given on the command line
            FileInputFormat.addInputPath(myjob,new Path(args[1]));//job input path: the second argument after the jar
            //FileInputFormat.addInputPath(myjob,new Path("/tmp/wcinput/wordcount.xt"));
            //job output path: args[2] is the third argument after the jar
            FileOutputFormat.setOutputPath(myjob,new Path(args[2]));
            //FileOutputFormat.setOutputPath(myjob,new Path("/tmp/wcoutput"));
            System.exit(myjob.waitForCompletion(true)?0:1);//wait for the job to finish; exit 0 on success
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }



    }

}

Deployment and Invocation

  For packaging and deployment, see the packaging section of Hadoop Cluster Big Data Solutions: MapReduce in Practice with Maven in the IDE (Part 5).

  Upload the jar to the cluster and invoke it with a command like:

 hadoop jar hadoop_mr_temperature.jar MyTemperatureRunJob /tmp/input/data1.txt /tmp/output/

  The full run looks like this:

[liuxiaowei@dw-cluster-master temperature]$ hadoop jar hadoop_mr_temperature.jar MyTemperatureRunJob /tmp/input/data1.txt /tmp/output/
outputPath is exist,but has deleted!
20/02/03 15:40:20 INFO client.RMProxy: Connecting to ResourceManager at dw-cluster-master/10.216.10.141:8032
20/02/03 15:40:21 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
20/02/03 15:40:22 INFO input.FileInputFormat: Total input files to process : 1
20/02/03 15:40:22 INFO mapreduce.JobSubmitter: number of splits:1
20/02/03 15:40:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1578394893972_0056
20/02/03 15:40:22 INFO impl.YarnClientImpl: Submitted application application_1578394893972_0056
20/02/03 15:40:22 INFO mapreduce.Job: The url to track the job: http://dw-cluster-master:8088/proxy/application_1578394893972_0056/
20/02/03 15:40:22 INFO mapreduce.Job: Running job: job_1578394893972_0056
20/02/03 15:40:28 INFO mapreduce.Job: Job job_1578394893972_0056 running in uber mode : false
20/02/03 15:40:28 INFO mapreduce.Job:  map 0% reduce 0%
20/02/03 15:40:33 INFO mapreduce.Job:  map 100% reduce 0%
20/02/03 15:40:38 INFO mapreduce.Job:  map 100% reduce 29%
20/02/03 15:40:39 INFO mapreduce.Job:  map 100% reduce 100%
20/02/03 15:40:40 INFO mapreduce.Job: Job job_1578394893972_0056 completed successfully
20/02/03 15:40:41 INFO mapreduce.Job: Counters: 50
        File System Counters
                FILE: Number of bytes read=942
                FILE: Number of bytes written=1291003
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=761
                HDFS: Number of bytes written=850
                HDFS: Number of read operations=24
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=14
        Job Counters
                Killed reduce tasks=1
                Launched map tasks=1
                Launched reduce tasks=7
                Rack-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=3331
                Total time spent by all reduces in occupied slots (ms)=18830
                Total time spent by all map tasks (ms)=3331
                Total time spent by all reduce tasks (ms)=18830
                Total vcore-milliseconds taken by all map tasks=3331
                Total vcore-milliseconds taken by all reduce tasks=18830
                Total megabyte-milliseconds taken by all map tasks=3410944
                Total megabyte-milliseconds taken by all reduce tasks=19281920
        Map-Reduce Framework
                Map input records=25
                Map output records=25
                Map output bytes=850
                Map output materialized bytes=942
                Input split bytes=111
                Combine input records=0
                Combine output records=0
                Reduce input groups=7
                Reduce shuffle bytes=942
                Reduce input records=25
                Reduce output records=25
                Spilled Records=50
                Shuffled Maps =7
                Failed Shuffles=0
                Merged Map outputs=7
                GC time elapsed (ms)=481
                CPU time spent (ms)=6580
                Physical memory (bytes) snapshot=2787332096
                Virtual memory (bytes) snapshot=51036344320
                Total committed heap usage (bytes)=2913992704
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=650
        File Output Format Counters
                Bytes Written=850

  The output is shown in Figure 1: there are 7 years and one reducer per year, so there are 7 output files:
[Figure 1: overall output - 7 part files, one per year]

  Each year's highest temperature is simply the first line of its file, and the top 3 are its first three lines. The full output for the test data is:

[liuxiaowei@dw-cluster-master temperature]$ hadoop fs -cat /tmp/output/part-*
1953    48      1953-08-06 14:21:01     48℃
1953    47      1953-10-02 14:21:01     47℃
1953    45      1953-07-01 14:21:01     45℃
1953    40      1953-04-02 14:21:01     40℃
1953    34      1953-05-02 14:21:01     34℃
1953    32      1953-09-02 14:21:01     32℃
1955    49      1955-09-02 14:21:01     49℃
1955    42      1955-06-02 14:21:01     42℃
1955    28      1955-07-02 14:21:01     28℃
1950    45      1950-10-26 14:21:01     45℃
1950    42      1950-10-24 13:21:01     42℃
1950    38      1950-01-23 14:21:01     38℃
1950    12      1950-10-23 08:21:01     12℃
1952    45      1952-08-02 14:21:01     45℃
1952    45      1952-11-02 14:21:01     45℃
1952    43      1952-04-02 14:21:01     43℃
1954    50      1954-05-09 14:21:01     50℃
1954    45      1954-05-02 14:21:01     45℃
1954    36      1954-06-02 14:21:01     36℃
1949    40      1949-06-03 13:45:01     40℃
1949    38      1949-05-01 14:21:01     38℃
1949    29      1949-09-02 14:21:01     29℃
1951    48      1951-08-02 14:21:01     48℃
1951    40      1951-08-01 14:21:01     40℃
1951    40      1951-12-18 14:21:01     40℃

  Taking the top 3 looks like this; taking the top 1 is left as an exercise.

[liuxiaowei@dw-cluster-master temperature]$ hadoop fs -cat /tmp/output/part-r-00000 | head  -n3
1953    48      1953-08-06 14:21:01     48℃
1953    47      1953-10-02 14:21:01     47℃
1953    45      1953-07-01 14:21:01     45℃
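
  If you prefer the job to emit the top N per year directly instead of post-processing with head, a small reducer variant is enough. This is my sketch, not the original code; it relies on MySortTemp having already ordered each group hottest-first:

    //Reducer variant that writes only the N hottest records of each year
    static class MyTopNReducer extends Reducer<MyKeyYT, Text, MyKeyYT, Text> {
        private static final int TOP_N = 3; //set to 1 for just the hottest record

        @Override
        protected void reduce(MyKeyYT key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int written = 0;
            for (Text v : values) {
                if (written++ >= TOP_N) break; //values are already sorted hot-descending
                context.write(key, v);
            }
        }
    }

  Registering it with myjob.setReducerClass(MyTopNReducer.class) in place of MyTemperatureReducer leaves at most three lines in each part file.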

Full Project on GitHub

  Link: hadoop_mr_temperature
