Preparation
The previous post walked through a simple word count example, focusing on the overall MapReduce flow. But when programming MapReduce, the programmer can control far more than map and reduce: partition, sort, group, and combiner are all customizable. This post works through an example to illustrate them (the example is not a good fit for a combiner, which will be covered another time). The goals are:
1. Custom sorting;
2. Custom partitioning;
3. Custom grouping.
Requirements
1) Sample data (datetime and temperature, separated by a Tab):
1949-05-01 14:21:01 38℃
1949-06-03 13:45:01 40℃
1950-01-23 14:21:01 38℃
1950-10-23 08:21:01 12℃
1951-12-18 14:21:01 40℃
1950-10-24 13:21:01 42℃
1950-10-26 14:21:01 45℃
1951-08-01 14:21:01 40℃
1951-08-02 14:21:01 48℃
1953-07-01 14:21:01 45℃
1953-08-06 14:21:01 48℃
1954-06-02 14:21:01 36℃
1952-08-02 14:21:01 45℃
1955-06-02 14:21:01 42℃
1952-04-02 14:21:01 43℃
1953-05-02 14:21:01 34℃
1949-09-02 14:21:01 29℃
1953-10-02 14:21:01 47℃
1952-11-02 14:21:01 45℃
1953-04-02 14:21:01 40℃
1954-05-02 14:21:01 45℃
1955-07-02 14:21:01 28℃
1954-05-09 14:21:01 50℃
1955-09-02 14:21:01 49℃
1953-09-02 14:21:01 32℃
2) Requirements:
1. For 1949–1955, find the time of the highest temperature in each year;
2. For 1949–1955, find the top 3 highest-temperature records of each year;
3) Approach:
1. Sort by year ascending and, within each year, by temperature descending;
2. Group by year, so that each year goes to its own reduce task;
3. Then just take the top 1 and top 3 from each reduce output;
Key point: the mapper output key is a wrapper object ordered by year ascending and temperature descending, which requires a custom data type.
Implementation
1) Define a custom data type to store the "year ascending, temperature descending" key
Think about it: no existing Writable type can express a key that sorts by year ascending and temperature descending, so we need to write our own. The code is as follows:
package temperture;
import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Objects;
//The custom type MyKeyYT cannot be used directly as a Hadoop key; it must implement the WritableComparable interface, parameterized with the type itself
//Implementing WritableComparable means overriding readFields, write and compareTo
public class MyKeyYT implements WritableComparable<MyKeyYT>
{
private int year;
private int hot;
public void setYear(int year) {
this.year = year;
}
public int getYear() {
return year;
}
public void setHot(int hot) {
this.hot = hot;
}
public int getHot() {
return hot;
}
//Hadoop moves data as a binary stream over RPC; readFields deserializes that stream back into this object
public void readFields(DataInput dataInput) throws IOException
{
this.year=dataInput.readInt();
this.hot=dataInput.readInt();
}
//Serialization: write year and hot from this object to the binary stream
public void write(DataOutput dataOutput) throws IOException
{
dataOutput.writeInt(year);
dataOutput.writeInt(hot);
}
//Comparison: compare the incoming MyKeyYT o with this object to establish the key ordering
public int compareTo(MyKeyYT o)
{
int myresult = Integer.compare(year,o.getYear());
if(myresult!=0)
return myresult;
return Integer.compare(hot,o.getHot());
}
//Override toString
@Override
public String toString() {
return year+"\t"+hot;
}
//Override hashCode so that keys with the same year and hot hash consistently
@Override
public int hashCode() {
return Integer.hashCode(year + hot);
}
}
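To convince yourself that the write/readFields/compareTo trio behaves as expected before running a full job, a minimal standalone check is sketched below. MyKeyYTRoundTripCheck is a hypothetical helper added here (not part of the original post); it uses Hadoop's in-memory DataOutputBuffer/DataInputBuffer to round-trip a key through serialization:
package temperture;
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;
//Hypothetical standalone check (not in the original post): serialize a MyKeyYT,
//deserialize it into a fresh object, and compare the two with compareTo.
public class MyKeyYTRoundTripCheck {
    public static void main(String[] args) throws Exception {
        MyKeyYT original = new MyKeyYT();
        original.setYear(1950);
        original.setHot(42);
        //write() serializes year and hot into an in-memory buffer
        DataOutputBuffer out = new DataOutputBuffer();
        original.write(out);
        //readFields() rebuilds an object from the same bytes
        DataInputBuffer in = new DataInputBuffer();
        in.reset(out.getData(), out.getLength());
        MyKeyYT copy = new MyKeyYT();
        copy.readFields(in);
        System.out.println(original.compareTo(copy)); //prints 0: the copy equals the original
        System.out.println(copy);                     //prints 1950 and 42 separated by a tab
    }
}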
2) Override the partitioner
We need each year to get its own reducer. The default hash partitioner clearly cannot guarantee this, so we write our own partitioner class that gives every year its own partition. The code is as follows:
package temperture;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
//Extend Partitioner and override the partition method
public class MyPartition extends Partitioner<MyKeyYT, Text> {
//myKeyYT is the map output key, text is the map output value, i is the number of reduce tasks
@Override
public int getPartition(MyKeyYT myKeyYT, Text text, int i)
{
//Partition by year; *200 merely scales the number (see the quick check after this class)
return (myKeyYT.getYear()*200)%i;
}
}
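Why does (year*200)%i separate the years cleanly? With numReduceTasks set to 7 in the driver and 200 sharing no common factor with 7, seven consecutive years land in seven distinct partitions. A quick arithmetic check (PartitionCheck is a hypothetical helper added here, not from the original post):
package temperture;
//Hypothetical check (not in the original post) of where each year lands with 7 reducers.
public class PartitionCheck {
    public static void main(String[] args) {
        int numReduceTasks = 7; //matches myjob.setNumReduceTasks(7) in the driver
        for (int year = 1949; year <= 1955; year++) {
            //same formula as MyPartition.getPartition
            System.out.println(year + " -> partition " + (year * 200) % numReduceTasks);
        }
        //prints: 1949 -> 5, 1950 -> 2, 1951 -> 6, 1952 -> 3, 1953 -> 0, 1954 -> 4, 1955 -> 1
    }
}
Note that this mapping depends on having exactly 7 reducers; with a different reducer count two years could collide in the same partition, which would still run correctly but would put two years' records into one output file.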
3) Override the sort comparator
The default key ordering cannot give us the numeric "year ascending, temperature descending" order we want, so we write a custom sort comparator class. The code is as follows:
package temperture;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
//Custom sort comparator
public class MySortTemp extends WritableComparator
{
//Constructor: register the key class
public MySortTemp()
{
//register MyKeyYT and create instances so the map output keys can be compared
super(MyKeyYT.class,true);
}
//Override the key compare method
public int compare(WritableComparable a, WritableComparable b)
{
MyKeyYT myKeyYT1=(MyKeyYT) a;
MyKeyYT myKeyYT2=(MyKeyYT) b;
int myresult = Integer.compare(myKeyYT1.getYear(),myKeyYT2.getYear()); //year ascending
if(myresult!=0)
return myresult;
return -Integer.compare(myKeyYT1.getHot(),myKeyYT2.getHot()); //negated: temperature descending
}
}
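As a quick sanity check of the descending temperature order, two keys from the same year can be compared directly (SortCheck is a hypothetical helper added here, not from the original post):
package temperture;
//Hypothetical check (not in the original post): within one year, the hotter key sorts first.
public class SortCheck {
    public static void main(String[] args) {
        MyKeyYT a = new MyKeyYT();
        a.setYear(1950);
        a.setHot(42);
        MyKeyYT b = new MyKeyYT();
        b.setYear(1950);
        b.setHot(12);
        //negative result: the 42℃ record orders before the 12℃ record within 1950
        System.out.println(new MySortTemp().compare(a, b));
    }
}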
4) Override the grouping comparator
Before reduce, records are grouped by identical keys by default. Since our key is the (year, temperature) pair, the default grouping would not give us "one group per year" for the reducers to work on, so we write a custom grouping comparator. The code is as follows:
package temperture;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
//Grouping works like sorting: after the sort, keys that compare equal here have their values fed to the same reduce call
public class MyGroup extends WritableComparator {
//Constructor: register the key class
public MyGroup()
{
super(MyKeyYT.class,true);
}
//Copied from the sort comparator, but we group by year only, so only the year part is compared
public int compare(WritableComparable a, WritableComparable b)
{
MyKeyYT myKeyYT1=(MyKeyYT) a;
MyKeyYT myKeyYT2=(MyKeyYT) b;
return Integer.compare(myKeyYT1.getYear(),myKeyYT2.getYear()); //0 means same group
}
}
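The effect of the grouping comparator can also be checked standalone: two keys from 1950 with different temperatures compare as equal (same group), while keys from different years do not (GroupCheck is a hypothetical helper added here, not from the original post):
package temperture;
//Hypothetical check (not in the original post) of the grouping behaviour.
public class GroupCheck {
    public static void main(String[] args) {
        MyKeyYT a = new MyKeyYT();
        a.setYear(1950);
        a.setHot(42);
        MyKeyYT b = new MyKeyYT();
        b.setYear(1950);
        b.setHot(12);
        System.out.println(new MyGroup().compare(a, b)); //0: same year, same group
        b.setYear(1951);
        System.out.println(new MyGroup().compare(a, b)); //negative: different year, different group
    }
}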
5) The driver class with main()
To keep things short this time, the map and reduce classes are nested inside the driver class. Note that the custom components must be registered when the job is set up; otherwise the default classes are loaded, our custom classes are never used, and the job may even fail. The custom part is:
myjob.setMapOutputKeyClass(MyKeyYT.class);//map output key type
myjob.setMapOutputValueClass(Text.class);//map output value type
myjob.setNumReduceTasks(7);//seven reducers, one per year
myjob.setPartitionerClass(MyPartition.class); //custom partitioner
myjob.setSortComparatorClass(MySortTemp.class); //custom sort comparator
myjob.setGroupingComparatorClass(MyGroup.class); //custom grouping comparator
The full driver class is as follows:
package temperture;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
public class MyTemperatureRunJob {
public static SimpleDateFormat sdf=new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
static class MyTemperatureMapper extends Mapper<LongWritable, Text,MyKeyYT,Text>
{
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String []ss =line.split("\t");
if(ss.length==2)
{
try
{
Date mydate=sdf.parse(ss[0]); //parse the datetime field
//extract the year
Calendar myCalendar=Calendar.getInstance();
myCalendar.setTime(mydate);
int year =myCalendar.get(Calendar.YEAR);
String myhot = ss[1].substring(0,ss[1].indexOf("℃"));
//build the custom MyKeyYT key object
MyKeyYT myKeyYT=new MyKeyYT();
myKeyYT.setYear(year);
myKeyYT.setHot(Integer.parseInt(myhot));
context.write(myKeyYT,value);
} catch (Exception e) {
e.printStackTrace();
}
}
}
}
static class MyTemperatureReducer extends Reducer<MyKeyYT,Text,MyKeyYT,Text>
{
@Override
protected void reduce(MyKeyYT key, Iterable<Text> values, Context context) throws IOException, InterruptedException
{
for(Text v:values)
{
context.write(key,v);
}
}
}
public static void main(String[] args)
{
//load the Hadoop configuration used to submit this job
Configuration conf =new Configuration();
//mapreduce.job.tracker can be set here to match the cluster's mapred-site.xml;
//when running on the cluster it is picked up automatically,
//so this line can be omitted.
// conf.set("mapreduce.job.tracker","dw-cluster-master:9001");
try
{
//MapReduce creates the output folder automatically,
//but the job fails if the target folder already exists;
//deleting it first makes the program rerunnable.
Path outputPath= new Path(args[2]);
FileSystem fileSystem =FileSystem.get(conf);
if(fileSystem.exists(outputPath)){
fileSystem.delete(outputPath,true);
System.out.println("outputPath is exist,but has deleted!");
}
Job myjob= Job.getInstance(conf);
myjob.setJarByClass(MyTemperatureRunJob.class);//locate the jar that contains this driver class
myjob.setMapperClass(MyTemperatureMapper.class);//mapper class
myjob.setReducerClass(MyTemperatureReducer.class);//reducer class
myjob.setMapOutputKeyClass(MyKeyYT.class);//map output key type
myjob.setMapOutputValueClass(Text.class);//map output value type
myjob.setNumReduceTasks(7);//seven reducers, one per year
myjob.setPartitionerClass(MyPartition.class); //custom partitioner
myjob.setSortComparatorClass(MySortTemp.class); //custom sort comparator
myjob.setGroupingComparatorClass(MyGroup.class); //custom grouping comparator
//args[1] is used because args[0] is taken by the main-class name passed on the command line
FileInputFormat.addInputPath(myjob,new Path(args[1]));//job input path: the second argument after the jar
//FileInputFormat.addInputPath(myjob,new Path("/tmp/wcinput/wordcount.xt"));
//job output path: args[2], the third argument after the jar
FileOutputFormat.setOutputPath(myjob,new Path(args[2]));
//FileOutputFormat.setOutputPath(myjob,new Path("/tmp/wcoutput"));
System.exit(myjob.waitForCompletion(true)?0:1);//wait for the job to finish and exit with its status
}
catch (Exception e)
{
e.printStackTrace();
}
}
}
Deployment and invocation
For packaging and deployment, see the packaging section of the earlier post 《Hadoop集群大数据解决方案之IDE配Maven实现MapReduce 程序实战(五)》.
Upload the jar to the cluster and invoke it with a command like:
hadoop jar hadoop_mr_temperature.jar MyTemperatureRunJob /tmp/input/data1.txt /tmp/output/
The full run looks like this:
[liuxiaowei@dw-cluster-master temperature]$ hadoop jar hadoop_mr_temperature.jar MyTemperatureRunJob /tmp/input/data1.txt /tmp/output/
outputPath is exist,but has deleted!
20/02/03 15:40:20 INFO client.RMProxy: Connecting to ResourceManager at dw-cluster-master/10.216.10.141:8032
20/02/03 15:40:21 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
20/02/03 15:40:22 INFO input.FileInputFormat: Total input files to process : 1
20/02/03 15:40:22 INFO mapreduce.JobSubmitter: number of splits:1
20/02/03 15:40:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1578394893972_0056
20/02/03 15:40:22 INFO impl.YarnClientImpl: Submitted application application_1578394893972_0056
20/02/03 15:40:22 INFO mapreduce.Job: The url to track the job: http://dw-cluster-master:8088/proxy/application_1578394893972_0056/
20/02/03 15:40:22 INFO mapreduce.Job: Running job: job_1578394893972_0056
20/02/03 15:40:28 INFO mapreduce.Job: Job job_1578394893972_0056 running in uber mode : false
20/02/03 15:40:28 INFO mapreduce.Job: map 0% reduce 0%
20/02/03 15:40:33 INFO mapreduce.Job: map 100% reduce 0%
20/02/03 15:40:38 INFO mapreduce.Job: map 100% reduce 29%
20/02/03 15:40:39 INFO mapreduce.Job: map 100% reduce 100%
20/02/03 15:40:40 INFO mapreduce.Job: Job job_1578394893972_0056 completed successfully
20/02/03 15:40:41 INFO mapreduce.Job: Counters: 50
File System Counters
FILE: Number of bytes read=942
FILE: Number of bytes written=1291003
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=761
HDFS: Number of bytes written=850
HDFS: Number of read operations=24
HDFS: Number of large read operations=0
HDFS: Number of write operations=14
Job Counters
Killed reduce tasks=1
Launched map tasks=1
Launched reduce tasks=7
Rack-local map tasks=1
Total time spent by all maps in occupied slots (ms)=3331
Total time spent by all reduces in occupied slots (ms)=18830
Total time spent by all map tasks (ms)=3331
Total time spent by all reduce tasks (ms)=18830
Total vcore-milliseconds taken by all map tasks=3331
Total vcore-milliseconds taken by all reduce tasks=18830
Total megabyte-milliseconds taken by all map tasks=3410944
Total megabyte-milliseconds taken by all reduce tasks=19281920
Map-Reduce Framework
Map input records=25
Map output records=25
Map output bytes=850
Map output materialized bytes=942
Input split bytes=111
Combine input records=0
Combine output records=0
Reduce input groups=7
Reduce shuffle bytes=942
Reduce input records=25
Reduce output records=25
Spilled Records=50
Shuffled Maps =7
Failed Shuffles=0
Merged Map outputs=7
GC time elapsed (ms)=481
CPU time spent (ms)=6580
Physical memory (bytes) snapshot=2787332096
Virtual memory (bytes) snapshot=51036344320
Total committed heap usage (bytes)=2913992704
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=650
File Output Format Counters
Bytes Written=850
The output, shown in Figure 1, covers 7 years with one reducer per year, so there are 7 output files:
Each year's maximum temperature is simply the first line of the corresponding file, and the top 3 are its first three lines. The full output for the test data is:
[liuxiaowei@dw-cluster-master temperature]$ hadoop fs -cat /tmp/output/part-*
1953 48 1953-08-06 14:21:01 48℃
1953 47 1953-10-02 14:21:01 47℃
1953 45 1953-07-01 14:21:01 45℃
1953 40 1953-04-02 14:21:01 40℃
1953 34 1953-05-02 14:21:01 34℃
1953 32 1953-09-02 14:21:01 32℃
1955 49 1955-09-02 14:21:01 49℃
1955 42 1955-06-02 14:21:01 42℃
1955 28 1955-07-02 14:21:01 28℃
1950 45 1950-10-26 14:21:01 45℃
1950 42 1950-10-24 13:21:01 42℃
1950 38 1950-01-23 14:21:01 38℃
1950 12 1950-10-23 08:21:01 12℃
1952 45 1952-08-02 14:21:01 45℃
1952 45 1952-11-02 14:21:01 45℃
1952 43 1952-04-02 14:21:01 43℃
1954 50 1954-05-09 14:21:01 50℃
1954 45 1954-05-02 14:21:01 45℃
1954 36 1954-06-02 14:21:01 36℃
1949 40 1949-06-03 13:45:01 40℃
1949 38 1949-05-01 14:21:01 38℃
1949 29 1949-09-02 14:21:01 29℃
1951 48 1951-08-02 14:21:01 48℃
1951 40 1951-08-01 14:21:01 40℃
1951 40 1951-12-18 14:21:01 40℃
Taking the top 3 looks like this; taking the top 1 is left as an exercise (a sketch that does top-N inside the reducer follows the output below).
[liuxiaowei@dw-cluster-master temperature]$ hadoop fs -cat /tmp/output/part-r-00000 | head -n3
1953 48 1953-08-06 14:21:01 48℃
1953 47 1953-10-02 14:21:01 47℃
1953 45 1953-07-01 14:21:01 45℃
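If you want each reducer to emit only the top N records itself rather than filtering with head, a minimal sketch follows. MyTopNReducer is a hypothetical variant (not part of the original post); it relies on the custom sort and grouping above, which guarantee that the values for each year arrive at reduce() already ordered by temperature descending, so it simply stops after N records:
package temperture;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
//Hypothetical top-N variant of the reducer (not in the original post).
//Within one reduce() call the values are already sorted hottest-first,
//so emitting the first TOP_N of them yields that year's hottest N records.
public class MyTopNReducer extends Reducer<MyKeyYT, Text, MyKeyYT, Text> {
    private static final int TOP_N = 3; //set to 1 for the single hottest record per year
    @Override
    protected void reduce(MyKeyYT key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int emitted = 0;
        for (Text v : values) {
            if (emitted++ >= TOP_N) {
                break; //the remaining records of this year are cooler
            }
            context.write(key, v);
        }
    }
}
To use it, register it in the driver with myjob.setReducerClass(MyTopNReducer.class) in place of MyTemperatureReducer.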
Full project on GitHub
Source: CSDN
Author: 卖脐橙的鬼谷大师兄
Link: https://blog.csdn.net/LXWalaz1s1s/article/details/104152527