Integrating HBase with MapReduce and Hive

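All HBase data is ultimately stored on HDFS, and HBase supports MapReduce natively: an MR job can process HBase data directly and write its results straight back to HBase.
Requirement: read the data from one HBase table and write it to another HBase table. Note: TableMapper and TableReducer can be used to read from and write to HBase.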
http://hbase.apache.org/2.0/book.html#mapreduce

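Requirement 1: read the data in the myuser table and write it to another HBase table

Here we copy the name and age columns of column family f1 in the myuser table into column family f1 of the myuser2 table.

Step 1: create the myuser2 table

Note: the column family name must match the one used by the myuser table.
hbase(main):010:0> create 'myuser2','f1'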

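Step 2: create a Maven project and import the jars

Note: on top of the dependencies already imported in the earlier project, only the following need to be added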

  <!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-mapreduce -->
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-mapreduce</artifactId>
    <version>2.0.0</version>
</dependency>

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.5</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.7.5</version>
</dependency>

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.5</version>
</dependency>

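Step 3: write the MR program

Define the mapper class

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class HBaseMapper extends TableMapper<Text,Put> {
    /**
     * @param key  the rowkey
     * @param value  one row of data
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context) 
    throws IOException, InterruptedException {
        //  f1  name   age    f2   xxx
        // Get the rowkey (copyBytes returns only the bytes that belong to the key)
        byte[] bytes = key.copyBytes();
        Put put = new Put(bytes);
        // Get all the cells in this Result
        List<Cell> cells = value.listCells();
        for (Cell cell : cells) {
            // Which column family does the cell belong to?
            byte[] family = CellUtil.cloneFamily(cell);
            // Which column (qualifier) does the cell belong to?
            byte[] qualifier = CellUtil.cloneQualifier(cell);
            if(Bytes.toString(family).equals("f1")){
                if(Bytes.toString(qualifier).equals("name") || 
                Bytes.toString(qualifier).equals("age")){
                    put.add(cell);
                }
            }
        }
        // Skip rows that contain neither f1:name nor f1:age
        if(!put.isEmpty()){
            context.write(new Text(Bytes.toString(bytes)),put);
        }
    }
}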
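The filtering rule inside the mapper can be tried out without a cluster. The following is a minimal plain-Java sketch of the same rule (keep only f1:name and f1:age); SimpleCell and ColumnFilterSketch are hypothetical stand-ins invented for this illustration and are not HBase types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration of the mapper's filter; nothing here is an HBase API.
class ColumnFilterSketch {
    // Hypothetical stand-in for an HBase Cell: family, qualifier and value as strings.
    static final class SimpleCell {
        final String family;
        final String qualifier;
        final String value;

        SimpleCell(String family, String qualifier, String value) {
            this.family = family;
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    private static final String FAMILY = "f1";
    private static final Set<String> WANTED = new HashSet<>(Arrays.asList("name", "age"));

    // Keep only the cells that belong to f1:name or f1:age.
    static List<SimpleCell> keepWanted(List<SimpleCell> cells) {
        List<SimpleCell> kept = new ArrayList<>();
        for (SimpleCell cell : cells) {
            if (FAMILY.equals(cell.family) && WANTED.contains(cell.qualifier)) {
                kept.add(cell);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<SimpleCell> row = Arrays.asList(
                new SimpleCell("f1", "name", "zhangsan"),
                new SimpleCell("f1", "age", "18"),
                new SimpleCell("f2", "address", "xxx"));
        for (SimpleCell cell : keepWanted(row)) {
            System.out.println(cell.family + ":" + cell.qualifier + "=" + cell.value);
        }
    }
}
```

Holding the wanted qualifiers in a set makes it easy to copy additional columns later without adding more conditions.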

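Define the reducer class

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

/**
 * Text  the type of key2
 * Put   the type of value2
 * ImmutableBytesWritable   the type of k3
 * The v3 type is not declared here: TableReducer fixes it to Mutation (a Put in our case)
 * put 'myuser2','rowkey','f1:name','zhangsan'
 * With the Java API, writing is done simply through a Put object
 *
 */
public class HBaseReducer extends TableReducer<Text,Put,ImmutableBytesWritable> {
    /**
     *
     * @param key  our key2
     * @param values  our v2
     * @param context  used to write the data out
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void reduce(Text key, Iterable<Put> values, Context context) 
    throws IOException, InterruptedException {
        for (Put put : values) {
            context.write(new ImmutableBytesWritable(Bytes.toBytes(key.toString())),put);
        }
    }
}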

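Define the main method that runs the program

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class HBaseMrMain extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(super.getConf(), "hbaseMR");
        Scan scan = new Scan();
        // Use the utility class to initialize the mapper
        /**
         * String table, Scan scan,
         Class<? extends TableMapper> mapper,
         Class<?> outputKeyClass,
         Class<?> outputValueClass, Job job
         */
        TableMapReduceUtil.initTableMapperJob("myuser",scan,HBaseMapper.class, Text.class, Put.class,job);
        /**
         * String table,
         Class<? extends TableReducer> reducer, Job job
         */
        TableMapReduceUtil.initTableReducerJob("myuser2",HBaseReducer.class,job);

        boolean b = job.waitForCompletion(true);
        return b?0:1;
    }

    public static void main(String[] args) throws Exception {
        Configuration configuration = HBaseConfiguration.create();
        configuration.set("hbase.zookeeper.quorum", "node01:2181,node02:2181");
        int run = ToolRunner.run(configuration, new HBaseMrMain(), args);
        System.exit(run);
    }
}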

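Step 4: run the job

Option 1: run locally

Select the class that contains the main method and run it directly.

Option 2: package and run on the cluster

Note: a packaging plugin is needed so that the HBase dependency jars are bundled into the project jar.

Step 1: add the packaging plugin to pom.xml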

<plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.4.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <minimizeJar>true</minimizeJar>
                    </configuration>
                </execution>
            </executions>
        </plugin>

Step 2: add the following line to the code (in the run method of HBaseMrMain)

job.setJarByClass(HBaseMrMain.class);

Step 3: package with Maven

Build the package, upload the resulting jar to the Linux server, and run

yarn jar hbaseStudy-1.0-SNAPSHOT.jar  cn.itcast.hbasemr.HBaseMrMain

# Alternatively, set the environment variables yourself and run the smaller original-* jar
export HADOOP_HOME=/export/servers/hadoop-2.7.5/
export HBASE_HOME=/export/servers/hbase-2.0.0/
export HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`
yarn jar original-hbaseStudy-1.0-SNAPSHOT.jar  cn.itcast.hbasemr.HBaseMrMain

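Requirement 2: read a file from HDFS and write it into an HBase table

Read the HDFS path /hbase/input/user.txt and write the data into the myuser2 table

Step 1: prepare the data file

Prepare the data file and upload it to HDFS. The fields on each line must be separated by tab characters, because the code below splits each line on \t.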

hdfs dfs -mkdir -p /hbase/input
cd /export/servers/
vim user.txt

0007    zhangsan        18
0008    lisi    25
0009    wangwu  20

Upload the file to HDFS
hdfs dfs -put user.txt /hbase/input

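Step 2: write the MR program

Define the mapper class

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HdfsMapper extends Mapper<LongWritable,Text,Text,NullWritable> {
    /*
    The map phase does no processing; it simply passes each line through
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) 
    throws IOException, InterruptedException {
        context.write(value,NullWritable.get());
    }
}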

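Define the reducer class

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

public class HBaseWriteReducer extends TableReducer<Text,NullWritable,ImmutableBytesWritable> {
    /**
     * @param key  one whole line of the input file: rowkey, name and age separated by tabs
     * @param values  unused placeholders
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context) 
    throws IOException, InterruptedException {
        String[] split = key.toString().split("\t");

        Put put = new Put(Bytes.toBytes(split[0]));
        put.addColumn(Bytes.toBytes("f1"),Bytes.toBytes("name"),Bytes.toBytes(split[1]));
        put.addColumn(Bytes.toBytes("f1"),Bytes.toBytes("age"),Bytes.toBytes(split[2]));
        // An ImmutableBytesWritable can hold the rowkey, a value, and so on
        context.write(new ImmutableBytesWritable(Bytes.toBytes(split[0])),put);
    }
}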
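The reducer above assumes that every line has exactly three tab-separated fields; a line separated by spaces or missing a field would raise an ArrayIndexOutOfBoundsException. A minimal sketch of a defensive parser follows. UserLineParser and its parse method are hypothetical helpers invented for this illustration, not part of the original project.

```java
// Hypothetical helper: validates one line of user.txt before a Put is built from it.
class UserLineParser {
    // Returns {rowKey, name, age}, or null when the line is malformed.
    static String[] parse(String line) {
        if (line == null) {
            return null;
        }
        String[] fields = line.split("\t");
        if (fields.length != 3) {
            return null; // wrong number of fields, e.g. spaces instead of tabs
        }
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].trim();
            if (fields[i].isEmpty()) {
                return null; // an empty rowkey, name or age is not usable
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String[] ok = parse("0007\tzhangsan\t18");
        System.out.println(ok == null ? "malformed" : ok[0] + " / " + ok[1] + " / " + ok[2]);
        System.out.println(parse("0008    lisi    25") == null ? "malformed" : "ok");
    }
}
```

In the reducer, a null result could simply be skipped (or counted with a job counter) instead of failing the whole task.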

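Define the main method that runs the program

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Hdfs2HBaseMain extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(super.getConf(), "hdfsToHbase");
        // Step 1: read the file and parse it into key/value pairs
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job,new Path("hdfs://node01:8020/hbase/input"));
        // Step 2: custom mapper, takes k1/v1 and emits new k2/v2
        job.setMapperClass(HdfsMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        // Step 3: partition
        // Step 4: sort
        // Step 5: combine
        // Step 6: group
        // Step 7: reduce logic, takes k2/v2 and emits new k3/v3
        TableMapReduceUtil.initTableReducerJob("myuser2",HBaseWriteReducer.class,job);
        // Step 8: write the output
        boolean b = job.waitForCompletion(true);
        return b?0:1;
    }


    public static void main(String[] args) throws Exception {
        Configuration configuration = HBaseConfiguration.create();
        configuration.set("hbase.zookeeper.quorum", "node01:2181,node02:2181");
        int run = ToolRunner.run(configuration, new Hdfs2HBaseMain(), args);
        System.exit(run);
    }
}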

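Requirement 3: load data into HBase in bulk with bulkload

There are many ways to load data into HBase. We can write it through the HBase Java API or import it with Sqoop, but these approaches are either slow or consume Region resources during the import, which lowers efficiency. Alternatively, an MR job can convert the data directly into HFile, HBase's final storage format, and the resulting files can then simply be loaded into HBase.
Each HBase table is stored as a directory under the HBase root directory (/HBase), named after the table (in HBase 0.96 and later the table directories sit under data/<namespace>/ beneath the root). Inside the table directory each Region has its own directory, inside each Region directory every column family has its own directory, and each column family directory holds a number of HFile files. HFile is the format in which HBase data is stored on HDFS, so HBase data ultimately appears on HDFS as HFiles. If we convert our data directly into HFile format, HBase can load and read those files as they are.
Advantages:

1. The import does not consume Region resources

2. Massive amounts of data can be imported quickly

3. It saves memory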
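The directory layout described above can be sketched as a small path builder. This is only an illustration: HFilePathSketch and storeDir are names invented here, the region name is a placeholder, and the sketch assumes the layout of HBase 0.96 and later, where table directories sit under data/<namespace> beneath the root directory.

```java
// Illustration only: builds the HDFS directory that holds the HFiles of one column family.
class HFilePathSketch {
    // rootDir: the HBase root directory on HDFS, e.g. "/hbase" (assumed value)
    // encodedRegion: the encoded region name (placeholder in this example)
    static String storeDir(String rootDir, String namespace, String table,
                           String encodedRegion, String family) {
        return String.join("/", rootDir, "data", namespace, table, encodedRegion, family);
    }

    public static void main(String[] args) {
        // Tables created without an explicit namespace live in the "default" namespace.
        System.out.println(storeDir("/hbase", "default", "myuser2", "REGION_PLACEHOLDER", "f1"));
    }
}
```

A bulk load ends by moving the generated HFiles into directories of this shape, which is why no write path through the RegionServer is involved.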

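[Figure: the normal HBase read/write flow]

[Figure: with bulkload, the data is written directly as HFiles and then loaded into the HBase table]
Requirement: convert the data file at /hbase/input/user.txt on HDFS into HFile format, then load it into the myuser2 table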

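Step 1: define the mapper class

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * LongWritable  the type of k1
 * Text   the type of v1
 * ImmutableBytesWritable   the rowkey
 * Put  the object to insert
 */
public class BulkLoadMapper extends Mapper<LongWritable,Text,ImmutableBytesWritable,Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, 
    InterruptedException {
        // Each input line is: rowkey \t name \t age
        String[] split = value.toString().split("\t");
        Put put = new Put(Bytes.toBytes(split[0]));
        put.addColumn(Bytes.toBytes("f1"),Bytes.toBytes("name"),Bytes.toBytes(split[1]));
        put.addColumn(Bytes.toBytes("f1"),Bytes.toBytes("age"),Bytes.toBytes(split[2]));
        context.write(new ImmutableBytesWritable(Bytes.toBytes(split[0])),put);
    }
}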

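Step 2: write the main entry class

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class BulkLoadMain extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = super.getConf();
        TableName tableName = TableName.valueOf("myuser2");
        // try-with-resources closes the connection, table and locator when the job is done
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(tableName);
             RegionLocator regionLocator = connection.getRegionLocator(tableName)) {
            Job job = Job.getInstance(conf, "bulkLoad");
            // Read the file and parse it into key/value pairs
            job.setInputFormatClass(TextInputFormat.class);
            TextInputFormat.addInputPath(job,new Path("hdfs://node01:8020/hbase/input"));
            // Set the mapper class
            job.setMapperClass(BulkLoadMapper.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);
            // We do not write a reducer of our own
            /**
             * Job job, Table table, RegionLocator regionLocator
             *  configureIncrementalLoad configures which table and column families the HFiles are generated for
             */
            HFileOutputFormat2.configureIncrementalLoad(job,table,regionLocator);
            // Set the output format so that the data is written as HFiles
            job.setOutputFormatClass(HFileOutputFormat2.class);
            // Set the output path
            HFileOutputFormat2.setOutputPath(job,new Path("hdfs://node01:8020/hbase/hfile_out"));
            boolean b = job.waitForCompletion(true);
            return b?0:1;
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration configuration = HBaseConfiguration.create();
        configuration.set("hbase.zookeeper.quorum", "node01:2181,node02:2181");
        int run = ToolRunner.run(configuration, new BulkLoadMain(), args);
        System.exit(run);
    }
}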
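HFiles store their cells sorted by rowkey, and HBase compares rowkeys as raw bytes, lexicographically and unsigned, not as numbers. The job above does not sort anything itself because, as far as I understand it, configureIncrementalLoad wires in the sorting and partitioning. The ordering is still worth knowing when designing rowkeys, so here is a minimal plain-Java sketch of it. RowKeyOrderSketch and its methods are names invented for this illustration; the comparison mirrors the rule but is not HBase code.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration of byte-wise, unsigned, lexicographic rowkey ordering.
class RowKeyOrderSketch {
    static int compareRowKeys(byte[] left, byte[] right) {
        int common = Math.min(left.length, right.length);
        for (int i = 0; i < common; i++) {
            int a = left[i] & 0xff;   // treat each byte as unsigned
            int b = right[i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        return left.length - right.length; // a shorter key sorts before its extensions
    }

    // Returns a sorted copy of the given rowkeys.
    static List<String> sortKeys(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort((l, r) -> compareRowKeys(
                l.getBytes(StandardCharsets.UTF_8),
                r.getBytes(StandardCharsets.UTF_8)));
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(sortKeys(Arrays.asList("0009", "0007", "10", "9", "0008")));
    }
}
```

Zero-padded keys such as 0007, 0008 and 0009 keep their numeric order under this comparison, whereas unpadded keys do not: "10" sorts before "9".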

Step 3: package the code as a jar and run it

yarn jar original-hbaseStudy-1.0-SNAPSHOT.jar  cn.itcast.hbasemr.BulkLoadMain

Step 4: write the code that loads the data

Load the HFiles under the job's output path into the HBase table

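import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
// HBase 2.0 location of the class; the older org.apache.hadoop.hbase.mapreduce one is deprecated
import org.apache.hadoop.hbase.tool.LoadIncrementalHFiles;

public class LoadData {
    public static void main(String[] args) throws Exception {
        Configuration configuration = HBaseConfiguration.create();
        configuration.set("hbase.zookeeper.property.clientPort", "2181");
        configuration.set("hbase.zookeeper.quorum", "node01,node02,node03");

        TableName tableName = TableName.valueOf("myuser2");
        try (Connection connection = ConnectionFactory.createConnection(configuration);
             Admin admin = connection.getAdmin();
             Table table = connection.getTable(tableName);
             RegionLocator regionLocator = connection.getRegionLocator(tableName)) {
            LoadIncrementalHFiles load = new LoadIncrementalHFiles(configuration);
            // The path must be the output path of the bulkLoad job above
            load.doBulkLoad(new Path("hdfs://node01:8020/hbase/hfile_out"), admin, table, regionLocator);
        }
    }

}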

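Alternatively, the data can be loaded from the command line.
First add the HBase jars to Hadoop's classpath

export HBASE_HOME=/export/servers/hbase-2.0.0/
export HADOOP_HOME=/export/servers/hadoop-2.7.5/
export HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`

Then run the following command to import the HFiles directly into the myuser2 table. Note that the jar name below (hbase-server-1.2.0-cdh5.14.0.jar) comes from a CDH build and does not match the hbase-2.0.0 installation used elsewhere in this article; substitute the jar in your own lib directory that provides completebulkload.

yarn jar /export/servers/hbase-2.0.0/lib/hbase-server-1.2.0-cdh5.14.0.jar \
completebulkload /hbase/hfile_out myuser2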

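13. HBase compared with Hive

Hive

A data warehouse tool

In essence, Hive keeps a mapping, stored as metadata in MySQL, onto files that already exist in HDFS, so that they can be managed and queried with HQL.

Used for data analysis and cleansing

Hive is suited to offline data analysis and cleansing; its latency is relatively high.

Built on HDFS and MapReduce

The data Hive manages still lives on the DataNodes, and HQL statements are ultimately translated into MapReduce code for execution.

HBase

A NoSQL database

A column-oriented, non-relational database.

Used to store structured and unstructured data

Suited to storing non-relational data in single tables; not suited to relational queries such as JOIN.

Built on HDFS

Data is persisted as HFiles on the DataNodes and is managed by the RegionServers in the form of regions.

Low latency, suitable for online services

Faced with large volumes of enterprise data, HBase can store very large single tables while still providing efficient data access.

Summary: Hive and HBase

Hive and HBase are two different technologies built on Hadoop. Hive is a SQL-like engine that runs MapReduce jobs, while HBase is a NoSQL key/value database on top of Hadoop. The two tools can be used together. Just as you use Google for search and Facebook for social networking, Hive can be used for statistical queries and HBase for real-time queries; data can also be written from Hive to HBase, or from HBase back to Hive.

14. Integrating Hive with HBase

Hive and HBase each have their strengths and serve different purposes, but in the end the data of both is stored on HDFS. To save disk space we normally avoid storing one copy of the data in several places. Instead we can store the data in HBase and, through the Hive-HBase integration, analyze the HBase data directly with SQL statements, which is very convenient.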

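Requirement 1: save the results of a Hive analysis into HBase

Step 1: make the five HBase dependency jars available in Hive's lib directory

Five HBase jars need to be placed in Hive's lib directory
The HBase jars are all in /export/servers/hbase-2.0.0/lib
The five jars we need are
hbase-client-2.0.0.jar
hbase-hadoop2-compat-2.0.0.jar
hbase-hadoop-compat-2.0.0.jar
hbase-it-2.0.0.jar
hbase-server-2.0.0.jar
Instead of copying them, we run the following commands on node03 to create symbolic links to the jars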

ln -s /export/servers/hbase-2.0.0/lib/hbase-client-2.0.0.jar /export/servers/apache-hive-2.1.0-bin/lib/hbase-client-2.0.0.jar
ln -s /export/servers/hbase-2.0.0/lib/hbase-hadoop2-compat-2.0.0.jar /export/servers/apache-hive-2.1.0-bin/lib/hbase-hadoop2-compat-2.0.0.jar
ln -s /export/servers/hbase-2.0.0/lib/hbase-hadoop-compat-2.0.0.jar /export/servers/apache-hive-2.1.0-bin/lib/hbase-hadoop-compat-2.0.0.jar
ln -s /export/servers/hbase-2.0.0/lib/hbase-it-2.0.0.jar /export/servers/apache-hive-2.1.0-bin/lib/hbase-it-2.0.0.jar
ln -s /export/servers/hbase-2.0.0/lib/hbase-server-2.0.0.jar /export/servers/apache-hive-2.1.0-bin/lib/hbase-server-2.0.0.jar

Step 2: modify the Hive configuration file

On node03, edit Hive's configuration file hive-site.xml and add the following two properties
cd /export/servers/apache-hive-2.1.0-bin/conf
vim hive-site.xml

<property>
                <name>hive.zookeeper.quorum</name>
                <value>node01,node02,node03</value>
        </property>

         <property>
                <name>hbase.zookeeper.quorum</name>
                <value>node01,node02,node03</value>
        </property>

Step 3: add the following settings to hive-env.sh

cd /export/servers/apache-hive-2.1.0-bin/conf
vim hive-env.sh

export HADOOP_HOME=/export/servers/hadoop-2.7.5
export HBASE_HOME=/export/servers/hbase-2.0.0
export HIVE_CONF_DIR=/export/servers/apache-hive-2.1.0-bin/conf

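Step 4: create a table in Hive and load the following data

Create the table in Hive
Start the Hive client
cd /export/servers/apache-hive-2.1.0-bin/
bin/hive

Create the Hive database and the corresponding Hive table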

create database course;
use course;
create external table if not exists course.score(id int,cname string,score int) row format delimited fields terminated by '\t' stored as textfile ;

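Prepare the following data
On node03, run the following commands to create the data file (the rows match the scan output shown in step 6)
cd /export/
vim hive-hbase.txt

1	zhangsan	80
2	lisi	60
3	wangwu	30
4	zhaoliu	70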

Load the data
Load the data from the Hive client

hive (course)> load data local inpath '/export/hive-hbase.txt' into table score;
hive (course)> select * from score;

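Step 5: create a Hive managed table mapped to HBase

We can create a Hive managed table that is mapped to a table in HBase; all data in the Hive managed table is then stored in HBase. No :key entry appears in the column mapping below, so the first Hive column (id) is used as the HBase rowkey.
Create the managed (internal) table in Hive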

create table course.hbase_score(id int,cname string,score int)  
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'  
with serdeproperties("hbase.columns.mapping" = "cf:name,cf:score") 
tblproperties("hbase.table.name" = "hbase_score");

Insert the data with insert overwrite ... select

insert overwrite table course.hbase_score select id,cname,score from course.score;

Step 6: inspect the hbase_score table in HBase

Open the HBase shell, check that the hbase_score table exists, and look at its data

hbase(main):023:0> list
TABLE                                                                                       
hbase_score                                                                                 
myuser                                                                                      
myuser2                                                                                     
student                                                                                     
user                                                                                        
5 row(s) in 0.0210 seconds

=> ["hbase_score", "myuser", "myuser2", "student", "user"]
hbase(main):024:0> scan 'hbase_score'
ROW                      COLUMN+CELL                                                        
 1                       column=cf:name, timestamp=1550628395266, value=zhangsan            
 1                       column=cf:score, timestamp=1550628395266, value=80                 
 2                       column=cf:name, timestamp=1550628395266, value=lisi                
 2                       column=cf:score, timestamp=1550628395266, value=60                 
 3                       column=cf:name, timestamp=1550628395266, value=wangwu              
 3                       column=cf:score, timestamp=1550628395266, value=30                 
 4                       column=cf:name, timestamp=1550628395266, value=zhaoliu             
 4                       column=cf:score, timestamp=1550628395266, value=70                 
4 row(s) in 0.0360 seconds

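Requirement 2: create a Hive external table that maps an existing HBase table

Step 1: create a table in HBase and insert some data by hand

Open the HBase shell, create a table manually, and insert some data into it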

create 'hbase_hive_score',{ NAME =>'cf'}
put 'hbase_hive_score','1','cf:name','zhangsan'
put 'hbase_hive_score','1','cf:score', '95'
put 'hbase_hive_score','2','cf:name','lisi'
put 'hbase_hive_score','2','cf:score', '96'
put 'hbase_hive_score','3','cf:name','wangwu'
put 'hbase_hive_score','3','cf:score', '97'

When the commands succeed, the output looks like this:

hbase(main):049:0> create 'hbase_hive_score',{ NAME =>'cf'}
0 row(s) in 1.2970 seconds

=> Hbase::Table - hbase_hive_score
hbase(main):050:0> put 'hbase_hive_score','1','cf:name','zhangsan'
0 row(s) in 0.0600 seconds

hbase(main):051:0> put 'hbase_hive_score','1','cf:score', '95'
0 row(s) in 0.0310 seconds

hbase(main):052:0> put 'hbase_hive_score','2','cf:name','lisi'
0 row(s) in 0.0230 seconds

hbase(main):053:0> put 'hbase_hive_score','2','cf:score', '96'
0 row(s) in 0.0220 seconds

hbase(main):054:0> put 'hbase_hive_score','3','cf:name','wangwu'
0 row(s) in 0.0200 seconds

hbase(main):055:0> put 'hbase_hive_score','3','cf:score', '97'
0 row(s) in 0.0250 seconds

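Step 2: create the Hive external table and map the HBase table and its columns

Create the external table in Hive.
Open the Hive client and run the following statement to create the external table; it maps the data of the HBase table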

CREATE external TABLE course.hbase2hive(id int, name string, score int) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:name,cf:score") TBLPROPERTIES("hbase.table.name" ="hbase_hive_score");
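The entries of hbase.columns.mapping are matched to the Hive columns by position: the first entry belongs to the first Hive column, and so on, with :key standing for the HBase rowkey. The following plain-Java sketch shows that pairing for the statement above. ColumnMappingSketch and pair are names invented for this illustration; this is not Hive's own parsing code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration of how Hive columns line up with the hbase.columns.mapping entries.
class ColumnMappingSketch {
    // Pairs each Hive column with the mapping entry at the same position.
    static Map<String, String> pair(String[] hiveColumns, String mapping) {
        String[] targets = mapping.split(",");
        if (targets.length != hiveColumns.length) {
            throw new IllegalArgumentException(
                    "expected " + hiveColumns.length + " mapping entries but found " + targets.length);
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < hiveColumns.length; i++) {
            result.put(hiveColumns[i], targets[i].trim());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> pairs =
                pair(new String[] {"id", "name", "score"}, ":key,cf:name,cf:score");
        for (Map.Entry<String, String> entry : pairs.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

Read this way, the id column of course.hbase2hive comes from the rowkey of hbase_hive_score, while name and score come from cf:name and cf:score.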

Source: https://blog.csdn.net/Imflash/article/details/101147313
