The complete Hive DDL CREATE TABLE syntax
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name  -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...)
    [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)  -- (Note: Available in Hive 0.10.0 and later)
    ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
    [STORED AS DIRECTORIES]]
  [
    [ROW FORMAT row_format]
    [STORED AS file_format]
      | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]  -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];  -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
Hive internal (managed) table: CREATE TABLE [IF NOT EXISTS] table_name
When the table is dropped, both the metadata and the data are deleted.
Hive external table: CREATE EXTERNAL TABLE [IF NOT EXISTS] table_name LOCATION hdfs_path
Dropping an external table deletes only the metadata in the metastore; the table data in HDFS is kept.
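To make the drop behavior concrete, a minimal sketch (the table names and the HDFS path here are illustrative, not from the source):

```sql
-- Managed (internal) table: Hive owns the data files under its warehouse directory.
CREATE TABLE IF NOT EXISTS logs_managed (id INT, msg STRING);

-- External table: Hive records only the schema; the files at the given
-- HDFS path are not owned by Hive.
CREATE EXTERNAL TABLE IF NOT EXISTS logs_external (id INT, msg STRING)
LOCATION '/data/logs';

DROP TABLE logs_managed;   -- removes the metadata AND the files in the warehouse
DROP TABLE logs_external;  -- removes only the metastore entry; /data/logs remains
```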
Hive CREATE TABLE examples
CREATE TABLE person (
  id INT,
  name STRING,
  age INT,
  likes ARRAY<STRING>,
  address MAP<STRING,STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','            -- columns in each row separated by ','
  COLLECTION ITEMS TERMINATED BY '-'  -- collection (array) items separated by '-'
  MAP KEYS TERMINATED BY ':'          -- map keys and values separated by ':'
  LINES TERMINATED BY '\n';           -- one record per line

-- Copy only the schema of an existing table:
CREATE TABLE empty_key_value_store LIKE key_value_store;

-- Create a table from a query result (CTAS):
CREATE TABLE new_key_value_store AS SELECT columnA, columnB FROM key_value_store;
Hive partitions
Static partitions
-- Single-partition table, partitioned by day. The table has three columns
-- (id, content, dt); dt is the partition column and names the partition directory:
create table day_table (id int, content string) partitioned by (dt string);

-- Two-level partition table, by day and hour. dt and hour are added as partition
-- columns: dt is the top-level directory, hour the subdirectory:
create table day_hour_table (id int, content string) partitioned by (dt string, hour string);

Add a partition after the table has been created:
ALTER TABLE day_hour_table ADD PARTITION (dt='2008-08-08', hour='08');

Drop a partition (for an internal table both metadata and data are deleted; for an external table only the metadata):
ALTER TABLE day_hour_table DROP PARTITION (dt='2008-08-08', hour='09');

Load data into a specific partition:
1. From HDFS: LOAD DATA INPATH '/user/pv.txt' INTO TABLE day_hour_table PARTITION (dt='2008-08-08', hour='08');
2. From the local filesystem: LOAD DATA LOCAL INPATH '/user/hua/*' INTO TABLE day_table PARTITION (dt='2010-07-07');

If partition data was placed into HDFS beforehand and Hive does not recognize it, register the existing partition directories directly:
MSCK REPAIR TABLE tablename;
Dynamic partitions
Enable dynamic partitioning:
set hive.exec.dynamic.partition=true;            -- default: true
set hive.exec.dynamic.partition.mode=nonstrict;  -- default: strict (at least one partition column must be static)

Related parameters:
set hive.exec.max.dynamic.partitions.pernode;  -- max dynamic partitions allowed per MR node (default 100)
set hive.exec.max.dynamic.partitions;          -- max dynamic partitions allowed across all MR nodes (default 1000)
set hive.exec.max.created.files;               -- max files all MR jobs may create (default 100000)

Load data (the dynamic partition columns must come last in the SELECT, in partition order):
from psn21
insert overwrite table psn22 partition(age, sex)
select id, name, likes, address, age, sex
distribute by age, sex;
Hive user-defined functions (UDFs)
To develop a Hive UDF, you only need to extend the UDF class and implement an evaluate method. Example:

package com.hrj.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public class helloUDF extends UDF {
    public String evaluate(String str) {
        try {
            return "HelloWorld " + str;
        } catch (Exception e) {
            return null;
        }
    }
}

Calling a custom function: compile the Java file into helloudf.jar, then:

hive> add jar helloudf.jar;
hive> create temporary function helloworld as 'com.hrj.hive.udf.helloUDF';
hive> select helloworld(t.col1) from t limit 10;
hive> drop temporary function helloworld;

Notes:
1. helloworld is a temporary function, so the add jar and create temporary function steps must be repeated in every new Hive session.
2. A UDF maps one row in to one value out. For many-rows-in, one-value-out aggregation, implement a UDAF instead.
Hive bucketing
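The source article breaks off here with no body under this heading. As a minimal sketch of bucketing (the CLUSTERED BY ... INTO num_buckets BUCKETS clause from the DDL grammar above), assuming an illustrative table psn_bucket and a source table psn21, neither of which is defined in the source:

```sql
-- Assumption: on older Hive versions, bucketed inserts must be enforced explicitly:
set hive.enforce.bucketing=true;

-- Rows are assigned to one of 4 bucket files by hash(age) % 4:
CREATE TABLE psn_bucket (id INT, name STRING, age INT)
CLUSTERED BY (age) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Populate the bucketed table via INSERT ... SELECT (LOAD DATA does not bucket):
INSERT INTO TABLE psn_bucket SELECT id, name, age FROM psn21;

-- Bucket sampling: read bucket 1 out of 4:
SELECT * FROM psn_bucket TABLESAMPLE (BUCKET 1 OUT OF 4 ON age);
```

Bucketing spreads rows across a fixed number of files by column hash, which supports sampling and bucketed map-side joins; unlike partitioning, it does not create per-value directories.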
Source: oschina
Link: https://my.oschina.net/u/3734816/blog/3155899