The complete Hive DDL CREATE TABLE syntax
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name  -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...)
    [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)  -- (Note: Available in Hive 0.10.0 and later)
    ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
    [STORED AS DIRECTORIES]]
  [
    [ROW FORMAT row_format]
    [STORED AS file_format]
      | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]  -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];  -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
Hive internal (managed) table: CREATE TABLE [IF NOT EXISTS] table_name
When the table is dropped, both the metadata and the data are deleted.
Hive external table: CREATE EXTERNAL TABLE [IF NOT EXISTS] table_name LOCATION hdfs_path
Dropping an external table deletes only the metadata in the metastore; the table data in HDFS is kept.
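To make the drop behavior concrete, a minimal sketch (the table names and the HDFS path here are illustrative, not from the source):

```sql
-- Managed (internal) table: Hive owns the data files under its warehouse directory.
CREATE TABLE IF NOT EXISTS logs_managed (id INT, msg STRING);

-- External table: Hive records only the schema; the files at the given
-- HDFS path are not owned by Hive.
CREATE EXTERNAL TABLE IF NOT EXISTS logs_external (id INT, msg STRING)
LOCATION '/data/logs';

DROP TABLE logs_managed;   -- removes the metadata AND the files in the warehouse
DROP TABLE logs_external;  -- removes only the metastore entry; /data/logs remains
```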
Hive CREATE TABLE examples
CREATE TABLE person (
  id INT,
  name STRING,
  age INT,
  likes ARRAY<STRING>,
  address MAP<STRING,STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','            -- columns in each row separated by ','
  COLLECTION ITEMS TERMINATED BY '-'  -- collection (array) items separated by '-'
  MAP KEYS TERMINATED BY ':'          -- map keys and values separated by ':'
  LINES TERMINATED BY '\n';           -- one record per line

-- Copy only the schema of an existing table:
CREATE TABLE empty_key_value_store LIKE key_value_store;

-- Create a table from a query result (CTAS):
CREATE TABLE new_key_value_store AS SELECT columnA, columnB FROM key_value_store;
Hive partitions
Static partitions
-- Single-partition table, partitioned by day. The table has three columns
-- (id, content, dt); dt is the partition column and names the partition directory:
create table day_table (id int, content string) partitioned by (dt string);

-- Two-level partition table, by day and hour. dt and hour are added as partition
-- columns: dt is the top-level directory, hour the subdirectory:
create table day_hour_table (id int, content string) partitioned by (dt string, hour string);

Add a partition after the table has been created:
ALTER TABLE day_hour_table ADD PARTITION (dt='2008-08-08', hour='08');

Drop a partition (for an internal table both metadata and data are deleted; for an external table only the metadata):
ALTER TABLE day_hour_table DROP PARTITION (dt='2008-08-08', hour='09');

Load data into a specific partition:
1. From HDFS: LOAD DATA INPATH '/user/pv.txt' INTO TABLE day_hour_table PARTITION (dt='2008-08-08', hour='08');
2. From the local filesystem: LOAD DATA LOCAL INPATH '/user/hua/*' INTO TABLE day_table PARTITION (dt='2010-07-07');

If partition data was placed into HDFS beforehand and Hive does not recognize it, register the existing partition directories directly:
MSCK REPAIR TABLE tablename;
Dynamic partitions
Enable dynamic partitioning:
set hive.exec.dynamic.partition=true;            -- default: true
set hive.exec.dynamic.partition.mode=nonstrict;  -- default: strict (at least one partition column must be static)

Related parameters:
set hive.exec.max.dynamic.partitions.pernode;  -- max dynamic partitions allowed per MR node (default 100)
set hive.exec.max.dynamic.partitions;          -- max dynamic partitions allowed across all MR nodes (default 1000)
set hive.exec.max.created.files;               -- max files all MR jobs may create (default 100000)

Load data (the dynamic partition columns must come last in the SELECT, in partition order):
from psn21
insert overwrite table psn22 partition(age, sex)
select id, name, likes, address, age, sex
distribute by age, sex;
Hive user-defined functions (UDFs)
To develop a Hive UDF, you only need to extend the UDF class and implement an evaluate method. Example:

package com.hrj.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public class helloUDF extends UDF {
    public String evaluate(String str) {
        try {
            return "HelloWorld " + str;
        } catch (Exception e) {
            return null;
        }
    }
}

Calling a custom function: compile the Java file into helloudf.jar, then:

hive> add jar helloudf.jar;
hive> create temporary function helloworld as 'com.hrj.hive.udf.helloUDF';
hive> select helloworld(t.col1) from t limit 10;
hive> drop temporary function helloworld;

Notes:
1. helloworld is a temporary function, so the add jar and create temporary function steps must be repeated in every new Hive session.
2. A UDF maps one row in to one value out. For many-rows-in, one-value-out aggregation, implement a UDAF instead.
Hive bucketing
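The source article breaks off here with no body under this heading. As a minimal sketch of bucketing (the CLUSTERED BY ... INTO num_buckets BUCKETS clause from the DDL grammar above), assuming an illustrative table psn_bucket and a source table psn21, neither of which is defined in the source:

```sql
-- Assumption: on older Hive versions, bucketed inserts must be enforced explicitly:
set hive.enforce.bucketing=true;

-- Rows are assigned to one of 4 bucket files by hash(age) % 4:
CREATE TABLE psn_bucket (id INT, name STRING, age INT)
CLUSTERED BY (age) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Populate the bucketed table via INSERT ... SELECT (LOAD DATA does not bucket):
INSERT INTO TABLE psn_bucket SELECT id, name, age FROM psn21;

-- Bucket sampling: read bucket 1 out of 4:
SELECT * FROM psn_bucket TABLESAMPLE (BUCKET 1 OUT OF 4 ON age);
```

Bucketing spreads rows across a fixed number of files by column hash, which supports sampling and bucketed map-side joins; unlike partitioning, it does not create per-value directories.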
Source: oschina
Link: https://my.oschina.net/u/3734816/blog/3155899