Below are basic Hive exercises, continuously updated.
How many ways are there to export data from Hive, and how is each done?
1.insert
# insert overwrite can export to the local filesystem or to HDFS; the output
# format (field delimiter) can also be specified.

# Export to the local filesystem
0: jdbc:hive2://node01:10000> insert overwrite local directory '/kkb/install/hivedatas/stu3' select * from stu;
INFO  : Starting Job = job_1573910690864_0002, Tracking URL = http://node01:8088/proxy/application_1573910690864_0002/
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
INFO  : Ended Job = job_1573910690864_0002
INFO  : Copying data to local directory /kkb/install/hivedatas/stu3 from hdfs://node01:8020/tmp/hive/anonymous/2d04ba8e-9799-4a31-a93d-557db4086e81/hive_2019-11-16_22-19-32_776_5008666227900564137-1/-mr-10000
INFO  : OK
No rows affected (10.383 seconds)

# Check the local file
[hadoop@node01 /kkb/install/hivedatas/stu3]$ cat 000000_0
1clyang

# Export to HDFS
0: jdbc:hive2://node01:10000> insert overwrite directory '/kkb/stu' select * from stu;
INFO  : Starting Job = job_1573910690864_0003, Tracking URL = http://node01:8088/proxy/application_1573910690864_0003/
INFO  : Ended Job = job_1573910690864_0003
INFO  : Moving data to: /kkb/stu from hdfs://node01:8020/kkb/stu/.hive-staging_hive_2019-11-16_22-24-06_937_5666063681275061436-1/-ext-10000
INFO  : OK

# Check HDFS
[hadoop@node01 /kkb/install/hivedatas/stu3]$ hdfs dfs -cat /kkb/stu/000000_0
1clyang

# A field delimiter can be specified; here the export goes to the local filesystem
0: jdbc:hive2://node01:10000> insert overwrite local directory '/kkb/install/hivedatas/stu4' row format delimited fields terminated by '@' select * from stu;
INFO  : Starting Job = job_1573910690864_0005, Tracking URL = http://node01:8088/proxy/application_1573910690864_0005/
INFO  : Ended Job = job_1573910690864_0005
INFO  : Copying data to local directory /kkb/install/hivedatas/stu4 from hdfs://node01:8020/tmp/hive/anonymous/2d04ba8e-9799-4a31-a93d-557db4086e81/hive_2019-11-16_22-31-20_415_1737902713220629568-1/-mr-10000
INFO  : OK

# Check the local file: fields are now separated by @
[hadoop@node01 /kkb/install/hivedatas/stu4]$ cat 000000_0
1@clyang
2. hadoop commands
Data stored through Hive lives on HDFS, so it can also be pulled straight from HDFS to the local filesystem with the get command.
hdfs dfs -get /user/hive/warehouse/student/student.txt /opt/bigdata/data
3. Shell redirection: overwrite or append
Use bin/hive -e 'SQL statement' or bin/hive -f script.sql and redirect the output, either overwriting or appending. The -e form is shown below; an SQL script is essentially just the same statements placed in a file.
# Overwrite (>)
[hadoop@node01 /kkb/install/hive-1.1.0-cdh5.14.2/bin]$ ./hive -e 'select * from db_hive.stu' > /kkb/install/hivedatas/student2.txt
Logging initialized using configuration in file:/kkb/install/hive-1.1.0-cdh5.14.2/conf/hive-log4j.properties
OK
Time taken: 6.966 seconds, Fetched: 1 row(s)

# Check the result
[hadoop@node01 /kkb/install/hivedatas]$ cat student2.txt
stu.id  stu.name
1       clyang

# Append (>>)
[hadoop@node01 /kkb/install/hive-1.1.0-cdh5.14.2/bin]$ ./hive -e 'select * from db_hive.stu' >> /kkb/install/hivedatas/student2.txt
OK
Time taken: 6.056 seconds, Fetched: 1 row(s)

# Check the result after appending
[hadoop@node01 /kkb/install/hivedatas]$ cat student2.txt
stu.id  stu.name
1       clyang
stu.id  stu.name
1       clyang
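The -f variant works the same way: the statement goes into a script file and the invocation redirects its output. A minimal sketch (the file name export_stu.sql is illustrative, not from the examples above):

```sql
-- Contents of a hypothetical script file export_stu.sql.
-- Run it with:  bin/hive -f export_stu.sql > /kkb/install/hivedatas/student2.txt
select * from db_hive.stu;
```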
4. export to HDFS
# Export
0: jdbc:hive2://node01:10000> export table stu to '/kkb/studentexport';
INFO  : Copying data from file:/tmp/hadoop/e951940a-bcb6-4cd4-be17-0baf5d13615f/hive_2019-11-05_09-43-30_802_7299251851779747447-1/-local-10000/_metadata to hdfs://node01:8020/kkb/studentexport
INFO  : Copying data from hdfs://node01:8020/user/hive/warehouse/db_hive.db/stu to hdfs://node01:8020/kkb/studentexport/data
INFO  : OK

# Check the data: export writes the table metadata plus a data directory
[hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /kkb/studentexport
Found 2 items
-rwxr-xr-x   3 anonymous supergroup       1330 2019-11-05 09:43 /kkb/studentexport/_metadata
drwxr-xr-x   - anonymous supergroup          0 2019-11-05 09:43 /kkb/studentexport/data
[hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /kkb/studentexport/data
Found 1 items
-rwxr-xr-x   3 anonymous supergroup          9 2019-11-05 09:43 /kkb/studentexport/data/000000_0
[hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -cat /kkb/studentexport/data/000000_0
1clyang
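Because export captures both metadata and data, the exported directory can later be restored with Hive's import statement. A hedged sketch, assuming a target table named stu_imported (an illustrative name) that does not yet exist:

```sql
-- Re-create the table, schema and data included, from the exported directory.
-- stu_imported is a hypothetical table name.
import table stu_imported from '/kkb/studentexport';
```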
What is the difference between partitioning and bucketing?
Partitioning works at the directory level: files are stored in separate folders, one per partition. Bucketing works at the file level: one dataset is split into several file segments by hashing a chosen column and taking the remainder. Each has its own use cases:
(1) Partitioning suits storing data by date, day, or hour, so later queries can locate the relevant data quickly instead of running a slow full-table scan.
(2) Bucketing is finer-grained storage: with n buckets specified, the data is split into n files, and tablesample(bucket x out of y) can be used to sample specific buckets quickly.
A partitioned table may additionally be bucketed within each partition.
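As a sketch of the bucketing side of this comparison (table and column names are illustrative, not from the examples above):

```sql
-- A table whose data is split into 4 files by hash(id) % 4
create table stu_buck(id int, name string)
clustered by(id) into 4 buckets
row format delimited fields terminated by '\t';

-- Sample the table: out of 4 buckets, read the 1st
select * from stu_buck tablesample(bucket 1 out of 4 on id);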
If data is uploaded directly into a partition directory on HDFS, how can the partitioned table be associated with that data?
When the partitioned table is created first and data is then loaded into a partition, the data lands in the corresponding partition directory and queries work normally. But if the data is uploaded into a pre-created partition directory first and the table is created afterwards, queries return nothing, because the mapping between the partition data and the Hive table has not yet been established in the metastore. A repair command fixes this; there are two other methods as well.
Method 1: msck repair table <table_name>
Prepare the partition directory in advance and upload the data.
[hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /mystudentdatas/month=11/
Found 1 items
-rw-r--r--   3 hadoop supergroup        199 2019-11-17 12:36 /mystudentdatas/month=11/student.csv
Create the table.
0: jdbc:hive2://node01:10000> create table student_partition_me(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '\t' location '/mystudentdatas';
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : OK
Repair the table with msck; after the repair the mapping is established and the data can be queried.
# Repair the table
0: jdbc:hive2://node01:10000> msck repair table student_partition_me;
INFO  : OK
No rows affected (0.311 seconds)

# Query; the last column is the partition column month
0: jdbc:hive2://node01:10000> select id,name,year,gender,month from student_partition_me;
+-----+-------+-------------+---------+--------+--+
| id  | name  |    year     | gender  | month  |
+-----+-------+-------------+---------+--------+--+
| 01  | 赵雷  | 1990-01-01  | 男      | 11     |
| 02  | 钱电  | 1990-12-21  | 男      | 11     |
| 03  | 孙风  | 1990-05-20  | 男      | 11     |
| 04  | 李云  | 1990-08-06  | 男      | 11     |
| 05  | 周梅  | 1991-12-01  | 女      | 11     |
| 06  | 吴兰  | 1992-03-01  | 女      | 11     |
| 07  | 郑竹  | 1989-07-01  | 女      | 11     |
| 08  | 王菊  | 1990-01-20  | 女      | 11     |
+-----+-------+-------------+---------+--------+--+
8 rows selected (0.214 seconds)
Method 2: alter table <table_name> add partition(col=xxx)
Upload the data to HDFS.
# Note: the HDFS data directory is /studentdatas this time
[hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /studentdatas/month=12/
Found 1 items
-rw-r--r--   3 hadoop supergroup        199 2019-11-17 16:51 /studentdatas/month=12/student.csv
Create the table.
0: jdbc:hive2://node01:10000> create table student_partition_pa(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '\t' location '/studentdatas';
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : OK
Register the partition with alter table.
0: jdbc:hive2://node01:10000> alter table student_partition_pa add partition(month='12');
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : OK
Query the data; it works.
0: jdbc:hive2://node01:10000> select id,name,year,gender,month from student_partition_pa;
+-----+-------+-------------+---------+--------+--+
| id  | name  |    year     | gender  | month  |
+-----+-------+-------------+---------+--------+--+
| 01  | 赵雷  | 1990-01-01  | 男      | 12     |
| 02  | 钱电  | 1990-12-21  | 男      | 12     |
| 03  | 孙风  | 1990-05-20  | 男      | 12     |
| 04  | 李云  | 1990-08-06  | 男      | 12     |
| 05  | 周梅  | 1991-12-01  | 女      | 12     |
| 06  | 吴兰  | 1992-03-01  | 女      | 12     |
| 07  | 郑竹  | 1989-07-01  | 女      | 12     |
| 08  | 王菊  | 1990-01-20  | 女      | 12     |
+-----+-------+-------------+---------+--------+--+
8 rows selected (0.162 seconds)
Method 3: load data inpath '<hdfs path>' into table <table_name> partition(col='xxx')
Upload the data to HDFS.
[hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /
Found 13 items
# person.txt has been uploaded to HDFS
-rw-r--r--   3 hadoop supergroup         68 2019-11-17 17:09 /person.txt
Create the table.
0: jdbc:hive2://node01:10000> create table person_partition(name string,citys array<string>) partitioned by(age string) row format delimited fields terminated by '\t' collection items terminated by ',' location '/persondatas';
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : OK
Load the HDFS file into the partition directory.
0: jdbc:hive2://node01:10000> load data inpath '/person.txt' into table person_partition partition(age='25');
INFO  : Loading data to table myhive.person_partition partition (age=25) from hdfs://node01:8020/person.txt
INFO  : Partition myhive.person_partition{age=25} stats: [numFiles=1, numRows=0, totalSize=68, rawDataSize=0]
INFO  : OK
Query the data; it works.
0: jdbc:hive2://node01:10000> select * from person_partition;
+------------------------+----------------------------------------------+-----------------------+--+
| person_partition.name  |            person_partition.citys            | person_partition.age  |
+------------------------+----------------------------------------------+-----------------------+--+
| yang                   | ["beijing","shanghai","tianjin","hangzhou"]  | 25                    |
| messi                  | ["changchu","chengdu","wuhan"]               | 25                    |
+------------------------+----------------------------------------------+-----------------------+--+
Can data be loaded into a bucketed table directly with load?
A bucketed table must split its data into separate HDFS files by hashing a column and taking the remainder, and that split has to be computed from the column values of an ordinary intermediate table. So load cannot be used directly: a plain load would leave a single, unsplit file on HDFS. The files of a bucketed table also show that they are the output of a MapReduce job rather than the original file format, which again indicates that data cannot simply be placed on HDFS directly.
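The indirect route described above can be sketched as follows. Table names and the path /stu.txt are illustrative; on older Hive versions hive.enforce.bucketing must be enabled so the insert uses one reducer per bucket:

```sql
-- 1. A plain staging table that load can target directly
create table stu_tmp(id int, name string)
row format delimited fields terminated by '\t';
load data inpath '/stu.txt' into table stu_tmp;

-- 2. The bucketed table; rows are distributed by hash(id) % 4
create table stu_buck(id int, name string)
clustered by(id) into 4 buckets;

-- 3. Populate it through a query, so a MapReduce job splits the data into buckets
set hive.enforce.bucketing=true;
insert overwrite table stu_buck select id, name from stu_tmp;
```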
Partitions improve query efficiency in Hive; are more partitions always better, and why?
Hive queries ultimately run as MapReduce jobs. With too many partitions, the same volume of data is scattered across many more small files and blocks, which means much more metadata (block locations, sizes, and so on) and therefore heavy pressure on the NameNode.
Moreover, since Hive SQL is translated into MapReduce jobs, each small partition file corresponds to a task, and each task to a JVM instance. Too many partitions spawn large numbers of JVM instances, and the frequent creation and destruction of JVMs degrades overall system performance.
Reference:
(1)https://www.cnblogs.com/tele-share/p/9829515.html