Hive

Improving Hive SQL Query Efficiency: Implementing the Analyze Approach

风格不统一 Submitted on 2021-02-17 13:50:16
0. Introduction

Analyze, i.e. analyzing a table (also called computing statistics), is a built-in Hive operation that can be run to collect metadata about a table. This can greatly improve query times on that table, because it gathers the row count, file count and file size (in bytes) of the data that makes up the table and hands them to the query planner before execution.

1. How do you analyze a table?

Basic analyze statement:

ANALYZE TABLE my_database_name.my_table_name COMPUTE STATISTICS;

This is the basic analyze statement; it works whether or not the table is partitioned. If your table is partitioned, you should run it regularly.

Analyze a specific partition:

ANALYZE TABLE my_database_name.my_table_name PARTITION (YEAR=2019, MONTH=5, DAY=12) COMPUTE STATISTICS;

This is a finer-grained analyze statement. It collects metadata for the specified partition and stores it in the Hive Metastore for query optimization. The information includes, per column, the number of distinct values, the number of NULL values, the average column size, the average or sum of all values in the column (if the type is numeric), and value percentiles.

Analyze columns:

ANALYZE TABLE my_database_name.my_table_name COMPUTE STATISTICS FOR COLUMNS column1, column2, column3;

This collects metadata on the specified columns.
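To verify what was collected, the statistics written by the statements above can be inspected from Hive itself. A hedged sketch (not part of the original post; the table and column names are the same hypothetical ones used above, and the exact DESCRIBE syntax can vary by Hive version):

DESCRIBE FORMATTED my_database_name.my_table_name;          -- table-level stats such as numRows, numFiles, totalSize
DESCRIBE FORMATTED my_database_name.my_table_name column1;  -- column-level stats such as min, max, num_nulls, distinct_count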

Check if a hive table is partitioned on a given column

元气小坏坏 Submitted on 2021-02-17 04:45:04
Question: I have a list of Hive tables, some of which are partitioned. Given a column, I need to check whether or not a particular table is partitioned on that column. I have searched and found that desc formatted tablename returns all the details of a table, but since I have to iterate over all the tables to build the list, desc formatted would not help. Is there any other way this can be done?

Answer 1: You can connect directly to the metastore and query it: metastore=# select d."NAME" as DATABASE, t."TBL
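To make the answer's direction more concrete, here is a sketch of the kind of metastore query it appears to be building toward, assuming a PostgreSQL-backed metastore with the standard DBS, TBLS and PARTITION_KEYS tables (exact table and column names can vary across metastore schema versions; the column being checked is hypothetical):

select d."NAME" as database_name,
       t."TBL_NAME" as table_name,
       p."PKEY_NAME" as partition_column
from "DBS" d
join "TBLS" t on t."DB_ID" = d."DB_ID"
join "PARTITION_KEYS" p on p."TBL_ID" = t."TBL_ID"
where p."PKEY_NAME" = 'my_column';   -- hypothetical column to check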

How to query data from gz file of Amazon S3 using Qubole Hive query?

旧时模样 Submitted on 2021-02-16 15:35:34
Question: I need to get specific data out of a gz file. How do I write the SQL? Can I just query it like a database table: Select * from gz_File_Name where key = 'keyname' limit 10? It always comes back with an error.

Answer 1: You need to create a Hive external table over this file location (folder) to be able to query it using Hive. Hive will recognize the gzip format. Like this: create external table hive_schema.your_table ( col_one string, col_two string ) stored as textfile --specify your file type, or use serde LOCATION 's3:/
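A minimal end-to-end sketch of the pattern the answer describes, assuming a hypothetical comma-delimited layout and a hypothetical bucket/prefix s3://my-bucket/my-prefix/ containing the .gz files (Hive decompresses gzipped text files transparently when reading):

create external table hive_schema.your_table (
    col_one string,
    col_two string
)
row format delimited fields terminated by ','   -- adjust to the actual file layout, or use a SerDe
stored as textfile
location 's3://my-bucket/my-prefix/';

-- then query it like any other table
select * from hive_schema.your_table where col_one = 'keyname' limit 10;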

Calculate difference between start_time and end_time in seconds from unix_time yyyy-MM-dd HH:mm:ss

妖精的绣舞 Submitted on 2021-02-16 14:52:57
Question: I'm still learning SQL, and I found a couple of solutions for SQL Server and Postgres, but they don't seem to work in HUE. DATEDIFF only allows me to calculate the difference in days; seconds and minutes are not available. Help is very welcome. I was able to split the timestamp with substring_index, but then I can't find the right approach to compare and subtract start_time from end_time in order to obtain an accurate count of seconds. I can't find time functions, so I'm assuming I should
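A common Hive approach to this (a sketch, not taken from the truncated post): convert both values to epoch seconds with unix_timestamp() and subtract them; the table and column names below are hypothetical:

select unix_timestamp(end_time, 'yyyy-MM-dd HH:mm:ss')
     - unix_timestamp(start_time, 'yyyy-MM-dd HH:mm:ss') as duration_seconds
from my_events;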

A Discussion of Dimensional Modeling in Data Warehousing

我是研究僧i Submitted on 2021-02-16 10:12:18
0x00 Preface

The content below is a summary of what I have learned in study and at work. The conceptual parts mostly come from books; the practical parts mostly come from my own work and personal understanding. Given my limited experience, mistakes are inevitable; corrections are welcome!

Overview

A data warehouse covers a lot of ground: it can include architecture, modeling and methodology. Mapped onto concrete work, it can include the following:

A data architecture built around components such as Hadoop, Spark and Hive.
Various data modeling methods, such as dimensional modeling.
Supporting systems such as scheduling, metadata, ETL and visualization systems.

Leaving aside how broad the scope of a data warehouse really is, within the data warehouse the data model occupies an irreplaceable, central position. Why model the data warehouse at all?

A data model is a way of organizing and storing data; it emphasizes storing data sensibly from the perspectives of the business, data access and data use. With a model suited to the business and to the underlying storage environment, a big data system gains the following benefits:

Performance: a good data model helps us query the data we need quickly and reduces data I/O throughput.
Cost: a good data model greatly reduces unnecessary data redundancy and enables reuse of computed results, dramatically lowering the storage and compute costs of a big data system.
Efficiency: a good data model greatly improves the experience of working with data and raises the efficiency of data use.
Quality: a good data model reduces inconsistencies in how metrics are defined and lowers the chance of calculation errors.

Therefore, the sections below analyze in detail the most typical representative of data modeling, dimensional modeling, covering both its theory and its practical use.

Article structure

This article proceeds in the following order:

Big Data: Hive

…衆ロ難τιáo~ Submitted on 2021-02-16 09:45:11
Hive Author: Lijb Email: lijb1121@163.com

Introduction to Hive: Hive is a data warehouse tool built on top of Hadoop that can be used for data extraction, transformation and loading (ETL). It is a mechanism for storing, querying and analyzing large-scale data stored in Hadoop: it maps structured data files to database tables, provides simple SQL query capabilities, and translates SQL statements into MapReduce jobs for execution.

Introduction to ETL: What is ETL (Extract-Transform-Load)?

1. The term describes the process of extracting data from source systems, transforming it and loading it into a target; ETL is most often used in data warehousing, and an ETL tool is essentially a data cleansing tool.
2. To implement ETL, you first have to implement the transformation step, which shows up in the following areas (see the sketch after this list for item 3):
   1. Null handling: capture null fields and either load them or replace them with values carrying another meaning, and route rows to different target databases depending on which fields are null.
   2. Normalizing data formats: define format constraints on fields, and customize the load format of times, numbers, characters and other data from the source.
   3. Splitting data: break fields apart according to business needs. For example, the calling number 861082585313-8148 can be split into an area code and a phone number.
   4. Validating data correctness: use Lookup together with the splitting capability to validate data. For example, after splitting the calling number 861082585313-8148 into an area code and a phone number, Lookup can return the calling gateway or the calling area recorded by the switch
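As a concrete illustration of the data-splitting step above, here is a hedged HiveQL sketch. It assumes the example number format 861082585313-8148 (country code 86, then a two-digit area code for this example, the subscriber number, and an extension after the dash); the table and column names are hypothetical:

select substr(calling_number, 1, 2)  as country_code,   -- '86'
       substr(calling_number, 3, 2)  as area_code,      -- '10' (real area codes vary in length)
       split(calling_number, '-')[0] as full_number,    -- '861082585313'
       split(calling_number, '-')[1] as extension       -- '8148'
from call_detail_records;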

How to convert “2019-11-02T20:18:00Z” to timestamp in HQL?

拟墨画扇 Submitted on 2021-02-15 07:28:06
Question: I have the datetime string "2019-11-02T20:18:00Z". How can I convert it into a timestamp in Hive HQL?

Answer 1: If you want to preserve milliseconds, then remove the Z, replace the T with a space and convert to timestamp: select timestamp(regexp_replace("2019-11-02T20:18:00Z", '^(.+?)T(.+?)Z$','$1 $2')); Result: 2019-11-02 20:18:00 It also works with milliseconds: select timestamp(regexp_replace("2019-11-02T20:18:00.123Z", '^(.+?)T(.+?)Z$','$1 $2')); Result: 2019-11-02 20:18:00.123 Using from_unixtime(unix
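The truncated last sentence appears to start describing the from_unixtime(unix_timestamp(...)) alternative; a sketch of that approach with an explicit SimpleDateFormat pattern might look like the following (note it drops fractional seconds, and the rendered result depends on the session time zone):

select from_unixtime(unix_timestamp("2019-11-02T20:18:00Z", "yyyy-MM-dd'T'HH:mm:ss'Z'"));
-- expected result (session time zone dependent): 2019-11-02 20:18:00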

What happens when a hive insert fails halfway?

岁酱吖の Submitted on 2021-02-15 03:13:21
Question: Suppose an insert is expected to load 100 records into Hive, 40 records have been inserted, and then the insert fails for some reason. Will the transaction roll back completely, undoing the 40 records that were inserted? Or will we see 40 records in the Hive table even after the insert query failed?

Answer 1: The operation is atomic (even for a non-ACID table): if you insert or rewrite data using HiveQL, it writes the data into a temporary location, and only if the command succeeds are the files moved to the