Hive

Reg : Efficiency among query optimizers in hive

♀尐吖头ヾ Submitted on 2021-02-18 18:12:30

Question: After reading about query-optimization techniques, I came across the following: 1. Indexing (bitmap and B-tree) 2. Partitioning 3. Bucketing. I understand the difference between partitioning and bucketing and when to use each, but I'm still confused about how indexes actually work. Where is the metadata for an index stored? Is it the namenode that stores it? With partitions or buckets we can actually see multiple directories in HDFS, which explains the query performance
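In short: the index definition lives in the Hive metastore, and the index data itself lives in an ordinary index table on HDFS; the NameNode only holds HDFS file-system metadata, same as for any other table. A minimal DDL sketch (table and column names are illustrative; note that Hive's index feature was removed in Hive 3.0):

```sql
-- Create a compact index on an illustrative employees table (Hive 1.x/2.x).
-- The index data is materialized into a separate table stored on HDFS.
CREATE INDEX idx_employees_id
ON TABLE employees (id)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

-- DEFERRED REBUILD means the index is empty until explicitly rebuilt:
ALTER INDEX idx_employees_id ON employees REBUILD;
```

Because the index is just another HDFS-backed table, you can inspect its files in the warehouse directory the same way you inspect partition directories.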

Writing columns having NULL as some string using OpenCSVSerde - HIVE

北城以北 Submitted on 2021-02-18 17:47:15

Question: I'm using 'org.apache.hadoop.hive.serde2.OpenCSVSerde' to write Hive table data. CREATE TABLE testtable ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( "separatorChar" = ",", "quoteChar" = "'" ) STORED AS TEXTFILE LOCATION '<location>' AS select * from foo; So if the 'foo' table has empty strings in it, e.g. '1','2','', the empty strings are written as-is to the text file; the data in the file reads '1','2',''. But if 'foo' contains null values, e.g. '1',
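OpenCSVSerde does not honor Hive's usual NULL-formatting hooks, so a common workaround is to substitute the desired string at write time. A hedged sketch (column names col1/col2 are illustrative assumptions, not from the original post):

```sql
-- Replace NULLs with a literal string before the serde sees them.
-- OpenCSVSerde treats every column as STRING, so NVL on strings is safe here.
CREATE TABLE testtable
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "'"
)
STORED AS TEXTFILE
LOCATION '<location>'
AS SELECT NVL(col1, 'NULL') AS col1,
          NVL(col2, 'NULL') AS col2
   FROM foo;
```

COALESCE works identically if you prefer the standard-SQL spelling.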

Spark compression when writing to external Hive table

情到浓时终转凉″ Submitted on 2021-02-18 11:28:28

Question: I'm inserting into an external Hive Parquet table from Spark 2.1 (using df.write.insertInto(...)). By setting e.g. spark.sql("SET spark.sql.parquet.compression.codec=GZIP") I can switch between SNAPPY, GZIP, and uncompressed, and I can verify that the file size (and file-name ending) is influenced by these settings; I get a file named e.g. part-00000-5efbfc08-66fe-4fd1-bebb-944b34689e70.gz.parquet. However, if I work with a partitioned Hive table, this setting has no effect; the file size is
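One plausible explanation, sketched here as an assumption rather than a confirmed diagnosis: writes into a partitioned Hive table may go through the Hive serde path rather than Spark's native Parquet writer, in which case the Spark-only codec setting is ignored. Pinning the codec on the table itself via Parquet's table-level property can sidestep that (table name is illustrative):

```sql
-- Set the compression codec as a table property, so any writer that
-- respects Hive table properties (including the Hive serde path) uses it.
ALTER TABLE my_partitioned_table
SET TBLPROPERTIES ('parquet.compression' = 'GZIP');
```

After this, newly written partition files should carry the table-level codec regardless of the Spark session setting.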

How can I find last modified timestamp for a table in Hive?

烈酒焚心 Submitted on 2021-02-18 10:45:06

Question: I'm trying to fetch the last-modified timestamp of a table in Hive.
Answer 1: Use the following command: show TBLPROPERTIES table_name ('transient_lastDdlTime');
Answer 2: Get the transient_lastDdlTime from your Hive table with SHOW CREATE TABLE table_name; then paste the transient_lastDdlTime value into the query below to get it as a timestamp: SELECT CAST(from_unixtime(your_transient_lastDdlTime_value) AS timestamp);
Answer 3: You can get the timestamp by executing describe formatted table_name
Answer 4: you can
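The two steps from the answers above can be run back to back; a minimal sketch (table_name and the epoch value are placeholders you substitute from your own output):

```sql
-- Step 1: read the DDL-time property (an epoch value in seconds).
SHOW TBLPROPERTIES table_name ('transient_lastDdlTime');

-- Step 2: convert the returned epoch value to a readable timestamp.
-- 1613640306 is purely an illustrative value.
SELECT CAST(from_unixtime(1613640306) AS timestamp);
```

Note that transient_lastDdlTime tracks the last DDL change, which is not always the same as the last data modification.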

Data Warehouses and Data Analysis

余生颓废 Submitted on 2021-02-18 07:19:12

1. Introduction to data warehouses. The English term is Data Warehouse, abbreviated DW or DWH. The purpose of a data warehouse is to build an integrated, analysis-oriented data environment that provides the enterprise with Decision Support; it is created for analytical reporting and decision-support purposes. A data warehouse does not "produce" any data itself, nor does it "consume" any data of its own: the data comes from external sources and is opened up to external applications, which is why it is called a "warehouse" and not a "factory". 2. Definition of a data warehouse. A data warehouse is a Subject-Oriented, Integrated, Non-Volatile, and Time-Variant collection of data used to support management decisions. 2.1 Subject-oriented: data in the warehouse is organized by subject area. A subject is an abstract concept, the key aspect users care about when using the warehouse for decisions; one subject usually relates to several operational systems. 2.2 Integrated: to meet the needs of decision analysis, source data scattered across systems is extracted, filtered, cleaned, and consolidated into the warehouse. 2.3 Non-volatile: the data is relatively stable; the warehouse only appends new data, with no updates or deletes. It reflects historical change and serves mainly query and analysis. 2.4 Time-variant: warehouse data generally carries a time attribute and changes over time, continually generating new snapshots of each subject. 4.

Installing Cloudera Impala without Cloudera Manager

倾然丶 夕夏残阳落幕 Submitted on 2021-02-18 06:59:59

Question: Kindly provide a link for installing Impala on Ubuntu without Cloudera Manager. I was unable to install it via the official link; I get "Unable to locate package impala" with these commands: sudo apt-get install impala # Binaries for daemons sudo apt-get install impala-server # Service start/stop script sudo apt-get install impala-state-store # Service start/stop script
Answer 1: First you need to get the list of packages and store it in /etc/apt/sources.list.d/, then update the packages, then you

What Screening Resumes Taught Me About Writing a Standout Resume

你离开我真会死。 Submitted on 2021-02-18 06:13:14

1. How are resumes screened? The company had too many resumes to get through today, so I helped screen them. My screening principles were: (1) Years of experience: first, look at how long the candidate has worked. (2) Skills: given the years of experience, look at which skills are listed, which stand out, and whether they match the experience. (3) Projects: given the skills, check the project experience for the skill points I care about, and whether the claimed skills actually show up in the projects. (4) Tiering: after the steps above, decide which resumes are definite and which are pending. (5) Re-filtering: go through the pending resumes again and drop the ones that don't feel right; this part comes down to personal judgment. 2. How should a resume be written? Given my screening principles, how do you write a resume that goes straight onto the interview list? (1) File name: name the file clearly, e.g. [Zhang San - Senior Java - 4 years of experience]. (2) Personal information: put it in the first section, at a glance [name and phone number should be prominent; age, education, school, work experience, and place of residence are key; hometown and photo are optional]. (3) Skills: Junior candidates should write plenty, in concrete detail; it makes a better first impression. Between two resumes with the same skills, when the pile is deep, few screeners will stop to infer the hidden information behind a one-line summary; there's no time, and more detail simply reads better. Mid/senior candidates can summarize the basics, and this is where you show off the impressive skills: multithreading, high concurrency

Hive: Queries

大城市里の小女人 Submitted on 2021-02-17 16:18:54

The SELECT ... FROM statement: SELECT is SQL's projection operator, and FROM identifies the table being queried. CREATE TABLE employees( name STRING, salary FLOAT, subordinates ARRAY<STRING>, deductions MAP<STRING, FLOAT> ) PARTITIONED BY (country STRING, state STRING); 1. Suppose one state has 4 employees; the query is: eg: hive> SELECT name, salary FROM employees; John Doe 100000.0 Mary Smith 80000.0 Todd Jones 70000.0 Bill King 60000.0 2. Table aliases are not very useful in this query, but they become very useful once joins are involved: eg: hive> SELECT e.name, e.salary FROM employees e; 3. When a queried column is a collection type, Hive renders it as JSON in the output; subordinates is an array, so the output looks like: eg: hive> SELECT name, subordinates FROM employees; John Doe
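Continuing the examples above, collection columns can also be indexed directly in the SELECT list; a short sketch against the same illustrative employees table (the map key 'Federal Taxes' is an assumed example value):

```sql
-- First element of the subordinates ARRAY (0-based indexing):
SELECT name, subordinates[0] FROM employees;

-- Value for a single key of the deductions MAP:
SELECT name, deductions['Federal Taxes'] FROM employees;
```

An out-of-range array index or a missing map key yields NULL rather than an error.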

Hive SQL: Querying the Top N per Group

冷暖自知 Submitted on 2021-02-17 14:17:08

Grouped sort, taking the top n per group. Requirement: group by course and find the two highest scores in each course. The data file is as follows; the first column, no, is the student number, the second, course, is the course, and the third, score, is the score.

mysql> select * from lesson;
+-------+---------+-------+
| no    | course  | score |
+-------+---------+-------+
| N0101 | Marth   |   100 |
| N0102 | English |    12 |
| N0102 | Chinese |    55 |
| N0102 | History |    58 |
| N0102 | Marth   |    25 |
| N0103 | English |   100 |
| N0103 | Chinese |    87 |
| N0103 | History |    88 |
| N0103 | Marth   |    72 |
| N0104 | English |    20 |
| N0104 | Chinese |    60 |
| N0104 | History |    88 |
| N0104 | Marth   |    56 |
| N0105 | English |    56 |
| N0105 | Chinese |    88 |
| N0105 | History |    88 |
| N0201 |
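The grouped top-n itself can be sketched with a window function, which works in both Hive and recent MySQL (column and table names as in the lesson table above):

```sql
-- Rank scores within each course, highest first, then keep the two best.
SELECT no, course, score
FROM (
  SELECT no, course, score,
         ROW_NUMBER() OVER (PARTITION BY course ORDER BY score DESC) AS rn
  FROM lesson
) ranked
WHERE rn <= 2;
```

ROW_NUMBER() returns exactly two rows per course even when scores tie; use RANK() or DENSE_RANK() instead if tied scores should all be kept.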