hiveql

Hive SELECT statement to create an ARRAY of STRUCTS

点点圈 submitted on 2019-11-30 19:03:19
Question: I'm having trouble selecting into an ARRAY of STRUCTS in Hive. My source table looks like this:
    +-------------+--+
    | field       |
    +-------------+--+
    | id          |
    | fieldid     |
    | fieldlabel  |
    | fieldtype   |
    | answer_id   |
    | unitname    |
    +-------------+--+
This is survey data, where the id is the survey id, the four fields in the middle are response data, and the unitname is the business unit that the survey pertains to. I need to create an array of structs for all of the answers for each survey id. I
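A minimal sketch of one way to build such an array follows. It rests on assumptions the excerpt does not confirm: the table name `survey_responses` is hypothetical, `unitname` is taken to be constant within a survey id, and the Hive release in use must allow `collect_list` over struct values (older releases accepted only primitive types).

```sql
-- Hypothetical table name: survey_responses (the question does not give it).
-- Assumes unitname is the same for every row of a given survey id.
-- Assumes a Hive release whose collect_list accepts struct arguments.
SELECT
  id,
  unitname,
  collect_list(
    named_struct(
      'fieldid',    fieldid,
      'fieldlabel', fieldlabel,
      'fieldtype',  fieldtype,
      'answer_id',  answer_id
    )
  ) AS answers   -- one struct per response row, gathered per survey
FROM survey_responses
GROUP BY id, unitname;
```

`named_struct` takes alternating field names and values, so the resulting column is an array of structs whose field names match the source columns.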

Hive LEFT SEMI JOIN for 'NOT EXISTS'

余生长醉 submitted on 2019-11-30 18:24:46
I have two tables with a single key column. Keys in table a are a subset of all keys in table b. I need to select keys from table b that are NOT in table a. Here is a quote from the Hive manual: "LEFT SEMI JOIN implements the uncorrelated IN/EXISTS subquery semantics in an efficient way. As of Hive 0.13 the IN/NOT IN/EXISTS/NOT EXISTS operators are supported using subqueries so most of these JOINs don't have to be performed manually anymore. The restrictions of using LEFT SEMI JOIN is that the right-hand-side table should only be referenced in the join condition (ON-clause), but not in WHERE- or
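LEFT SEMI JOIN keeps rows that have a match, so it does not express "not in" by itself. Two common ways to get the keys of b that are absent from a are sketched below; the column name `key` is an assumption, since the excerpt does not name it.

```sql
-- Assumes both tables expose the key in a column named `key` (hypothetical).

-- Option 1: outer join, then keep the rows of b that found no partner in a.
SELECT b.key
FROM b
LEFT OUTER JOIN a ON b.key = a.key
WHERE a.key IS NULL;

-- Option 2 (Hive 0.13 or later): correlated NOT EXISTS subquery.
SELECT b.key
FROM b
WHERE NOT EXISTS (SELECT 1 FROM a WHERE a.key = b.key);
```

The outer-join form also works on releases older than 0.13, which is why it is listed first.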

Apache Hive: How to convert string to timestamp?

安稳与你 submitted on 2019-11-30 17:11:24
Question: I'm trying to convert the string in the REC_TIME column to a timestamp format in Hive. For example:
    Sun Jul 31 09:28:20 UTC 2016 => 2016-07-31 09:28:20
    SELECT xxx, UNIX_TIMESTAMP(REC_TIME, "E M dd HH:mm:ss z yyyy") FROM wlogs LIMIT 10;
When I execute the above SQL, it returns a NULL value.
Answer 1: Try this:
    select from_unixtime(unix_timestamp("Sun Jul 31 09:28:20 UTC 2016","EEE MMM dd HH:mm:ss zzz yyyy"));
This works fine if your Hive cluster uses the UTC timezone. Suppose your server is in CST; then you need
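As a sketch, the same pattern can be applied to the column from the question. The NULL in the original query most likely comes from the single `M`, which the Java date pattern treats as a numeric month and so cannot match "Jul"; the alias `rec_ts` is an illustrative name.

```sql
-- Same pattern as the answer, applied to the REC_TIME column of wlogs.
-- EEE = abbreviated day name, MMM = abbreviated month name, zzz = time zone name.
SELECT
  REC_TIME,
  from_unixtime(
    unix_timestamp(REC_TIME, 'EEE MMM dd HH:mm:ss zzz yyyy')
  ) AS rec_ts   -- illustrative alias; rendered in the session time zone
FROM wlogs
LIMIT 10;
```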

How to create an external Hive table with column typed Timestamp

丶灬走出姿态 submitted on 2019-11-30 16:29:27
I want to create an external Hive table from a text file in HDFS that contains epoch values. Let's say the file is located at /user/me/test.txt. Here's the file content:
    1354183921
    1354183922
I have Hive 0.8.1 installed and should be able to use the Timestamp type, so I created the table:
    hive> CREATE EXTERNAL TABLE test1 (epoch Timestamp) LOCATION '/user/me';
Then I queried the table:
    SELECT * FROM test1;
and got the following exception:
    Failed with exception java.io.IOException:java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
Have I missed anything when
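The exception says the text reader expects `yyyy-mm-dd hh:mm:ss[.fffffffff]` strings for a Timestamp column, which raw epoch seconds are not. One possible workaround, sketched here with a hypothetical table name `test1_raw`, is to declare the column as BIGINT and convert at query time.

```sql
-- Hypothetical table name test1_raw; the file holds epoch seconds, one per line.
CREATE EXTERNAL TABLE test1_raw (epoch BIGINT)
LOCATION '/user/me';

-- from_unixtime returns a 'yyyy-MM-dd HH:mm:ss' string in the session time zone;
-- the cast turns that string into a TIMESTAMP value.
SELECT
  epoch,
  CAST(from_unixtime(epoch) AS TIMESTAMP) AS epoch_ts
FROM test1_raw;
```

A view over this query would give downstream users a Timestamp-typed column without rewriting the file.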

Hive Data selecting latest value based on timestamp

痞子三分冷 submitted on 2019-11-30 16:18:54
I have a table with the following columns:
    C1,C2,Process TimeStamp,InsertDateTimeStamp
    p1,v1,2014-01-30 12:15:23,2013-10-01 05:34:23
    p1,v2,2014-01-31 05:11:34,2013-12-01 06:12:31
    p1,v3,2014-01-31 07:16:05,2012-09-01 07:45:20
    p2,v4,2014-02-01 09:22:52,2013-12-01 06:12:31
    p2,v5,2014-02-01 09:22:52,2012-09-01 07:45:20
Now I want to fetch a unique row for each primary key based on the latest Process TimeStamp. If the Process TimeStamp is the same, then the row with the latest InsertDateTimeStamp should be chosen. So my result should be:
    p1,v3,2014-01-31 07:16:05,2012-09-01 07:45:20
    p2,v4,2014-02-01 09:22:52,2013
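A common pattern for "latest row per key with a tie-breaker" is a window function, sketched below. The table name `process_log` and the column names `process_ts` and `insert_ts` are hypothetical stand-ins for the names in the question, and windowing requires Hive 0.11 or later.

```sql
-- Hypothetical names: table process_log; columns process_ts and insert_ts
-- stand in for "Process TimeStamp" and "InsertDateTimeStamp".
-- Requires windowing functions (Hive 0.11 or later).
SELECT c1, c2, process_ts, insert_ts
FROM (
  SELECT
    c1, c2, process_ts, insert_ts,
    row_number() OVER (
      PARTITION BY c1                           -- one group per primary key
      ORDER BY process_ts DESC, insert_ts DESC  -- latest first, ties broken by insert time
    ) AS rn
  FROM process_log
) ranked
WHERE rn = 1;
```

On the sample rows this ordering puts v3 first for p1 and v4 first for p2, which matches the result the question asks for.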

Hive Data selecting latest value based on timestamp

倖福魔咒の submitted on 2019-11-30 16:14:09
Question: I have a table with the following columns:
    C1,C2,Process TimeStamp,InsertDateTimeStamp
    p1,v1,2014-01-30 12:15:23,2013-10-01 05:34:23
    p1,v2,2014-01-31 05:11:34,2013-12-01 06:12:31
    p1,v3,2014-01-31 07:16:05,2012-09-01 07:45:20
    p2,v4,2014-02-01 09:22:52,2013-12-01 06:12:31
    p2,v5,2014-02-01 09:22:52,2012-09-01 07:45:20
Now I want to fetch a unique row for each primary key based on the latest Process TimeStamp. If the Process TimeStamp is the same, then the row with the latest InsertDateTimeStamp should be chosen.

What is the difference between 'InputFormat, OutputFormat' & 'Stored as' in Hive?

泄露秘密 submitted on 2019-11-30 15:34:17
Question: I'm new to big data and currently learning Hive. I understand the concept of InputFormat and OutputFormat in Hive as part of SerDe. I also understand that 'Stored as' is used to store a file in a particular format, just like InputFormat. But I don't understand what the significant difference is between using 'InputFormat, OutputFormat' and 'Stored as'. Any help is appreciated.
Answer 1: Hive has a lot of options for how to store the data. You can either use external storage where Hive would just wrap
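To make the relationship concrete, here is a sketch of two table definitions that should behave the same for plain text data: the first uses the `STORED AS` shorthand, the second names the input and output format classes directly. The table names `events_short` and `events_explicit` are illustrative.

```sql
-- Shorthand form: Hive picks the input/output format classes for you.
CREATE TABLE events_short (line STRING)
STORED AS TEXTFILE;

-- Explicit form naming the classes that TEXTFILE stands for.
CREATE TABLE events_explicit (line STRING)
STORED AS
  INPUTFORMAT  'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```

The explicit form is mainly useful when a format has no built-in keyword, for example a custom InputFormat class shipped in your own jar.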

What is the difference between 'InputFormat, OutputFormat' & 'Stored as' in Hive?

為{幸葍}努か submitted on 2019-11-30 14:39:32
I'm new to big data and currently learning Hive. I understand the concept of InputFormat and OutputFormat in Hive as part of SerDe. I also understand that 'Stored as' is used to store a file in a particular format, just like InputFormat. But I don't understand what the significant difference is between using 'InputFormat, OutputFormat' and 'Stored as'. Any help is appreciated.
Hive has a lot of options for how to store the data. You can either use external storage, where Hive would just wrap some data from another place, or you can create a standalone table from the start in the Hive warehouse. Input and

How to calculate median in Hive

走远了吗. submitted on 2019-11-30 12:43:04
Question: I have a Hive table:
    name  age  sal
    A     45   1222
    B     50   4555
    c     44   8888
    D     78   1222
    E     12   7888
    F     23   4555
I want to calculate the median of the age column. Below is my approach:
    select min(age) as HMIN,max(age) as HMAX,count(age) as HCount, IF(count(age)%2=0,'even','Odd') as PCOUNT from v_act_subjects_bh;
I would appreciate any query suggestion.
Answer 1: You can use the percentile function to compute the median. Try this:
    select percentile(cast(age as BIGINT), 0.5) from table_name
Source: https://stackoverflow.com/questions
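As a sketch, here is the answer's query applied to the table named in the question, plus an approximate variant for non-integer columns. Using `sal` in the second query is only an illustration, and the aliases are made up.

```sql
-- percentile() needs an integral column; 0.5 selects the median.
SELECT percentile(CAST(age AS BIGINT), 0.5) AS median_age
FROM v_act_subjects_bh;

-- percentile_approx() accepts DOUBLE input and returns an approximation,
-- which can be useful for non-integer columns or very large tables.
SELECT percentile_approx(CAST(sal AS DOUBLE), 0.5) AS approx_median_sal
FROM v_act_subjects_bh;
```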

Convert string to timestamp in Hive

三世轮回 submitted on 2019-11-30 09:37:44
I have the following string representation of a timestamp in my Hive table:
    20130502081559999
I need to convert it to a string like so:
    2013-05-02 08:15:59
I have tried the following ({code} >>> {result}):
    from_unixtime(unix_timestamp('20130502081559999', 'yyyyMMddHHmmss')) >>> 2013-05-03 00:54:59
    from_unixtime(unix_timestamp('20130502081559999', 'yyyyMMddHHmmssMS')) >>> 2013-09-02 08:15:59
    from_unixtime(unix_timestamp('20130502081559999', 'yyyyMMddHHmmssMS')) >>> 2013-05-02 08:10:39
Converting to a timestamp and then unixtime seems weird. What is the proper way to do this? EDIT I figured it out.
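The excerpt cuts off before the author's own fix, so the following is only one possible approach, not necessarily theirs. The first attempt is off because the trailing `ss` absorbs all remaining digits (59999 seconds, about 16 hours 40 minutes, which pushes 08:15 to 00:54:59 the next day). Trimming the string to its first 14 characters before parsing avoids that.

```sql
-- Drop the trailing milliseconds (keep the first 14 characters), then parse.
SELECT from_unixtime(
         unix_timestamp(substr('20130502081559999', 1, 14), 'yyyyMMddHHmmss')
       ) AS formatted;   -- illustrative alias
```

With the milliseconds removed, the pattern lines up digit for digit with the input, so the formatted value should be the target string from the question.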