hiveql | 易学教程

How to retrieve trips from historical data?

I have the following table mytable in Hive:

id  radar_id  car_id  datetime
1   A21       123     2017-03-08 17:31:19.0
2   A21       555     2017-03-08 17:32:00.0
3   A21       777     2017-03-08 17:33:00.0
4   B15       123     2017-03-08 17:35:22.0
5   B15       555     2017-03-08 17:34:05.0
5   B15       777     2017-03-08 20:50:12.0
6   A21       123     2017-03-09 11:00:00.0
7   C11       123     2017-03-09 11:10:00.0
8   A21       123     2017-03-09 11:12:00.0
9   A21       555     2017-03-09 11:12:10.0
10  B15       123     2017-03-09 11:14:00.0
11  C11       555     2017-03-09 11:20:00.0

I want to get the routes of cars passing through radars A21 and B15 within the same trip. For example, if the date is different for the same

How to partition a Hive Table using range of values for a column

I have a Hive table with two columns, Employee ID and Salary. The data looks like this:

Employee ID  Salary
1            10000.08
2            20078.67
3            20056.45
4            30000.76
5            10045.14
6            43567.76

I want to create partitions based on the Salary column, for example one partition for the salary range 10000 to 20000 and another for 20001 to 30000. How do I achieve this?

Hive does not support range partitioning, but you can calculate the ranges during data load. Create a table partitioned by salary_range:

    create table your_table ( employee_id bigint, salary double ) partitioned by (salary_range bigint)

Then insert the rows, using a CASE expression to compute the salary range.
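The load step described in the answer could look roughly like the following. This is a minimal sketch, not the answerer's exact code: the staging table name employees_staging, the choice of the upper bound as the partition label, and the catch-all label 0 are all assumptions made for illustration.

```sql
-- Allow the partition value to be taken from the query result.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE IF NOT EXISTS your_table (
  employee_id BIGINT,
  salary      DOUBLE
)
PARTITIONED BY (salary_range BIGINT);

-- employees_staging is a hypothetical unpartitioned source table
-- with the same employee_id and salary columns.
INSERT OVERWRITE TABLE your_table PARTITION (salary_range)
SELECT
  employee_id,
  salary,
  -- The dynamic partition column must come last in the SELECT list.
  CASE
    WHEN salary >= 10000 AND salary <= 20000 THEN 20000
    WHEN salary >  20000 AND salary <= 30000 THEN 30000
    ELSE 0  -- assumed catch-all bucket for everything outside the listed ranges
  END AS salary_range
FROM employees_staging;
```

Because the salaries are fractional, the second range is written as "greater than 20000" rather than "from 20001", so that a value such as 20000.50 does not fall between two buckets.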

What do the fields 'totalSize' and 'rawDataSize' mean in DESCRIBE EXTENDED query output in Hive?

If one runs the DESCRIBE EXTENDED command on any Hive table, the result presents totalSize and rawDataSize values near the end of the output. What do these fields mean? Example:

    hive> DESCRIBE EXTENDED <TableName>

Output:

    Table(tableName:TablenameXXXXX, dbName:XXxXXX, .......... ....................... numRows=116429472, totalSize=3835205544, rawDataSize=35040221600})

rawDataSize is the size of the original data set; totalSize is the amount of storage it takes. This applies to the ORC file format: because ORC compresses the data, totalSize will be smaller than rawDataSize. The size of data is described by two
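Both values are table statistics, so they can be stale or missing if statistics were never gathered. As a hedged sketch, the statements below recompute them and print them in a more readable layout; my_table is a hypothetical table name.

```sql
-- Recompute basic table statistics (numRows, rawDataSize, totalSize, ...).
ANALYZE TABLE my_table COMPUTE STATISTICS;

-- Show the same metadata as DESCRIBE EXTENDED, one parameter per line;
-- the statistics appear under "Table Parameters".
DESCRIBE FORMATTED my_table;
```

In the output quoted above, rawDataSize (35040221600) is roughly nine times totalSize (3835205544), which is consistent with compressed storage.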

Hive SELECT statement to create an ARRAY of STRUCTS

I'm having trouble selecting into an ARRAY of STRUCTS in Hive. My source table looks like this:

+-------------+--+
| field       |
+-------------+--+
| id          |
| fieldid     |
| fieldlabel  |
| fieldtype   |
| answer_id   |
| unitname    |
+-------------+--+

This is survey data, where the id is the survey id, the four fields in the middle are response data, and the unitname is the business unit that the survey pertains to. I need to create an array of structs for all of the answers for each survey id. I thought this would work, but it doesn't: select id, array( named_struct( "field_id", fieldid, "field_label",
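One common way to get one array per survey id is to aggregate the structs with collect_list and group by id, since array(...) only builds an array from the values of a single row. The following is a sketch under assumptions: the table name survey_responses is hypothetical, the struct keys field_type and answer_id are guessed by analogy with the two keys visible in the question, and it assumes a Hive version whose collect_list accepts struct arguments (older releases may only accept primitive types).

```sql
SELECT
  id,
  -- Gather one struct per source row into a single array per survey id.
  collect_list(
    named_struct(
      'field_id',    fieldid,
      'field_label', fieldlabel,
      'field_type',  fieldtype,   -- assumed key name
      'answer_id',   answer_id    -- assumed key name
    )
  ) AS answers
FROM survey_responses              -- hypothetical table name
GROUP BY id;
```

If unitname is also needed in the result, it would have to be added to the GROUP BY clause or wrapped in an aggregate such as max().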

Hive join optimization

I have two sets of data, both stored in an S3 bucket, which I need to process in Hive and store the output back to S3. Sample rows from each dataset are as follows:

DataSet 1: {"requestId":"TADS6152JHGJH5435", "customerId":"ASJHAGSJH","sessionId":"172356126"}
DataSet 2: {"requestId":"TADS6152JHGJH5435","userAgent":"Mozilla"}

I need to join these two data sets based on the requestId and output a combined row as:

Output: {"requestId":"TADS6152JHGJH5435", "customerId":"ASJHAGSJH","sessionId":"172356126","userAgent":"Mozilla"}

The requestIds in dataset 1 are a proper subset of the requestIds in

Hive - Can one extract common options for reuse in other scripts?

I have two Hive scripts which look like this:

Script A:

    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    SET hive.exec.parallel=true;
    ... do something ...

Script B:

    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    SET hive.exec.parallel=true;
    ... do something else ...

The options that we set at the beginning of each script are the same. Is it possible to extract them to a common place (for example, into a commonoptions.sql) so that our scripts look like this: Script A: <run commonoptions.sql> ... do
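One possible approach, sketched here for the Hive CLI, is to keep the shared SET statements in their own file and pull it in with the CLI's SOURCE command. The path /path/to/commonoptions.sql is a placeholder, and the block below shows the contents of two separate files.

```sql
-- File: commonoptions.sql (shared settings)
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.parallel=true;

-- File: script_a.sql
-- Run the shared settings first, then the script's own statements.
SOURCE /path/to/commonoptions.sql;
-- ... do something ...
```

SOURCE is a Hive CLI command rather than part of the query language itself. In Beeline the equivalent is the !run command, and the Hive CLI can also load an initialization file with the -i option.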

Apache Hive: How to convert string to timestamp?

I'm trying to convert the string in the REC_TIME column to a timestamp format in Hive. Example: Sun Jul 31 09:28:20 UTC 2016 => 2016-07-31 09:28:20

    SELECT xxx, UNIX_TIMESTAMP(REC_TIME, "E M dd HH:mm:ss z yyyy") FROM wlogs LIMIT 10;

When I execute the above SQL it returns a NULL value.

Try this:

    select from_unixtime(unix_timestamp("Sun Jul 31 09:28:20 UTC 2016","EEE MMM dd HH:mm:ss zzz yyyy"));

This works fine if your Hive cluster is in the UTC timezone. If your server is in CST, for example, you need to do the following to get to UTC: select to_utc_timestamp(from_unixtime(unix_timestamp("Sun Jul 31 09:28:20 UTC
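The NULL in the original query most likely comes from the pattern: in Java date-format patterns a single M expects a numeric month, so it cannot parse the text "Jul", whereas MMM matches an abbreviated month name. Applied to the column from the question, the suggested pattern could be used as in this sketch; the alias rec_ts is an assumption, and xxx is the placeholder column from the question.

```sql
SELECT
  xxx,
  -- Parse the text form to epoch seconds, then format it as 'yyyy-MM-dd HH:mm:ss'.
  from_unixtime(
    unix_timestamp(REC_TIME, 'EEE MMM dd HH:mm:ss zzz yyyy')
  ) AS rec_ts
FROM wlogs
LIMIT 10;
```

from_unixtime returns a string in the session time zone, so the result matches the expected value only under the UTC assumption stated in the answer.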

HiveQL and rank()

Question: I can't understand HiveQL rank(). I've found a couple of implementations of rank UDFs on the web, such as Edward's nice example. I can load and access the functions, but I can't get them to do what I want. Here is a detailed example.

Loading the UDF into the CLI process:

    $ javac -classpath /home/hadoop/hadoop/hadoop-core-1.0.4.jar:/home/hadoop/hive/lib/hive-exec-0.10.0.jar com/m6d/hiveudf/Rank2.java
    $ jar -cvf Rank2.jar com/m6d/hiveudf/Rank2.class
    hive> ADD JAR /home/hadoop/MyDemo/Rank2.jar;