hiveql

Import from MySQL to Hive using Sqoop

限于喜欢 submitted on 2019-12-11 13:57:15

Question: I have to import more than 400 million rows from a MySQL table (with a composite primary key) into a partitioned Hive table via Sqoop. The table has two years of data, with a departure-date column ranging from 20120605 to 20140605 and thousands of records per day. I need to partition the data based on the departure date. The versions: Apache Hadoop 1.0.4, Apache Hive 0.9.0, Apache Sqoop sqoop-1.4.2.bin__hadoop-1.0.0. As far as I know, there are 3 approaches: MySQL -> Non-partitioned
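One common pattern for this (a sketch only, with hypothetical table and column names) is to let Sqoop load an unpartitioned staging table first and then redistribute it into the partitioned table with a Hive dynamic-partition insert:

```sql
-- Sketch: assumes Sqoop has already imported the MySQL table into an
-- unpartitioned Hive staging table named flights_stage (hypothetical).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE flights PARTITION (departure_date)
SELECT flight_id, carrier, fare, departure_date   -- partition column must come last
FROM flights_stage;
```

This avoids issuing one Sqoop job per day; the cost is a second full pass over the data inside Hive.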

Hive query to get max of count

為{幸葍}努か submitted on 2019-12-11 13:19:28

Question: My input file looks like this:

id,phnName,price,model
1,iphone,2000,abc
2,iphone,3000,abc1
3,nokia,4000,abc2
4,sony,5000,abc3
5,nokia,6000,abc4
6,iphone,7000,abc5
7,nokia,8500,abc6

I want to write a Hive query to get the max count of a particular phone. Expected output: iphone 3, nokia 3. So far I have tried the following query: select d.phnName, count(*) from phnDetails d group by d.phnName, and got output like this: iphone 3, nokia 3, sony 1. Help me retrieve only the max value.

Answer 1: I have the query
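One way to get only the top counts while preserving ties (a sketch against the phnDetails table from the question; window functions require Hive 0.11+) is to rank the grouped counts and keep rank 1:

```sql
SELECT phnName, cnt
FROM (
  SELECT phnName, cnt,
         RANK() OVER (ORDER BY cnt DESC) AS rnk   -- ties (iphone, nokia) share rank 1
  FROM (
    SELECT phnName, COUNT(*) AS cnt
    FROM phnDetails
    GROUP BY phnName
  ) g
) ranked
WHERE rnk = 1;
```

For the sample data this keeps both iphone and nokia (count 3) and drops sony (count 1).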

Group varying number of rows as columns in Hive table

一曲冷凌霜 submitted on 2019-12-11 11:25:57

Question: I have a Hive table that contains userIDs and some variable choice, and it basically looks like this:

userID selection
1 A
1 D
1 F
2 A
2 C

What I would like to do is condense this information and end up with something like:

userID selection1 selection2 selection3
1 A D F
2 A C

Is this even possible? It isn't clear to me how to do this grouping, given that the number of possible selections varies with the user. It would even be fine if I could do something like:

userID selection
1 A,D,F
2 A,C

I
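The last form shown above maps directly onto Hive's built-in collectors (a sketch; the table name is hypothetical — collect_set deduplicates, while collect_list, available from Hive 0.13, keeps duplicates; element order within the array is not guaranteed):

```sql
SELECT userID,
       concat_ws(',', collect_set(selection)) AS selections
FROM user_selections   -- hypothetical name for the table in the question
GROUP BY userID;
```

collect_set returns an array<string> per group, and concat_ws joins it into the "A,D,F"-style string.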

How to Import XML data into Hive using attributes as columns

谁都会走 submitted on 2019-12-11 09:47:28

Question: I am pretty new to HiveQL and I am kinda stuck :S I have data stored in XML format and I want to extract fields from this XML file into a Hive table of columns (string Titles_2, string Artists_2, string Albums_2). A sample of the XML data:

<?xml version="1.0" encoding="UTF-8"?><MC><SC><S uid="2" gen="" yr="2011" art="Samsung" cmp="<unknown>" fld="/mnt/sdcard/Samsung/Music" alb="Samsung" ttl="Over the horizon"/><S uid="37" gen="" yr="2010" art="Jason Derulo" cmp="<unknown>" fld="/mnt/sdcard
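Assuming each row of a raw staging table holds one <S .../> element in a string column named line (both names hypothetical), the attributes can be pulled out with Hive's xpath UDFs. Note that the literal <unknown> inside the cmp attribute is not well-formed XML; it would need to be escaped (e.g. as &lt;unknown&gt;) before an XML parser will accept the row:

```sql
SELECT xpath_string(line, 'S/@ttl') AS Titles_2,
       xpath_string(line, 'S/@art') AS Artists_2,
       xpath_string(line, 'S/@alb') AS Albums_2
FROM songs_raw;   -- hypothetical staging table, one <S .../> element per row
```

For files where all elements sit on one physical line, the file usually has to be split into one element per row (or handled by an XML SerDe) before this per-row extraction applies.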

Automated List of Hive External tables

走远了吗. submitted于 is translated: 走远了吗. submitted on 2019-12-11 08:48:58

Question: I have to create an automated process that lists all external tables in Hive and does a record count on those tables, run as a daily job. I tried hard-coding all the external table names, but this is not acceptable because the set of tables changes about once a month. I have looked at different approaches, such as [show tables] and executing a query against the metastore DB, but these have not helped me automate the process. Is there a better approach to implement this in Hive?

Answer 1: Something
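One workable approach (a sketch; it queries the metastore database directly rather than Hive, and assumes a MySQL-backed metastore with the standard schema) is to pull the external table names from TBLS, which a wrapper script can then loop over to issue the daily counts:

```sql
-- Run against the metastore database, not against Hive itself.
SELECT d.NAME AS db_name,
       t.TBL_NAME
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
WHERE t.TBL_TYPE = 'EXTERNAL_TABLE';
```

Because the list is computed at run time, new or dropped external tables are picked up automatically without editing the job.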

How to select periods of time with empty data?

旧巷老猫 submitted on 2019-12-11 07:37:10

Question: I want to find all periods with empty data, given the following table my_table:

id day
29 2017-06-05
26 2017-06-05
30 2017-06-06
30 2017-06-06
21 2017-06-06
21 2017-07-01
29 2017-07-01
30 2017-07-20

The answer would be:

Empty_start Empty_end
2017-06-07 2017-06-30
2017-07-02 2017-07-19

It's important that the length of each month is considered; for example, in the first row the answer 2017-06-31 would be incorrect. How can I write this query in Hive?

Answer 1: You can use lag() or lead():

select
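A sketch of the lead() approach against the my_table from the question: take the distinct days, pair each day with its successor, and report a gap whenever they are more than one day apart. Hive's date_add/date_sub are calendar-aware, which is why an impossible date like 2017-06-31 can never be produced:

```sql
SELECT date_add(day, 1)      AS Empty_start,
       date_sub(next_day, 1) AS Empty_end
FROM (
  SELECT day,
         LEAD(day) OVER (ORDER BY day) AS next_day
  FROM (SELECT DISTINCT day FROM my_table) days
) t
WHERE datediff(next_day, day) > 1;
```

The last day has no successor (next_day is NULL), so the WHERE clause drops it, which matches the expected output.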

Part of Filename as a column in Hive Table

丶灬走出姿态 submitted on 2019-12-11 06:59:29

Question: I want to get the first part of my filename as a column in my Hive table. My filename is: 20151102114450.46400_Always_1446482638967.xml. I wrote the query below using a regex in Hive on Microsoft Azure to get the first part of it, i.e. 20151102114450, but when I run the query I get the output 20151102164358.

select CAST(regexp_replace(regexp_replace(regexp_replace(CAST(CAST(regexp_replace(split(INPUT__FILE__NAME,'[_]')[2],'.xml','') AS BIGINT) as TimeStamp),':',''),'-',''),' ','') AS
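The query above takes split(...)[2], i.e. the third underscore-separated piece (1446482638967, an epoch value), which is why converting it to a timestamp yields 20151102164358 rather than the wanted leading prefix. A sketch of a more direct extraction (strip the directory path, then take everything before the first dot or underscore; the table name is hypothetical):

```sql
SELECT split(regexp_extract(INPUT__FILE__NAME, '([^/]+)$', 1), '[._]')[0]
       AS file_prefix   -- 20151102114450 for the sample filename
FROM my_table;
```

regexp_extract pulls the bare filename out of the full HDFS path, and split's pattern argument is a regex, so '[._]' breaks on either separator.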

FAILED: ParseException line 1:21 cannot recognize input near '<EOF>' '<EOF>' '<EOF>' in table name

左心房为你撑大大i submitted on 2019-12-11 06:15:31

Question: Command:

hive -e "use xxx;DROP TABLE IF EXISTS `xxx.flashsaleeventproducts_hist`;CREATE EXTERNAL TABLE `xxx.flashsaleeventproducts_hist`(`event_id` string,`group_code` string,`id` string,`is_deleted` int,`price` int,`price_guide` int,`product_code` int,`product_id` string,`quantity_each_person_limit` int,`quantity_limit_plan` int,`sort_num` int,`update_time` bigint,`meta_offset` bigint,`meta_status` int,`meta_start_time` bigint)PARTITIONED BY(`cur_date` string,`cur_hour` string) ROW FORMAT
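The parse error most likely happens before Hive ever runs: inside double quotes, the shell performs command substitution on backticks, so each `...` identifier is executed as a command and replaced (usually with nothing), leaving hive -e a statement with empty table names. A minimal demonstration of the quoting difference, using echo as a stand-in:

```shell
# Double quotes: the shell substitutes the backticked command before hive sees it.
printf '%s\n' "DROP TABLE IF EXISTS `echo t`;"    # becomes: DROP TABLE IF EXISTS t;

# Single quotes: backticks are passed through literally, intact for Hive.
printf '%s\n' 'DROP TABLE IF EXISTS `t`;'
```

In practice the safest fix is to put the DDL into a .hql file and run `hive -f script.hql`, sidestepping shell quoting entirely.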

Describe table shows “from deserializer” for column comments in Hue Hive Avro format

为君一笑 submitted on 2019-12-11 05:07:02

Question: We have observed that when we store data in Avro format, the byte stream is converted to binary, due to which all the column comments get replaced by "from deserializer". We found a Jira ticket for this issue as well; a few comments confirm it was addressed in version 0.13. We are using Hive 1.1 (Cloudera), but we are still facing the issue. Jira: https://issues.apache.org/jira/browse/HIVE-6681 https://www.bountysource.com/issues/1320154-describe-on-a-table-returns-from-deserializer-for-column

Hive - Rolling up the amount balance from leaf node to top parent

怎甘沉沦 submitted on 2019-12-11 04:31:44

Question: I have a hierarchy table with organization-level parent-child relationships, and another table that has account balances for the lowest-level children in the hierarchy table. I need to find all levels of the hierarchy, from the top parent down to the lowest child. All top-level parent_node rows have "****" as their parent. Please suggest a Hive query to solve this problem.

Input: Hierarchy table:

+---------------+----------------+
|parent_node_id | child_node_id  |
+---------------+----------------+
| C1            | C11            |
+----------
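Hive has no recursive queries, so one common sketch (hypothetical table and column names, assuming a known maximum depth of three levels) chains one self-join per level, anchors on the "****" sentinel, then attaches the leaf balances and rolls them up to the top parent:

```sql
SELECT top.child_node_id AS top_node,
       SUM(b.balance)    AS rolled_up_balance
FROM hierarchy top                                   -- level 1: parent = '****'
JOIN hierarchy mid  ON mid.parent_node_id  = top.child_node_id
JOIN hierarchy leaf ON leaf.parent_node_id = mid.child_node_id
JOIN balances  b    ON b.account_id        = leaf.child_node_id
WHERE top.parent_node_id = '****'
GROUP BY top.child_node_id;
```

For a variable or deeper hierarchy, either add one join per extra level or flatten the tree to (leaf, ancestor) pairs in a preprocessing step.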