hiveql

Import from MySQL to Hive using Sqoop

限于喜欢 submitted on 2019-12-11 13:57:15

Question: I have to import more than 400 million rows from a MySQL table (with a composite primary key) into a partitioned Hive table via Sqoop. The table has two years of data, with a departure-date column ranging from 20120605 to 20140605 and thousands of records per day. I need to partition the data based on the departure date. The versions: Apache Hadoop 1.0.4, Apache Hive 0.9.0, Apache Sqoop sqoop-1.4.2.bin__hadoop-1.0.0. As far as I know, there are 3 approaches: MySQL -> Non-partitioned
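One common pattern for this (a sketch only, with hypothetical table and column names) is to let Sqoop load an unpartitioned staging table first and then redistribute it into the partitioned table with a Hive dynamic-partition insert:

```sql
-- Sketch: assumes Sqoop has already imported the MySQL table into an
-- unpartitioned Hive staging table named flights_stage (hypothetical).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE flights PARTITION (departure_date)
SELECT flight_id, carrier, fare, departure_date   -- partition column must come last
FROM flights_stage;
```

This avoids issuing one Sqoop job per day; the cost is a second full pass over the data inside Hive.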

Hive query to get max of count

為{幸葍}努か submitted on 2019-12-11 13:19:28

Question: My input file looks like this:

id,phnName,price,model
1,iphone,2000,abc
2,iphone,3000,abc1
3,nokia,4000,abc2
4,sony,5000,abc3
5,nokia,6000,abc4
6,iphone,7000,abc5
7,nokia,8500,abc6

I want to write a Hive query to get the max count of a particular phone. Expected output: iphone 3, nokia 3. So far I have tried the following query: select d.phnName, count(*) from phnDetails d group by d.phnName, and got output like this: iphone 3, nokia 3, sony 1. Help me retrieve only the max value.

Answer 1: I have the query
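One way to get only the top counts while preserving ties (a sketch against the phnDetails table from the question; window functions require Hive 0.11+) is to rank the grouped counts and keep rank 1:

```sql
SELECT phnName, cnt
FROM (
  SELECT phnName, cnt,
         RANK() OVER (ORDER BY cnt DESC) AS rnk   -- ties (iphone, nokia) share rank 1
  FROM (
    SELECT phnName, COUNT(*) AS cnt
    FROM phnDetails
    GROUP BY phnName
  ) g
) ranked
WHERE rnk = 1;
```

For the sample data this keeps both iphone and nokia (count 3) and drops sony (count 1).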

Group varying number of rows as columns in Hive table

一曲冷凌霜 submitted on 2019-12-11 11:25:57

Question: I have a Hive table that contains userIDs and some variable choice, and it basically looks like this:

userID selection
1 A
1 D
1 F
2 A
2 C

What I would like to do is condense this information and end up with something like:

userID selection1 selection2 selection3
1 A D F
2 A C

Is this even possible? It isn't clear to me how to do this grouping, given that the number of possible selections varies with the user. It would even be fine if I could do something like:

userID selection
1 A,D,F
2 A,C

I
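The last form shown above maps directly onto Hive's built-in collectors (a sketch; the table name is hypothetical — collect_set deduplicates, while collect_list, available from Hive 0.13, keeps duplicates; element order within the array is not guaranteed):

```sql
SELECT userID,
       concat_ws(',', collect_set(selection)) AS selections
FROM user_selections   -- hypothetical name for the table in the question
GROUP BY userID;
```

collect_set returns an array<string> per group, and concat_ws joins it into the "A,D,F"-style string.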

How to Import XML data into Hive using attributes as columns

谁都会走 submitted on 2019-12-11 09:47:28

Question: I am pretty new to HiveQL and I am kinda stuck :S I have data stored in XML format and I want to extract fields from this XML file into a Hive table of columns (string Titles_2, string Artists_2, string Albums_2). A sample of the XML data:

<?xml version="1.0" encoding="UTF-8"?><MC><SC><S uid="2" gen="" yr="2011" art="Samsung" cmp="<unknown>" fld="/mnt/sdcard/Samsung/Music" alb="Samsung" ttl="Over the horizon"/><S uid="37" gen="" yr="2010" art="Jason Derulo" cmp="<unknown>" fld="/mnt/sdcard
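Assuming each row of a raw staging table holds one <S .../> element in a string column named line (both names hypothetical), the attributes can be pulled out with Hive's xpath UDFs. Note that the literal <unknown> inside the cmp attribute is not well-formed XML; it would need to be escaped (e.g. as &lt;unknown&gt;) before an XML parser will accept the row:

```sql
SELECT xpath_string(line, 'S/@ttl') AS Titles_2,
       xpath_string(line, 'S/@art') AS Artists_2,
       xpath_string(line, 'S/@alb') AS Albums_2
FROM songs_raw;   -- hypothetical staging table, one <S .../> element per row
```

For files where all elements sit on one physical line, the file usually has to be split into one element per row (or handled by an XML SerDe) before this per-row extraction applies.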

Automated List of Hive External tables

走远了吗. submitted于 is translated: 走远了吗. submitted on 2019-12-11 08:48:58

Question: I have to create an automated process that lists all external tables in Hive and does a record count on those tables, run as a daily job. I tried hard-coding all the external table names, but this is not acceptable because the set of tables changes about once a month. I have looked at different approaches, such as [show tables] and executing a query against the metastore DB, but these have not helped me automate the process. Is there a better approach to implement this in Hive?

Answer 1: Something
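One workable approach (a sketch; it queries the metastore database directly rather than Hive, and assumes a MySQL-backed metastore with the standard schema) is to pull the external table names from TBLS, which a wrapper script can then loop over to issue the daily counts:

```sql
-- Run against the metastore database, not against Hive itself.
SELECT d.NAME AS db_name,
       t.TBL_NAME
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
WHERE t.TBL_TYPE = 'EXTERNAL_TABLE';
```

Because the list is computed at run time, new or dropped external tables are picked up automatically without editing the job.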

How to select periods of time with empty data?

旧巷老猫 submitted on 2019-12-11 07:37:10

Question: I want to find all periods with empty data, given the following table my_table:

id day
29 2017-06-05
26 2017-06-05
30 2017-06-06
30 2017-06-06
21 2017-06-06
21 2017-07-01
29 2017-07-01
30 2017-07-20

The answer would be:

Empty_start Empty_end
2017-06-07 2017-06-30
2017-07-02 2017-07-19

It's important that the length of each month is considered; for example, in the first row the answer 2017-06-31 would be incorrect. How can I write this query in Hive?

Answer 1: You can use lag() or lead():

select
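A sketch of the lead() approach against the my_table from the question: take the distinct days, pair each day with its successor, and report a gap whenever they are more than one day apart. Hive's date_add/date_sub are calendar-aware, which is why an impossible date like 2017-06-31 can never be produced:

```sql
SELECT date_add(day, 1)      AS Empty_start,
       date_sub(next_day, 1) AS Empty_end
FROM (
  SELECT day,
         LEAD(day) OVER (ORDER BY day) AS next_day
  FROM (SELECT DISTINCT day FROM my_table) days
) t
WHERE datediff(next_day, day) > 1;
```

The last day has no successor (next_day is NULL), so the WHERE clause drops it, which matches the expected output.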

Part of Filename as a column in Hive Table

丶灬走出姿态 submitted on 2019-12-11 06:59:29

Question: I want to get the first part of my filename as a column in my Hive table. My filename is: 20151102114450.46400_Always_1446482638967.xml. I wrote the query below using a regex in Hive on Microsoft Azure to get the first part of it, i.e. 20151102114450, but when I run the query I get the output 20151102164358.

select CAST(regexp_replace(regexp_replace(regexp_replace(CAST(CAST(regexp_replace(split(INPUT__FILE__NAME,'[_]')[2],'.xml','') AS BIGINT) as TimeStamp),':',''),'-',''),' ','') AS
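The query above takes split(...)[2], i.e. the third underscore-separated piece (1446482638967, an epoch value), which is why converting it to a timestamp yields 20151102164358 rather than the wanted leading prefix. A sketch of a more direct extraction (strip the directory path, then take everything before the first dot or underscore; the table name is hypothetical):

```sql
SELECT split(regexp_extract(INPUT__FILE__NAME, '([^/]+)$', 1), '[._]')[0]
       AS file_prefix   -- 20151102114450 for the sample filename
FROM my_table;
```

regexp_extract pulls the bare filename out of the full HDFS path, and split's pattern argument is a regex, so '[._]' breaks on either separator.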

FAILED: ParseException line 1:21 cannot recognize input near '<EOF>' '<EOF>' '<EOF>' in table name

左心房为你撑大大i submitted on 2019-12-11 06:15:31

Question: Command:

hive -e "use xxx;DROP TABLE IF EXISTS `xxx.flashsaleeventproducts_hist`;CREATE EXTERNAL TABLE `xxx.flashsaleeventproducts_hist`(`event_id` string,`group_code` string,`id` string,`is_deleted` int,`price` int,`price_guide` int,`product_code` int,`product_id` string,`quantity_each_person_limit` int,`quantity_limit_plan` int,`sort_num` int,`update_time` bigint,`meta_offset` bigint,`meta_status` int,`meta_start_time` bigint)PARTITIONED BY(`cur_date` string,`cur_hour` string) ROW FORMAT
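The parse error most likely happens before Hive ever runs: inside double quotes, the shell performs command substitution on backticks, so each `...` identifier is executed as a command and replaced (usually with nothing), leaving hive -e a statement with empty table names. A minimal demonstration of the quoting difference, using echo as a stand-in:

```shell
# Double quotes: the shell substitutes the backticked command before hive sees it.
printf '%s\n' "DROP TABLE IF EXISTS `echo t`;"    # becomes: DROP TABLE IF EXISTS t;

# Single quotes: backticks are passed through literally, intact for Hive.
printf '%s\n' 'DROP TABLE IF EXISTS `t`;'
```

In practice the safest fix is to put the DDL into a .hql file and run `hive -f script.hql`, sidestepping shell quoting entirely.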

Describe table shows “from deserializer” for column comments in Hue Hive Avro format

为君一笑 submitted on 2019-12-11 05:07:02

Question: We have observed that when we store data in Avro format, the byte stream is converted to binary, due to which all the column comments get replaced by "from deserializer". We found a Jira ticket for this issue as well; a few comments confirm it was addressed in version 0.13. We are using Hive 1.1 (Cloudera), but we are still facing the issue. Jira: https://issues.apache.org/jira/browse/HIVE-6681 https://www.bountysource.com/issues/1320154-describe-on-a-table-returns-from-deserializer-for-column

Hive - Rolling up the amount balance from leaf node to top parent

怎甘沉沦 submitted on 2019-12-11 04:31:44

Question: I have a hierarchy table with organization-level parent-child relationships, and another table that has account balances for the lowest-level children in the hierarchy table. I need to find all levels of the hierarchy, from the top parent down to the lowest child. All top-level parent_node rows have "****" as their parent. Please suggest a Hive query to solve this problem.

Input: Hierarchy table:

+---------------+----------------+
|parent_node_id | child_node_id  |
+---------------+----------------+
| C1            | C11            |
+----------
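Hive has no recursive queries, so one common sketch (hypothetical table and column names, assuming a known maximum depth of three levels) chains one self-join per level, anchors on the "****" sentinel, then attaches the leaf balances and rolls them up to the top parent:

```sql
SELECT top.child_node_id AS top_node,
       SUM(b.balance)    AS rolled_up_balance
FROM hierarchy top                                   -- level 1: parent = '****'
JOIN hierarchy mid  ON mid.parent_node_id  = top.child_node_id
JOIN hierarchy leaf ON leaf.parent_node_id = mid.child_node_id
JOIN balances  b    ON b.account_id        = leaf.child_node_id
WHERE top.parent_node_id = '****'
GROUP BY top.child_node_id;
```

For a variable or deeper hierarchy, either add one join per extra level or flatten the tree to (leaf, ancestor) pairs in a preprocessing step.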