HiveQL

How to query struct array with Hive (get_json_object)?

ぃ、小莉子 submitted on 2019-12-03 10:01:24
I store the following JSON objects in a Hive table:

    {
      "main_id": "qwert",
      "features": [
        { "scope": "scope1", "name": "foo", "value": "ab12345", "age": 50, "somelist": ["abcde", "fghij"] },
        { "scope": "scope2", "name": "bar", "value": "cd67890" },
        { "scope": "scope3", "name": "baz", "value": ["A", "B", "C"] }
      ]
    }

"features" is an array of varying length, i.e. all objects are optional. The objects have arbitrary elements, but all of them contain "scope", "name" and "value". This is the Hive table I created:

    CREATE TABLE tbl(
      main_id STRING,
      features array<struct<scope:STRING,name:STRING,value
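A common way to flatten such a struct array in Hive is LATERAL VIEW with explode. The sketch below assumes tbl was created with a JSON SerDe so that features really is an array<struct<...>> column; the selected fields and the filter are only illustrative:

    -- one output row per element of the features array
    SELECT main_id, f.scope, f.name, f.value
    FROM tbl
    LATERAL VIEW explode(features) feat AS f
    WHERE f.scope = 'scope1';

If the JSON is instead kept as a plain STRING column, get_json_object can pull out individual paths, e.g. get_json_object(json_col, '$.features[0].value'), but it returns one value per call rather than exploding the whole array.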

Grouping Hive rows into an array of those rows

 ̄綄美尐妖づ submitted on 2019-12-03 03:53:54
I have a table like the following:

    User:String    Alias:String
    JohnDoe        John
    JohnDoe        JDoe
    Roger          Roger

I would like to group all the aliases of a user into an array, in a new table which would look like this:

    User:String    Alias:array<String>
    JohnDoe        [John, JDoe]
    Roger          [Roger]

I can't figure out how to do that with HiveQL. Do I have to write a UDF for that? Thanks!

Answer: Check out the built-in aggregate function collect_set.

    select User, collect_set(Alias) as Alias from table group by User;

Source: https://stackoverflow.com/questions/16836702/grouping-hive-rows-in-an-array-of-this-rows
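A minimal end-to-end sketch of the same idea, materializing the result as a new table; the table and column names here are illustrative, not from the question:

    CREATE TABLE user_aliases AS
    SELECT user_name,
           collect_set(alias) AS aliases   -- collect_set de-duplicates; collect_list (Hive 0.13+) keeps duplicates
    FROM user_alias_raw
    GROUP BY user_name;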

How to calculate Date difference in Hive

好久不见. submitted on 2019-12-03 00:16:51
I'm a novice. I have an employee table with a column specifying the joining date, and I want to retrieve the list of employees who have joined in the last 3 months. I understand we can get the current date using from_unixtime(unix_timestamp()). How do I calculate the date difference? Is there a built-in DATEDIFF() function like in MS SQL? Please advise!

Answer (Kishore):

    datediff(to_date(String timestamp), to_date(String timestamp))

For example:

    SELECT datediff(to_date('2019-08-03'), to_date('2019-08-01')) <= 2;

If you need the difference in seconds (i.e.: you're comparing dates with timestamps, and not whole
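Applied to the original question, a sketch of the "joined in the last 3 months" filter; the table and column names are assumptions, join_date is assumed to be a 'yyyy-MM-dd' string, and 90 days is used as a rough stand-in for 3 months:

    SELECT *
    FROM employee
    WHERE datediff(to_date(from_unixtime(unix_timestamp())),   -- current date
                   to_date(join_date)) <= 90;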

Hive: ORDER BY a column that is not visible in the output

回眸只為那壹抹淺笑 submitted on 2019-12-02 20:40:27
Question: Let's say I have a table test with columns a, b and c, and test2 with the same columns. Can I create a view of test and test2 joined together and ordered by field c from table test, without showing c in the final output? In my case:

    CREATE VIEW test_view AS
    SELECT t.a, t.b
    FROM (SELECT * FROM test ORDER BY c) t
    JOIN test2 ON t.a = test2.a;

OK, I tested it and it is not possible because of the shuffle phase, so maybe there is another solution to somehow do it? The tables are too big to do a broadcast join. Of course I
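One workaround sketch, using the table and column names from the question but otherwise assumed: do the join first and put the ORDER BY in the innermost query that still has c, then project only a and b in the outer query. Because the outer projection is just a pass over the single sorted output with no further shuffle, the order generally survives in practice, although Hive does not formally guarantee it, and in strict mode an ORDER BY without LIMIT is rejected.

    CREATE VIEW test_view AS
    SELECT a, b
    FROM (
      SELECT t.a, t.b, t.c
      FROM test t
      JOIN test2 t2 ON t.a = t2.a
      ORDER BY c                 -- single-reducer total order, applied after the join
    ) ordered;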

How to copy all Hive tables from one database to another database

自古美人都是妖i submitted on 2019-12-02 18:13:50
I have a default DB in Hive which contains 80 tables. I have created one more database and I want to copy all the tables from the default DB to the new database. Is there any way I can copy from one DB to the other without creating each table individually? Please let me know if there is any solution. Thanks in advance.

Answer (Venkat Ankam): I can think of a couple of options.

1. Use CTAS:

    CREATE TABLE NEWDB.NEW_TABLE1 AS select * from OLDDB.OLD_TABLE1;
    CREATE TABLE NEWDB.NEW_TABLE2 AS select * from OLDDB.OLD_TABLE2;
    ...

2. Use the IMPORT feature of Hive: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport
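For completeness, a sketch of the EXPORT/IMPORT route for a single table; the HDFS path and the table names are illustrative, and in practice you would script this over all 80 tables:

    -- write the table's data and metadata to an HDFS directory
    EXPORT TABLE default.old_table1 TO '/tmp/hive_export/old_table1';

    -- recreate it in the target database
    USE newdb;
    IMPORT TABLE old_table1 FROM '/tmp/hive_export/old_table1';

EXPORT writes both data and metadata, so IMPORT recreates the table definition in the target database without a manual CREATE TABLE.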

Add a column to a table in HiveQL

故事扮演 submitted on 2019-12-02 17:33:43
I'm writing code in Hive to create a table consisting of 1300 rows and 6 columns:

    create table test1 as
    SELECT cd_screen_function,
           SUM(access_count) AS max_count,
           MIN(response_time_min) as response_time_min,
           AVG(response_time_avg) as response_time_avg,
           MAX(response_time_max) as response_time_max,
           SUM(response_time_tot) as response_time_tot,
           COUNT(*) as row_count
    FROM sheet
    WHERE ts_update BETWEEN unix_timestamp('2012-11-01 00:00:00') AND unix_timestamp('2012-11-30 00:00:00')
      and cd_office = '016'
    GROUP BY cd_screen_function
    ORDER BY max_count DESC, cd_screen_function;

Now I want to add
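The question is cut off here, but for the follow-up the title describes, adding a column to the existing table, a minimal sketch; the column name and type below are assumptions:

    -- add the column; existing rows will show NULL for it
    ALTER TABLE test1 ADD COLUMNS (new_col STRING COMMENT 'newly added column');

    -- to populate it you typically rewrite the data, e.g.:
    -- INSERT OVERWRITE TABLE test1 SELECT <existing columns>, 'some_value' FROM test1;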

How to optimize a scan of one huge file/table in Hive to check whether a lat/long point is contained in a WKT geometry shape

走远了吗. submitted on 2019-12-02 13:33:12
Question: I am currently trying to associate each lat/long ping from a device with its ZIP code. I have de-normalized the lat/long device ping data and created a cross-product/Cartesian-product join table in which each row has the ST_Point(long,lat), geometry_shape_of_ZIP and the associated ZIP code for that geometry. For testing purposes I have around 45 million rows in the table, and in production it will increase to about 1 billion every day. Even though the data is flattened and there are no join conditions, the query
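The question appears to use the Esri spatial-framework-for-hadoop UDFs (ST_Point and friends). Under that assumption, one common way to cut the cost of scanning the flattened table is a cheap bounding-box pre-filter before the expensive point-in-polygon test. The sketch below assumes the shape is stored as WKT text, that min/max bounding-box columns have been pre-computed once per ZIP geometry, and that the table and column names (other than geometry_shape_of_ZIP from the question) are illustrative:

    SELECT device_point, zip_code
    FROM flattened_ping_zip
    WHERE lon BETWEEN zip_min_lon AND zip_max_lon      -- cheap range check on the pre-computed bounding box
      AND lat BETWEEN zip_min_lat AND zip_max_lat
      AND ST_Contains(ST_GeomFromText(geometry_shape_of_ZIP),
                      ST_Point(lon, lat));             -- exact test only for rows that survive the range check

In practice the range predicates discard most rows cheaply, so the costly ST_Contains evaluation only runs on the remaining candidates; the UDFs must of course be registered first.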