hiveql

How to pass multiple statements into Spark SQL HiveContext

Submitted by 不羁岁月 on 2019-12-01 17:22:20
For example, I have a few Hive HQL statements that I want to pass to Spark SQL:

set parquet.compression=SNAPPY;
create table MY_TABLE stored as parquet as select * from ANOTHER_TABLE;
select * from MY_TABLE limit 5;

The following doesn't work:

hiveContext.sql("set parquet.compression=SNAPPY; create table MY_TABLE stored as parquet as select * from ANOTHER_TABLE; select * from MY_TABLE limit 5;")

How do I pass these statements to Spark SQL?

Thank you to @SamsonScharfrichter for the answer. This will work:

hiveContext.sql("set spark.sql.parquet.compression.codec=SNAPPY")
hiveContext.sql("create table
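The accepted fix boils down to "one statement per .sql() call". A minimal Python sketch of that idea (the naive semicolon splitter and the commented-out hiveContext loop are illustrative, not part of the original answer):

```python
def split_sql_script(script):
    """Split a semicolon-separated SQL script into individual statements.

    Note: this naive split breaks if a ';' appears inside a string
    literal; a real implementation would need a small tokenizer.
    """
    return [s.strip() for s in script.split(";") if s.strip()]

script = """
set spark.sql.parquet.compression.codec=SNAPPY;
create table MY_TABLE stored as parquet as select * from ANOTHER_TABLE;
select * from MY_TABLE limit 5;
"""

statements = split_sql_script(script)
# Each statement is then submitted on its own, e.g.:
# for stmt in statements:
#     hiveContext.sql(stmt)
```

The key point is that Spark's sql() accepts exactly one statement per call, so the script has to be fed to it statement by statement.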

Query A Nested Array in Parquet Records

Submitted by 依然范特西╮ on 2019-12-01 13:21:47
I am trying different ways to query a record within an array of records and display the complete row as output. I don't know which nested object contains the string "pg", but I want to query against a particular object: if "pg" exists anywhere in it, I want to display that complete row. How do I write a Spark SQL query on nested objects without specifying the object index? That is, I don't want to use an index into children.name. My Avro record:

{
  "name": "Parent",
  "type": "record",
  "fields": [
    {"name": "firstname", "type": "string"},
    {
      "name": "children",
      "type": {
        "type": "array",
        "items": {
          "name": "child",
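In plain Python, the "search without an index" idea is a recursive walk over the nested record. This sketch only illustrates the predicate (the sample rows are invented to match the Avro schema above; it is not Spark code):

```python
def row_contains(value, record):
    """Recursively check whether `value` appears anywhere in a nested
    record built from dicts, lists, and scalars -- no index needed."""
    if isinstance(record, dict):
        return any(row_contains(value, v) for v in record.values())
    if isinstance(record, list):
        return any(row_contains(value, v) for v in record)
    return record == value

rows = [
    {"firstname": "alice", "children": [{"name": "pg"}, {"name": "kt"}]},
    {"firstname": "bob",   "children": [{"name": "xx"}]},
]
# Keep the complete row whenever "pg" occurs somewhere inside it.
matches = [r for r in rows if row_contains("pg", r)]
```

In Spark SQL itself, one index-free way to express the same predicate (assuming children is an array of structs) is array_contains(children.name, 'pg').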

Is there a way to identify or detect data skew in Hive table?

Submitted by 烈酒焚心 on 2019-12-01 12:28:10
Question: We have many Hive queries that take a lot of time. We are using Tez and other good practices such as CBO and ORC files. Is there a way to check or analyze data skew, e.g. with some command? Would an explain plan help, and if so, which parameter should I look for?

Answer 1: An explain plan will not help here; you should check the data itself. If it is a join, select the top 100 join-key values from all tables involved in the join; do the same for the partition-by key if it is an analytic function, and you will see if it is a
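The answer's advice, check the key distribution rather than the explain plan, can be sketched in plain Python (the sample keys are invented; in Hive the equivalent check would be a GROUP BY key ... ORDER BY cnt DESC LIMIT 100 on the join or partition column):

```python
from collections import Counter

def skew_report(keys, top_n=5):
    """Summarize how concentrated a join/partition key is: if the top
    key accounts for a large share of rows, the key is skewed."""
    counts = Counter(keys)
    total = len(keys)
    top = counts.most_common(top_n)
    top_share = top[0][1] / total  # fraction of rows held by the hottest key
    return top, top_share

# A skewed key column: one value dominates.
keys = ["k1"] * 90 + ["k2"] * 5 + ["k3"] * 5
top, share = skew_report(keys)  # hottest key "k1" holds 90% of the rows
```

If one key holds a large fraction of the rows, the reducer handling that key becomes the straggler, which is exactly what the top-100-values check surfaces.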

Hive - Is there a way to further optimize a HiveQL query?

Submitted by 我与影子孤独终老i on 2019-12-01 11:47:44
I have written a query to find the 10 busiest airports in the USA from March to April. It produces the desired output, but I want to try to optimize it further. Are there any HiveQL-specific optimizations that can be applied to the query? Is GROUPING SETS applicable here? I'm new to Hive, and for now this is the shortest query that I've come up with:

SELECT airports.airport, COUNT(Flights.FlightsNum) AS Total_Flights
FROM (
    SELECT Origin AS Airport, FlightsNum
    FROM flights_stats
    WHERE (Cancelled = 0 AND Month IN (3,4))
    UNION ALL
    SELECT Dest AS Airport, FlightsNum
    FROM flights_stats
    WHERE
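What the query computes can be mirrored in a few lines of plain Python, which also makes clear why the UNION ALL is there: each flight counts once for its origin and once for its destination (the toy rows below are invented; column names follow the question):

```python
from collections import Counter

# Toy rows: (origin, dest, month, cancelled)
flights = [
    ("JFK", "LAX", 3, 0),
    ("LAX", "JFK", 3, 0),
    ("ORD", "JFK", 4, 0),
    ("JFK", "ORD", 4, 1),  # cancelled -> excluded
    ("SFO", "LAX", 5, 0),  # wrong month -> excluded
]

counts = Counter()
for origin, dest, month, cancelled in flights:
    if cancelled == 0 and month in (3, 4):
        counts[origin] += 1  # one count under the origin airport...
        counts[dest] += 1    # ...and one under the destination (the UNION ALL)

top10 = counts.most_common(10)  # busiest airports first
```

Since the query is a single filter-then-aggregate over one fact table, GROUPING SETS (which produces multiple grouping combinations in one pass) would not obviously help here; it addresses a different problem than this origin/destination union.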

“Too many fetch-failures” while using Hive

Submitted by 百般思念 on 2019-12-01 11:27:12
I'm running a Hive query against a Hadoop cluster of 3 nodes, and I am getting an error that says "Too many fetch failures". My Hive query is:

insert overwrite table tablename1 partition(namep)
select id, name, substring(name,5,2) as namep from tablename2;

All I want to do is transfer data from tablename2 to tablename1. Any help is appreciated.

javadba: This can be caused by various Hadoop configuration issues. Here are a couple to look for in particular:
- DNS issues: examine your /etc/hosts
- Not enough HTTP threads on the mapper side for the reducers
Some suggested

Hive: Merging Configuration Settings not working

Submitted by [亡魂溺海] on 2019-12-01 11:17:11
On Hive 2.2.0, I am filling an ORC table from another source table of size 1.34 GB, using the query:

INSERT INTO TABLE TableOrc SELECT * FROM Table; ---- (1)

The query creates the TableOrc table with 6 ORC files, which are much smaller than the block size of 256 MB:

-- FolderList1
-rwxr-xr-x user1 supergroup 65.01 MB 1/1/2016, 10:14:21 AM 1 256 MB 000000_0
-rwxr-xr-x user1 supergroup 67.48 MB 1/1/2016, 10:14:55 AM 1 256 MB 000001_0
-rwxr-xr-x user1 supergroup 66.3 MB  1/1/2016, 10:15:18 AM 1 256 MB 000002_0
-rwxr-xr-x user1 supergroup 63.83 MB 1/1/2016, 10:15:41 AM 1 256 MB 000003_0
-rwxr-xr-x user1
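A plausible reading of why these files are not merged: Hive schedules an extra merge job only when the average output file size falls below hive.merge.smallfiles.avgsize (16 MB by default), and ~65 MB files clear that bar. A small Python sketch of that assumed decision rule:

```python
def merge_pass_needed(file_sizes_mb, avgsize_threshold_mb=16):
    """Model of Hive's small-file merge trigger (assumed semantics):
    an extra merge job runs only when the average output file size is
    below hive.merge.smallfiles.avgsize (default ~16 MB)."""
    avg = sum(file_sizes_mb) / len(file_sizes_mb)
    return avg < avgsize_threshold_mb

# The four visible files from the listing above average ~65 MB,
# so with default settings no merge pass would be triggered.
needs_merge = merge_pass_needed([65.01, 67.48, 66.3, 63.83])
```

Under that assumption, raising hive.merge.smallfiles.avgsize (and hive.merge.size.per.task) above the observed file size, with hive.merge.mapfiles/hive.merge.tezfiles enabled, is what would make Hive combine these outputs into fewer, block-sized files.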

How to insert data into a Hive(0.13.1) table?

Submitted by 夙愿已清 on 2019-12-01 11:04:44
I am using Hive version 0.13.1. While trying to insert data into an existing table, I get an error with the queries below:

CREATE TABLE table1 (order_num int, payment_type varchar(20), category varchar(20));
INSERT INTO TABLE table1 VALUES (151, 'cash', 'lunch');

ERROR: ParseException line 1:25 cannot recognize input near 'VALUES' '(' '151' in select clause

While searching, I found everyone suggesting the query above, but unfortunately it's not working for me. Is it due to a different Hive version? I ran into this ambiguity through the link here. I need help inserting data into an existing table in Hive
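INSERT ... VALUES was only added in Hive 0.14, which is why 0.13.1 rejects the statement at the VALUES keyword. The usual pre-0.14 workaround is to rewrite it as INSERT ... SELECT from a one-row helper table. This Python helper that builds such a statement is purely illustrative (the one-row table name `dual` is hypothetical; you would have to create it yourself):

```python
def insert_values_workaround(table, values, dummy_table="dual"):
    """Build the pre-0.14 equivalent of INSERT ... VALUES:
    INSERT INTO TABLE t SELECT <literals> FROM <one-row table> LIMIT 1."""
    def lit(v):
        # Numbers stay bare; everything else becomes a quoted string literal.
        return str(v) if isinstance(v, (int, float)) else "'%s'" % v
    select_list = ", ".join(lit(v) for v in values)
    return "INSERT INTO TABLE %s SELECT %s FROM %s LIMIT 1;" % (
        table, select_list, dummy_table)

stmt = insert_values_workaround("table1", [151, "cash", "lunch"])
# -> INSERT INTO TABLE table1 SELECT 151, 'cash', 'lunch' FROM dual LIMIT 1;
```

The generated statement sidesteps the parser limitation because INSERT ... SELECT has been supported since long before 0.13.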
