hiveql

How to reset textinputformat.record.delimiter to its default value within hive cli / beeline?

本小妞迷上赌 submitted on 2019-12-05 08:50:53
Setting textinputformat.record.delimiter to a non-default value is useful for loading multi-row text, as shown in the demo below. However, I'm failing to set this parameter back to its default value without exiting the CLI and reopening it. None of the following options worked (nor some other attempts):

```
set textinputformat.record.delimiter='\n';
set textinputformat.record.delimiter='\r';
set textinputformat.record.delimiter='\r\n';
set textinputformat.record.delimiter=' ';
reset;
```

Any thoughts? Thanks.

Demo:

```
create table mytable (mycol string);
insert into mytable select concat('Hello',unhex('A'),
```
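One plausible explanation (an assumption on my part, not verified against Hive's parser) is that typing '\n' in the CLI stores the two literal characters backslash + n rather than an actual newline, so none of the attempts above actually restores the real default delimiter. A minimal Python sketch of that difference:

```python
# Sketch of the assumed behavior: '\n' typed in the CLI arrives as the
# two-character sequence backslash + n, not as a real newline character.
literal_backslash_n = "\\n"   # what the CLI would store from: set ...='\n'
real_newline = "\n"           # the actual default record delimiter

text = "row1\nrow2"
print(text.split(literal_backslash_n))  # no split: literal "\n" never occurs in the text
print(text.split(real_newline))         # splits into the two expected records
```

If the assumption holds, the session keeps splitting records on the literal two-character sequence, which never appears in normal input, so every file looks like one giant record until the CLI is restarted.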

How to extract selected values from json string in Hive

最后都变了- submitted on 2019-12-05 08:11:28
I am running a simple query in Hive that produces the following output (with a few other additional columns):

```
|------|---------------------------------------------------|
| col1 | col2                                              |
|------|---------------------------------------------------|
| A    | {"variable1":123,"variable2":456,"variable3":789} |
| B    | {"variable1":222,"variable2":333,"variable3":444} |
|------|---------------------------------------------------|
```

I need to be able to parse the JSON string and pull out the values for each token
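In Hive itself this kind of extraction is usually done with the built-in get_json_object or json_tuple UDFs (e.g. get_json_object(col2, '$.variable1')). What those calls return can be sketched in plain Python:

```python
import json

# The two rows from the question, as (col1, col2) pairs
rows = [
    ("A", '{"variable1":123,"variable2":456,"variable3":789}'),
    ("B", '{"variable1":222,"variable2":333,"variable3":444}'),
]

# Equivalent of: select col1, get_json_object(col2, '$.variable1'), ... from t
for col1, col2 in rows:
    parsed = json.loads(col2)
    print(col1, parsed["variable1"], parsed["variable2"], parsed["variable3"])
```

json_tuple is generally preferred when several keys are pulled from the same column, since it parses the JSON string once per row instead of once per key.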

unable to create hive table with primary key

自闭症网瘾萝莉.ら submitted on 2019-12-05 06:45:36
I am unable to create an external table in Hive with a primary key. Here is the example code:

```
hive> create table exmp((name string),primary key(name));
```

This returns the following error message:

```
NoViableAltException(278@[])
    at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.identifier(HiveParser_IdentifiersParser.java:11216)
    at org.apache.hadoop.hive.ql.parse.HiveParser.identifier(HiveParser.java:35977)
    at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameType(HiveParser.java:31169)
    at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameTypeList(HiveParser.java:29373)
    at
```

Difference between `load data inpath ` and `location` in hive?

戏子无情 submitted on 2019-12-05 05:51:07
At my firm, I see these two commands used frequently, and I'd like to understand the differences, because their functionality seems the same to me:

1.
```
create table <mytable> (name string, number double);
load data inpath '/directory-path/file.csv' into <mytable>;
```

2.
```
create table <mytable> (name string, number double);
location '/directory-path/file.csv';
```

They both copy the data from the directory on HDFS into the directory for the table in Hive. Are there differences that one should be aware of when using these? Thank you.

Yes, they are used for entirely different purposes. load data inpath
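The key operational difference, as I understand it: LOAD DATA INPATH physically moves the files into the table's warehouse directory, while LOCATION makes the table's metadata point at an existing directory without moving anything. That distinction can be sketched with local paths in Python (paths are illustrative stand-ins for HDFS):

```python
import shutil
import tempfile
from pathlib import Path

base = Path(tempfile.mkdtemp())
staging = base / "staging"                  # stand-in for the HDFS input directory
staging.mkdir()
(staging / "file.csv").write_text("alice,1.5\n")
warehouse = base / "warehouse" / "mytable"  # stand-in for the table's warehouse dir
warehouse.mkdir(parents=True)

# LOAD DATA INPATH: the file is *moved* out of the source directory
shutil.move(str(staging / "file.csv"), str(warehouse / "file.csv"))
print((staging / "file.csv").exists())      # the source no longer has the file
print((warehouse / "file.csv").exists())    # it now lives under the table directory

# LOCATION: nothing is moved; the table metadata simply records this path
table_location = staging
```

This is why LOAD DATA INPATH "empties" the staging directory, whereas a table created with LOCATION (typically an external table) keeps reading whatever is in the referenced directory.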

HIVE: How to include null rows in lateral view explode

纵饮孤独 submitted on 2019-12-05 05:40:50
I have a table as follows:

```
user_id  email
u1       e1, e2
u2       null
```

My goal is to convert this into the following format:

```
user_id  email
u1       e1
u1       e2
u2       null
```

So for this I am using the lateral view explode() function in Hive, as follows:

```
select * FROM table LATERAL VIEW explode ( split ( email ,',' ) ) email AS email_id
```

But doing this, the u2 row gets skipped because it has a null value in email. How can we include the nulls in the output as well?

Edit: I am using a workaround that unions this table with the base table without the explode, but I think the data will be scanned one more time because of this. I
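Hive has a LATERAL VIEW OUTER explode(...) variant for exactly this: with the OUTER keyword, rows whose exploded expression is null or empty are emitted with a NULL instead of being dropped, which avoids the extra-scan union workaround. Its behavior can be sketched in Python:

```python
def outer_explode(rows):
    """Mimic LATERAL VIEW OUTER explode(split(email, ',')): keep null rows."""
    out = []
    for user_id, email in rows:
        if email is None:
            out.append((user_id, None))  # OUTER keeps rows with nothing to explode
        else:
            for part in email.split(","):
                out.append((user_id, part.strip()))
    return out

print(outer_explode([("u1", "e1, e2"), ("u2", None)]))
```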

How to alter Hive partition column name

笑着哭i submitted on 2019-12-05 01:59:40
I have to change the partition column name (not the partition spec). I looked for the commands in the Hive wiki and some Google pages. I can find options for altering the partition spec, e.g. in /table/country='US' I can change US to USA, but I want to change country to continent. I feel like the only option available for changing a partition column name is dropping and re-creating the table. If there is any other option available, please help me. Thanks in advance.

You can change the column name in the metadata by following: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

Hive scanning entire data for bucketed table

三世轮回 submitted on 2019-12-04 18:12:22
I was trying to optimize a Hive SQL query by bucketing the data on a single column. I created the table with the following statement:

```
CREATE TABLE `source_bckt`(
  `uk` string,
  `data` string)
CLUSTERED BY(uk) SORTED BY(uk) INTO 10 BUCKETS
```

Then I inserted the data after executing `set hive.enforce.bucketing = true;`.

When I run the following select:

```
select * from source_bckt where uk='1179724';
```

even though the data is supposed to be in a single file, which can be identified by the equation HASH('1179724')%10, the MapReduce job spawned scans through the entire set of files. Any idea?

This optimization is
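For reference, the expected bucket number can be computed by hand. Assuming the string hash behaves like Java's String.hashCode (an assumption for illustration; Hive's actual hashing lives in its ObjectInspector utilities and may differ), a Python sketch:

```python
def java_string_hashcode(s):
    """Java's String.hashCode with 32-bit signed overflow semantics."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Re-interpret the 32-bit value as a signed Java int
    return h - 0x100000000 if h >= 0x80000000 else h

num_buckets = 10
bucket = java_string_hashcode("1179724") % num_buckets
print(bucket)  # the bucket file that should hold uk='1179724' under this assumption
```

Even with correct bucketing, note that bucket pruning on equality predicates is not applied in all Hive versions/engines, which would explain the full scan regardless of where the row lives.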

Is there a way to prevent a Hive table from being overwritten if the SELECT query of the INSERT OVERWRITE does not return any results

本秂侑毒 submitted on 2019-12-04 17:56:52
I am developing a batch job that loads data into Hive tables from HDFS files. The flow of data is as follows:

1. Read the file received in HDFS using an external Hive table
2. INSERT OVERWRITE the final Hive table from the external Hive table, applying certain transformations
3. Move the received file to Archive

This flow works fine if there is a file in the input directory for the external table to read during step 1. If there is no file, the external table will be empty, and as a result executing step 2 will empty the final table. If the external table is empty, I would like to keep the existing data in
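One way to guard step 2 is to count the rows in the external table first and skip the INSERT OVERWRITE entirely when the count is zero. A minimal Python sketch of that control flow (the actual hive invocations are left abstract; the callables and names are illustrative):

```python
def guarded_overwrite(count_rows, run_insert_overwrite):
    """Skip the destructive INSERT OVERWRITE when the source table is empty.

    count_rows: callable returning the external table's row count, e.g. the
                parsed output of `hive -S -e "select count(*) from ext_table"`.
    run_insert_overwrite: callable that executes the INSERT OVERWRITE statement.
    """
    if count_rows() == 0:
        return False  # keep the existing data in the final table
    run_insert_overwrite()
    return True

# Usage with stand-in callables:
print(guarded_overwrite(lambda: 0, lambda: None))   # empty source: skipped
print(guarded_overwrite(lambda: 42, lambda: None))  # non-empty: insert runs
```

The same guard fits naturally in the driver script (bash or Oozie/Airflow), checking either the row count or simply whether the input directory contains any files before step 2 runs.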

Difference in statistics from Google Analytics Report and BigQuery Data in Hive table

谁都会走 submitted on 2019-12-04 16:41:55
I have a Google Analytics premium account set up to monitor the user activity of a website and mobile application. Raw data from GA is being stored in BigQuery tables. However, I noticed that the statistics that I see in a GA report are quite different from the statistics that I see when querying the BigQuery tables. I understand that GA reports show aggregated data and possibly sampled data, and that the raw data in BigQuery tables is session/hit-level data. But I am still not sure I understand why the statistics could be different. Would really appreciate it if someone clarified

passing multiple dates as a paramters to Hive query

风格不统一 submitted on 2019-12-04 16:06:16
I am trying to pass a list of dates as a parameter to my Hive query.

```
#!/bin/bash
echo "Executing the hive query - Get distinct dates"
var=`hive -S -e "select distinct substr(Transaction_date,0,10) from test_dev_db.TransactionUpdateTable;"`
echo $var
echo "Executing the hive query - Get the parition data"
hive -hiveconf paritionvalue=$var -e 'SELECT Product FROM test_dev_db.TransactionMainHistoryTable where tran_date in("${hiveconf:paritionvalue}");'
echo "Hive query - ends"
```

Output:

```
Executing the hive query - Get distinct dates
2009-02-01 2009-04-01
Executing the hive query - Get the parition
```
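The distinct dates come back as whitespace-separated tokens, so before being passed through -hiveconf they need to be re-joined into a quoted, comma-separated list that is valid inside IN (...). Building that string can be sketched in Python (the same join can be done in bash with printf/sed):

```python
raw = "2009-02-01 2009-04-01"  # $var as echoed by the first hive -S -e call

# Quote each date and join with commas so the list is valid inside IN (...)
in_list = ",".join(f"'{d}'" for d in raw.split())
query = (
    "SELECT Product FROM test_dev_db.TransactionMainHistoryTable "
    f"where tran_date in({in_list});"
)
print(in_list)  # '2009-02-01','2009-04-01'
print(query)
```

With the list in this form, the hiveconf substitution expands to a syntactically valid predicate instead of one unquoted blob of dates.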