hiveql

“Too many fetch-failures” while using Hive

Question: I'm running a Hive query against a Hadoop cluster of 3 nodes, and I am getting an error which says "Too many fetch failures". My Hive query is:

    insert overwrite table tablename1 partition(namep)
    select id, name, substring(name,5,2) as namep
    from tablename2;

All I want to do is transfer data from tablename2 to tablename1. Any help is appreciated.

Answer 1: This can be caused by various Hadoop configuration issues. Here are a couple to look for in particular. DNS issue: make sure every node can resolve every other node's hostname consistently (check /etc/hosts on all nodes); reducers report "too many fetch failures" when they cannot fetch map output, and broken name resolution between nodes is a classic cause.

How to list all hive databases being in use or created so far?

Similar to the SHOW TABLES command, do we have any such command to list all databases created so far?

Answer 1: This page mentions the command SHOW DATABASES. From the manual:

    SHOW (DATABASES|SCHEMAS) [LIKE identifier_with_wildcards];

SHOW DATABASES lists all of the databases defined in the metastore. The optional LIKE clause allows the list of databases to be filtered using a regular expression. Wildcards in the regular expression can only be '*' for any character(s) or '|' for a choice. Examples are 'employees', 'emp*', 'emp*|*ees', all of which will match the database named 'employees'.

    show databases;
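A quick illustration of the LIKE filter (the 'emp*' pattern and the matching database name are just examples):

    show databases;               -- every database in the metastore
    show databases like 'emp*';   -- only databases whose names start with 'emp'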

Select top 2 rows in Hive

I'm a newbie here. I'm trying to retrieve the top 2 rows from my employee list based on salary, in Hive (version 0.11). Since it doesn't support a TOP function, are there any alternatives? Or do we have to define a UDF?

Answer 1 (Akshay Shrivastava): Yes, you can use LIMIT here. You can try it with the query below:

    SELECT * FROM employee_list SORT BY salary DESC LIMIT 2;

Note that SORT BY only orders rows within each reducer, so with more than one reducer it does not guarantee the global top 2; the ORDER BY form does:

    select * from employee_list order by salary desc limit 2;

Source: https://stackoverflow.com/questions/30441744/select-top-2-rows-in-hive

Transferring hive table from one database to another

I need to move a Hive table from one database to another. How can I do that?

Answer 1: Since Hive 0.14, you can use the following statements to move a table from one database to another within the same metastore:

    use old_database;
    alter table table_a rename to new_database.table_a;

The statements above will also move the table data on HDFS if table_a is a managed table.

Answer 2: Create an external table in the new database on top of the old table's files:

    create external table new_db.table like old_db.table location '<path of the table's files in HDFS>';

If the table is partitioned, you then have to add the partitions to new_db.table as well (see the sketch below).

Answer 3: You can try CTAS:

    USE NEW_DB;
    CREATE TABLE table AS SELECT * FROM OLD_DB.table;
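For the partitioned case, the partitions have to be registered in the metastore before they are queryable. A minimal sketch (the table name, partition column, and value are hypothetical):

    -- register one partition explicitly
    alter table new_db.table_a add partition (day='2019-12-01');

    -- or let Hive discover all partitions from the directory layout
    msck repair table new_db.table_a;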

Empty String is not treated as null in Hive

Question: My understanding of the following statement is that if a blank or empty string is inserted into a Hive column, it will be treated as null:

    TBLPROPERTIES('serialization.null.format'='')

To test the functionality, I created a table and inserted '' into field3. When I query for nulls on field3, there are no rows matching that criterion. Is my understanding of how blank strings map to null correct?

    CREATE TABLE CDR (
      field1 string,
      field2 string,
      field3 string
    )
    ROW FORMAT DELIMITED FIELDS
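The DDL above breaks off mid-statement; a complete definition would look something like this (the comma delimiter and TEXTFILE storage are assumptions, not from the thread):

    CREATE TABLE CDR (
      field1 string,
      field2 string,
      field3 string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','   -- delimiter assumed
    STORED AS TEXTFILE
    TBLPROPERTIES ('serialization.null.format' = '');

    -- if the SerDe honors the property, empty field3 values in the data
    -- files should read back as NULL:
    SELECT COUNT(*) FROM CDR WHERE field3 IS NULL;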

Wrong result for count(*) in hive table

Question: I have created a table in Hive:

    CREATE TABLE IF NOT EXISTS daily_firstseen_analysis (
      firstSeen STRING,
      category STRING,
      circle STRING,
      specId STRING,
      language STRING,
      osType STRING,
      count INT
    )
    PARTITIONED BY (day STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS orc;

count(*) is not giving me the correct result for this table:

    hive> select count(*) from daily_firstseen_analysis;
    OK
    75
    Time taken: 0.922 seconds, Fetched: 1 row(s)

The actual number of rows in the table is 959.
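The thread is cut off here, but one well-known cause of this symptom is worth noting as a hedged guess: on ORC tables, Hive may answer count(*) from stale table statistics instead of scanning the data. The settings below are standard Hive options, not taken from the thread:

    -- force a real scan instead of answering from statistics
    SET hive.compute.query.using.stats=false;
    SELECT COUNT(*) FROM daily_firstseen_analysis;

    -- or bring the statistics back in sync with the data
    ANALYZE TABLE daily_firstseen_analysis PARTITION(day) COMPUTE STATISTICS;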

PySpark: withColumn() with two conditions and three outcomes

Question: I am working with Spark and PySpark. I am trying to achieve the result equivalent to the following pseudocode:

    df = df.withColumn('new_column',
        IF fruit1 == fruit2 THEN 1 ELSE 0,
        IF fruit1 IS NULL OR fruit2 IS NULL THEN 3)

I am trying to do this in PySpark but I'm not sure about the syntax. Any pointers? I looked into expr() but couldn't get it to work. Note that df is a pyspark.sql.dataframe.DataFrame.

Answer 1: There are a few efficient ways to implement this. Let's start with the required imports:
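The answer is truncated right after that sentence; a minimal sketch of the standard when()/otherwise() approach (column names are taken from the question; the null check must come first, because a comparison involving NULL never evaluates to true):

    from pyspark.sql.functions import col, when

    df = df.withColumn(
        'new_column',
        when(col('fruit1').isNull() | col('fruit2').isNull(), 3)  # null case first
        .when(col('fruit1') == col('fruit2'), 1)                  # values match
        .otherwise(0)                                             # values differ
    )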

Is Hive's collect_list ordered?

This page says of collect_list: "Returns a list of objects with duplicates." Is that list ordered? For example, in the order of the query results?

Answer 1: The built-in collect_list isn't guaranteed to be ordered, even if you do an ORDER BY first (and even if it did ensure order, doing it that way would be a waste of time). Just use Brickhouse's collect; it ensures the elements are ordered.

Answer 2: It's correct that collect_list isn't guaranteed to be ordered. The function sort_array will sort the result:

    select a, b, sort_array(collect_list(c)) as sorted_c
    from the_table
    group by a, b;

Source: https://stackoverflow.com/questions
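Brickhouse's collect is a third-party UDAF, so it has to be registered before use. A sketch, assuming the Brickhouse jar has already been built (the jar path is hypothetical; the class name is the one the Brickhouse project documents):

    ADD JAR /path/to/brickhouse.jar;  -- hypothetical path
    CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF';

    select a, b, collect(c) as cs
    from the_table
    group by a, b;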

Unable to connect to HIVE2 via JAVA

Referring to Hive2, I created a simple Java program to connect to a HiveServer2 instance (not local). I added all of the jars mentioned in the link above to the classpath in Eclipse as well; however, when I run the code it throws this error:

    09:42:35,580 INFO Utils:285 - Supplied authorities: hdstg-c01-edge-03:20000
    09:42:35,583 INFO Utils:372 - Resolved authority: hdstg-c01-edge-03:20000
    09:42:35,656 INFO HiveConnection:189 - Will try to open client transport with JDBC Uri: jdbc:hive2://hdstg-c01-edge-03:20000
    FAILED: f java.lang.NoSuchMethodError: org.apache.thrift.protocol.TProtocol.getScheme()Ljava/lang

(A NoSuchMethodError like this one usually means a conflicting version of libthrift is on the classpath, older than the one the Hive JDBC driver was compiled against.)

Is LIMIT clause in HIVE really random?

The documentation of Hive notes that the LIMIT clause returns rows "chosen at random". I have been running a SELECT with LIMIT 1 on a table with more than 800,000 records, but it always returns the same record. I'm using the Shark distribution, and I am wondering whether that has anything to do with this unexpected behavior. Any thoughts would be appreciated. Thanks, Visakh

Answer 1: Even though the documentation states it returns rows at random, that's not actually true. It returns rows "chosen at random" only in the sense that they come back in whatever order they happen to appear in the table when no WHERE or ORDER BY clause is applied. This means it's not truly random: Hive simply returns the first rows it reads, which is why you keep seeing the same record.
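If a genuinely random row is what's wanted, ordering by rand() achieves it at the cost of a full sort over the data (a sketch, not from the thread; big_table stands in for the questioner's table):

    -- rand() gives each row a random sort key, so the one row kept is random
    select * from big_table order by rand() limit 1;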