hiveql

Delta/Incremental Load in Hive

◇◆丶佛笑我妖孽 submitted on 2019-11-30 07:30:13
I have the use case below: My application has a table in an RDBMS holding multiple years of data. We used Sqoop to pull the data into HDFS and loaded it into a Hive table partitioned by year and month. The application also updates and inserts new records into the RDBMS table daily, and these updated records can span historical months. Updated and newly inserted records can be identified by an updated-timestamp field (it will carry the current day's timestamp). The problem: how do I do a daily delta/incremental load of the Hive table from these updated records? I know there is a Sqoop…
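The snippet cuts off before the answer, but a common reconciliation pattern looks like the sketch below. It is only a sketch under assumptions: the table and column names (base, delta_staging, base_merged, id, col1, updated_ts) are hypothetical, and it assumes Hive 0.13+ window functions with dynamic partitioning enabled.

-- Hypothetical sketch: merge the daily delta into a merged copy of the
-- partitioned base table, keeping only the newest version of each id.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Write to a separate table so we never read and overwrite base
-- in the same query; swap it in afterwards.
INSERT OVERWRITE TABLE base_merged PARTITION (year, month)
SELECT id, col1, updated_ts, year, month
FROM (
    SELECT id, col1, updated_ts, year, month,
           ROW_NUMBER() OVER (PARTITION BY id
                              ORDER BY updated_ts DESC) AS rn
    FROM (
        SELECT id, col1, updated_ts, year, month FROM base
        UNION ALL
        SELECT id, col1, updated_ts, year, month FROM delta_staging
    ) unioned
) ranked
WHERE rn = 1;

In practice you would also restrict the scan of base to just the partitions touched by the delta, so the whole table is not rewritten every day.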

How to calculate median in Hive

孤者浪人 submitted on 2019-11-30 03:21:39
I have a Hive table:

name  age  sal
A     45   1222
B     50   4555
c     44   8888
D     78   1222
E     12   7888
F     23   4555

I want to calculate the median of the age column. Below is my approach:

select min(age) as HMIN, max(age) as HMAX, count(age) as HCount,
       IF(count(age)%2=0,'even','Odd') as PCOUNT
from v_act_subjects_bh;

Appreciate any query suggestions.

Amar: You can use the percentile function to compute the median. Try this:

select percentile(cast(age as BIGINT), 0.5) from table_name

Source: https://stackoverflow.com/questions/26863139/how-to-calculate-median-in-hive
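Worth noting: percentile() only accepts integral types, which is why the cast is needed; for non-integral columns, percentile_approx() takes a DOUBLE. A short example against the asker's table:

-- Exact median of an integral column, approximate median of a DOUBLE:
SELECT percentile(CAST(age AS BIGINT), 0.5)        AS median_age,
       percentile_approx(CAST(sal AS DOUBLE), 0.5) AS median_sal_approx
FROM v_act_subjects_bh;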

How to select current date in Hive SQL

陌路散爱 submitted on 2019-11-30 02:36:50
How do we get the current system date in Hive? In MySQL we have select now(); can anyone help me get the equivalent here? I am very new to Hive. Is there proper documentation for Hive that gives detailed information about the pseudo columns and built-in functions? According to the LanguageManual, you can use unix_timestamp() to get the "current time stamp using the default time zone." If you need to convert that to something more human-readable, you can use from_unixtime(unix_timestamp()). Hope that helps. Aniket Asati: Yes... I am using Hue 3.7.0 - The Hadoop UI, and to get…
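Putting the answer's idiom together, plus the built-ins added in later releases (current_date and current_timestamp assume Hive 1.2.0 or newer):

-- Human-readable current timestamp via the classic idiom:
SELECT from_unixtime(unix_timestamp());

-- Built-ins available since Hive 1.2.0:
SELECT current_date;       -- date only, e.g. 2019-11-30
SELECT current_timestamp;  -- e.g. 2019-11-30 02:36:50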

How to create an external Hive table with column typed Timestamp

旧时模样 submitted on 2019-11-29 23:54:33
Question: I want to create an external Hive table from a text file containing epochs in HDFS. Let's say the file is located at /user/me/test.txt. Here's the file content:

1354183921
1354183922

I have Hive 0.8.1 installed and should be able to use the Timestamp type, so I created the table:

hive> CREATE EXTERNAL TABLE test1 (epoch Timestamp) LOCATION '/user/me';

Then I queried the table:

SELECT * FROM test1;

and got the following exception: Failed with exception java.io.IOException: java.lang…
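The exception is truncated above, but the underlying issue is that TIMESTAMP columns in plain text files must already be formatted as yyyy-MM-dd HH:mm:ss; raw epoch seconds do not parse. One common workaround, sketched here rather than quoted from the accepted answer, is to declare the column as BIGINT and convert on read:

-- Declare the raw epoch as a plain BIGINT...
CREATE EXTERNAL TABLE test1 (epoch BIGINT)
LOCATION '/user/me';

-- ...and convert it when querying:
SELECT CAST(from_unixtime(epoch) AS TIMESTAMP) AS ts
FROM test1;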

PySpark: withColumn() with two conditions and three outcomes

∥☆過路亽.° submitted on 2019-11-29 22:59:16
I am working with Spark and PySpark. I am trying to achieve the result equivalent to the following pseudocode:

df = df.withColumn('new_column',
    IF fruit1 == fruit2 THEN 1, ELSE 0.
    IF fruit1 IS NULL OR fruit2 IS NULL 3.)

I am trying to do this in PySpark but I'm not sure about the syntax. Any pointers? I looked into expr() but couldn't get it to work. Note that df is a pyspark.sql.dataframe.DataFrame.

There are a few efficient ways to implement this. Let's start with the required imports:

from pyspark.sql.functions import col, expr, when

You can use the Hive IF function inside expr:

new_column_1 = …
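The snippet ends mid-assignment; below is a sketch of how the two usual approaches could look (the column names come from the question, the sample data is made up so the example runs end to end):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr, when

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("apple", "apple"), ("apple", "pear"), (None, "pear")],
    ["fruit1", "fruit2"],
)

# Option 1: Hive IF inside expr(), as the answer starts to show:
new_column_1 = expr(
    "IF(fruit1 IS NULL OR fruit2 IS NULL, 3, IF(fruit1 = fruit2, 1, 0))"
)

# Option 2: when()/otherwise() chaining; nulls are checked first so
# they map to 3 rather than falling through the equality test:
new_column_2 = (
    when(col("fruit1").isNull() | col("fruit2").isNull(), 3)
    .when(col("fruit1") == col("fruit2"), 1)
    .otherwise(0)
)

df.withColumn("new_column", new_column_2).show()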

Hive QL - Limiting number of rows per each item

无人久伴 submitted on 2019-11-29 19:13:01
Question: If I have multiple items listed in a WHERE clause, how would one go about limiting the results to N for each item in the list? For example:

select a_id, b, c, count(*) as sumrequests
from table_name
where a_id in (1,2,3)
group by a_id, b, c
limit 10000

Answer 1: Sounds like your question is how to get the top N per a_id. You can do this with a window function, introduced in Hive 11. Something like:

SELECT a_id, b, c, count(*) as sumrequests
FROM (
  SELECT a_id, b, c,
         row_number() over (Partition BY a_id) as row…
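The answer is cut off mid-query; a completed sketch of that top-N-per-key pattern follows. The ORDER BY inside the window is an assumption about what "top" should mean here, and N is set to 10 for illustration:

SELECT a_id, b, c, sumrequests
FROM (
  SELECT a_id, b, c, sumrequests,
         ROW_NUMBER() OVER (PARTITION BY a_id
                            ORDER BY sumrequests DESC) AS rn
  FROM (
    SELECT a_id, b, c, COUNT(*) AS sumrequests
    FROM table_name
    WHERE a_id IN (1, 2, 3)
    GROUP BY a_id, b, c
  ) agg
) ranked
WHERE rn <= 10;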

Row number functionality in Hive

匆匆过客 submitted on 2019-11-29 19:05:02
Question: How can I generate row numbers for an existing table while running a select query? For example:

select row_number(), * from emp;

I am using Hive 0.13. I can't access external jars or UDFs in my environment. The underlying files are in Parquet format. Thanks in advance!

Answer 1: ROW_NUMBER() is a windowing function, so it needs to be used in conjunction with an OVER clause. Just don't specify any PARTITION:

SELECT *, ROW_NUMBER() OVER () AS row_num FROM emp
-- other stuff

Answer 2: row_number() can be…
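The second answer is truncated; for completeness, here is the first answer's query in runnable form, plus an ordered variant (the empno column is hypothetical):

-- Unordered numbering, as in Answer 1 (the row order is unspecified):
SELECT t.*, ROW_NUMBER() OVER () AS row_num
FROM emp t;

-- If a stable numbering is needed, order the window explicitly:
SELECT t.*, ROW_NUMBER() OVER (ORDER BY empno) AS row_num
FROM emp t;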

How to convert date 2017-sep-12 to 2017-09-12 in Hive

試著忘記壹切 submitted on 2019-11-29 16:12:59
I am facing an issue converting a date in Hive. I need to convert 2017-sep-12 to 2017-09-12. How can I achieve this in Hive?

Use unix_timestamp(string date, string pattern) to convert the given date format to seconds since 1970-01-01, then use from_unixtime() to convert to the desired format:

hive> select from_unixtime(unix_timestamp('2017-sep-12', 'yyyy-MMM-dd'), 'yyyy-MM-dd');
OK
2017-09-12

Source: https://stackoverflow.com/questions/47301455/how-to-convert-date-2017-sep-12-to-2017-09-12-in-hive
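The same conversion applied to a column rather than a literal (the table and column names here are hypothetical). Note that the MMM month names are parsed according to the JVM locale:

SELECT from_unixtime(unix_timestamp(dt_str, 'yyyy-MMM-dd'), 'yyyy-MM-dd') AS dt
FROM my_dates;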

Find last day of a month in Hive

北城以北 submitted on 2019-11-29 13:45:54
My question: Is there a way to find the last day of a month in Hive, like the Oracle SQL function LAST_DAY(D_Dernier_Jour)? Thanks.

You could make use of the last_day(dateString) UDF provided by Nexr. It returns the last day of the month based on a date string in the yyyy-MM-dd HH:mm:ss pattern. Example:

SELECT last_day('2003-03-15 01:22:33') FROM src LIMIT 1;
2003-03-31 00:00:00

You need to pull it from their GitHub repository and build it; their wiki page contains all the info on how to build and use it with Hive. HTH.

As of Hive 1.1.0, a built-in last_day(string date) function is available. last_day…
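The last sentence is cut off; on versions where the built-in exists, usage is simply the following (it takes and returns yyyy-MM-dd date strings and ignores any time part):

-- Built-in since Hive 1.1.0:
SELECT last_day('2003-03-15');  -- returns 2003-03-31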

Hive: Does the order of the data records matter for joining tables?

喜夏-厌秋 submitted on 2019-11-29 12:09:55
I would like to know whether the order of the data records matters (performance-wise) when joining two tables. P.S. I am not using any map-side join or bucket join. Thank you!

On the one hand, order should not matter, because during a shuffle join the files are read by mappers in parallel; files may also be split across several mappers or, vice versa, one mapper can read several files, and the mappers' output is then passed to the reducers. So even if the data were ordered, it is read and distributed out of its order due to parallelism. On the other hand, ordering the data may improve compression, depending on the data…
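For contrast, and going beyond the asker's setup: when both tables are bucketed and sorted on the join key, Hive can choose a sort-merge-bucket (SMB) join, and there physical order does matter. A sketch of the standard switches and table layout:

-- Enable SMB join conversion:
SET hive.auto.convert.sortmerge.join = true;
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;

-- Both join sides must be bucketed and sorted on the join key:
CREATE TABLE t1 (k INT, v STRING)
CLUSTERED BY (k) SORTED BY (k) INTO 32 BUCKETS;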