window-functions

Hive Window Function ROW_NUMBER without Partition BY Clause on a large (50 GB) dataset is very slow. Is there a better way to optimize?

£可爱£侵袭症+ submitted on 2021-01-04 07:24:06
Question: I have an HDFS file with 50 million records; the raw file size is 50 GB. I am trying to load it into a Hive table and create a unique ID for every row while loading, using the expression below. I am using Hive 1.1.0-cdh5.16.1.
row_number() over(order by event_id, user_id, timestamp) as id
While executing, I see that 40 reducers are assigned in the reduce step. The average time for 39 of the reducers is about 2 minutes, whereas the last reducer takes about 25 minutes, which clearly makes me believe that most of the data is
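For readers wondering why a single reducer dominates: a window with ORDER BY but no PARTITION BY has to be evaluated over one global partition. A direction that is often suggested for this kind of problem, sketched here under assumptions and not taken from the truncated question, is to number rows inside smaller buckets and then shift each bucket by the row count of the buckets before it. The sketch uses Python's built-in sqlite3 as a stand-in for Hive; the table name `events`, the column `ts`, the modulo bucketing and the bucket count are all hypothetical. Note that the resulting IDs are unique and gap-free but do not follow the global `event_id, user_id, timestamp` order.

```python
import sqlite3


def assign_ids(rows, buckets=4):
    """Give every row a unique id using per-bucket ROW_NUMBER plus an offset.

    rows: iterable of (event_id, user_id, ts) tuples with non-negative event_id.
    buckets: hypothetical number of buckets (an assumption for this sketch).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (event_id INTEGER, user_id INTEGER, ts TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    query = """
    WITH bucketed AS (
        -- Spread rows over buckets so each one can be numbered independently.
        SELECT event_id, user_id, ts, event_id % :buckets AS bucket
        FROM events
    ),
    counts AS (
        SELECT bucket, COUNT(*) AS cnt FROM bucketed GROUP BY bucket
    ),
    offsets AS (
        -- start_at = number of rows in all lower-numbered buckets.
        SELECT bucket,
               COALESCE(SUM(cnt) OVER (
                   ORDER BY bucket
                   ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) AS start_at
        FROM counts
    )
    SELECT o.start_at + ROW_NUMBER() OVER (
               PARTITION BY b.bucket
               ORDER BY b.event_id, b.user_id, b.ts) AS id,
           b.event_id, b.user_id, b.ts
    FROM bucketed b
    JOIN offsets o ON o.bucket = b.bucket
    ORDER BY id
    """
    result = conn.execute(query, {"buckets": buckets}).fetchall()
    conn.close()
    return result
```

The window over `counts` still has no partition, but it only sees one row per bucket, so it stays small regardless of the data volume.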

How to make LAG() ignore NULLS in SQL Server?

醉酒当歌 submitted on 2021-01-04 02:54:46
Question: Does anyone know how to replace the nulls in a column with the most recent string above them, so that each string fills all null values below it until a new string appears? I have a column that looks like this.
Original Column (PAST_DUE_COL):
91 or more days past due
Null
Null
61-90 days past due
Null
Null
31-60 days past due
Null
0-30 days past due
Null
Null
Null
Expected Result Column (PAST_DUE_COL):
91 or more days past due
91 or more days past due
91 or more days past due
61-90 days past due
61-90 days past due
61-90 days
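A minimal sketch of the usual "fill down" pattern, assuming the table has some column that defines row order (the excerpt does not show one, so `id` here is hypothetical). A running `COUNT` of non-null values gives every string and the nulls below it the same group number, and `MAX` within that group picks up the one non-null value. The sketch runs on Python's built-in sqlite3 as a stand-in; the same two window expressions are generally expected to work on SQL Server versions that support `ORDER BY` in aggregate window functions, but that is not verified here.

```python
import sqlite3


def fill_down(values):
    """Replace each None with the closest non-None value above it."""
    conn = sqlite3.connect(":memory:")
    # `id` is a hypothetical ordering column; table and column names are illustrative.
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, past_due_col TEXT)")
    conn.executemany(
        "INSERT INTO t (id, past_due_col) VALUES (?, ?)",
        list(enumerate(values, start=1)),
    )
    query = """
    WITH grouped AS (
        -- COUNT(col) skips NULLs, so the counter only advances on a new string.
        SELECT id, past_due_col,
               COUNT(past_due_col) OVER (ORDER BY id) AS grp
        FROM t
    )
    SELECT id,
           MAX(past_due_col) OVER (PARTITION BY grp) AS filled
    FROM grouped
    ORDER BY id
    """
    filled = [row[1] for row in conn.execute(query)]
    conn.close()
    return filled
```

Rows that come before the first non-null value stay null, because their group contains no string to copy.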

Select top rows until value in specific column has appeared twice

我的梦境 submitted on 2020-12-23 17:08:07
Question: I have the following query, in which I am trying to select all records, ordered by date, until EmailApproved = 1 is found for the second time. The second record where EmailApproved = 1 should not be selected.
declare @Test table (id int, EmailApproved bit, Created datetime)
insert into @Test (id, EmailApproved, Created) values
(1,0,'2011-03-07 03:58:58.423')
, (2,0,'2011-02-21 04:55:52.103')
, (3,0,'2011-01-29 13:24:02.103')
, (4,1,'2010-10-12 14:41:54.217')
, (5,0,'2010-10-12 14:34:15.903')
, (6,0,
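One way to express "stop before the second approved row" is a running total of `EmailApproved` in date order, keeping only rows where that total is still below 2. The sketch below is an illustration under assumptions, not the asker's query: it assumes newest-first ordering (the excerpt lists rows that way but does not state it), it uses Python's built-in sqlite3 in place of SQL Server, and the table name `test` is hypothetical.

```python
import sqlite3


def rows_before_second_approval(rows):
    """Return ids, newest first, up to but excluding the second approved row.

    rows: iterable of (id, EmailApproved, Created) with ISO-formatted timestamps.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test (id INTEGER, EmailApproved INTEGER, Created TEXT)")
    conn.executemany("INSERT INTO test VALUES (?, ?, ?)", rows)
    query = """
    WITH running AS (
        -- approved_so_far counts approved rows seen so far, current row included.
        SELECT id, Created,
               SUM(EmailApproved) OVER (
                   ORDER BY Created DESC
                   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS approved_so_far
        FROM test
    )
    SELECT id
    FROM running
    WHERE approved_so_far < 2
    ORDER BY Created DESC
    """
    ids = [row[0] for row in conn.execute(query)]
    conn.close()
    return ids
```

Because the running total never decreases, the filter drops the second approved row and everything after it. On SQL Server the `bit` column would likely need a cast to an integer type before summing.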