window-functions

Counting null values between dates

拈花ヽ惹草 submitted on 2021-01-29 03:37:35
Question: I'm trying to calculate the number of null values between dates. My table looks like this:

| transaction_date | transaction_sale |
|------------------|------------------|
| 10/1/2018        | NULL             |
| 11/1/2018        | 33               |
| 12/1/2018        | NULL             |
| 1/1/2019         | NULL             |
| 2/1/2019         | NULL             |
| 3/1/2019         | 2                |
| 4/1/2019         | NULL             |
| 5/1/2019         | NULL             |
| 6/1/2019         | 10               |

I'm looking to get the following output:

| transaction_date | transaction_sale | count |
|------------------|------------------|-------|
| 10/1/2018        | NULL             | NULL  |
| 11/1/2018        | 33               | 1     |
| 12/1/2018        | NULL             | NULL  |
| 1/1/2019         | NULL             | NULL  |
| 2/1/2019         | NULL             | NULL  |
| 3/1/2019         | 2                | 3     |
| 4/1/2019         | NULL             | NULL  |
| 5/1/2019         | NULL             | NULL  |
| 6/1/2019         | 10               | 2     |

Answer 1: count(
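
One way to produce that count is to group each non-NULL sale with the NULL rows that come just before it. Below is a minimal sketch, assuming a database with window functions (PostgreSQL or MySQL 8+), an assumed table name transactions, and that transaction_date is a real DATE column; the output column is aliased null_count because COUNT is a keyword.

```sql
-- grp counts the non-NULL sales from the current row to the last date, so a
-- non-NULL sale lands in the same group as the NULL rows immediately before
-- it; the group size minus the sale row itself is the number of NULLs.
SELECT
    transaction_date,
    transaction_sale,
    CASE WHEN transaction_sale IS NOT NULL
         THEN COUNT(*) OVER (PARTITION BY grp) - 1
    END AS null_count
FROM (
    SELECT t.*,
           COUNT(transaction_sale)
               OVER (ORDER BY transaction_date DESC) AS grp
    FROM transactions t
) x
ORDER BY transaction_date;
```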

MySQL window function with CASE

三世轮回 submitted on 2021-01-28 11:36:45
Question: I'm trying to perform a window function with a built-in case. Here's an example which should make it clearer.

Original table:

SELECT trade_date, ticker, trans_type, quantity
FROM orders
WHERE trade_date >= '2020-11-16';

Results:

|trade_date|ticker|trans_type|quantity|
|:---------|:-----|:---------|-------:|
|2020-12-10|FB    |BUY       |     100|
|2020-12-28|FB    |BUY       |      50|
|2020-12-29|FB    |SELL      |      80|
|2020-12-30|FB    |SELL      |      30|
|2020-12-31|FB    |BUY       |      40|
|2020-11-16|AAPL  |BUY       |      30|
|2020-11-17|AAPL  |SELL
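
The question is cut off, but a typical "window function with a case" over this table is a running net position per ticker. A minimal sketch under that assumption, using MySQL 8+ and the orders table and columns from the question:

```sql
-- Running net position per ticker: BUY adds the quantity, SELL subtracts it,
-- accumulated in trade_date order within each ticker.
SELECT
    trade_date,
    ticker,
    trans_type,
    quantity,
    SUM(CASE WHEN trans_type = 'BUY'  THEN  quantity
             WHEN trans_type = 'SELL' THEN -quantity END)
        OVER (PARTITION BY ticker ORDER BY trade_date) AS net_position
FROM orders
WHERE trade_date >= '2020-11-16';
```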

How to add columns in pyspark dataframe dynamically

倾然丶 夕夏残阳落幕 submitted on 2021-01-28 10:57:15
Question: I am trying to add a few columns based on the input variable vIssueCols:

from pyspark.sql import HiveContext
from pyspark.sql import functions as F
from pyspark.sql.window import Window

vIssueCols = ['jobid', 'locid']
vQuery1 = 'vSrcData2= vSrcData'
vWindow1 = Window.partitionBy("vKey").orderBy("vOrderBy")
for x in vIssueCols:
    vQuery1 = vQuery1 + '.withColumn("' + x + '_prev", F.lag(vSrcData.' + x + ').over(vWindow1))'
exec(vQuery1)

Now the above will generate vQuery1 as below, and it is working, but vSrcData2=
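
Building the statement as a string and running it through exec() is not needed; the same result can be had by reassigning the DataFrame in a plain loop. A minimal sketch, assuming Spark 2+; vSrcData stands in for the question's source DataFrame and is built here from toy rows so the example runs on its own:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the question's vSrcData (column values are assumptions).
vSrcData = spark.createDataFrame(
    [("k1", 1, 10, 100), ("k1", 2, 11, 101), ("k2", 1, 12, 102)],
    ["vKey", "vOrderBy", "jobid", "locid"],
)

vIssueCols = ["jobid", "locid"]
vWindow1 = Window.partitionBy("vKey").orderBy("vOrderBy")

# Add one "<col>_prev" column per entry in vIssueCols, each holding the
# previous row's value within the (vKey, vOrderBy) window.
vSrcData2 = vSrcData
for col_name in vIssueCols:
    vSrcData2 = vSrcData2.withColumn(
        col_name + "_prev", F.lag(F.col(col_name)).over(vWindow1)
    )

vSrcData2.show()
```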

MySQL Group from quarters to periods

半腔热情 submitted on 2021-01-28 06:02:22
Question: I have a table like this:

| Person (smallint(5)) | act_time (datetime) |
|----------------------|---------------------|
| 1                    | 2020-05-29 07:00:00 |
| 1                    | 2020-05-29 07:15:00 |
| 1                    | 2020-05-29 07:30:00 |
| 2                    | 2020-05-29 07:15:00 |
| 2                    | 2020-05-29 07:30:00 |
| 1                    | 2020-05-29 10:30:00 |
| 1                    | 2020-05-29 10:45:00 |

The table above is an example with 2 different persons, and there is a row for each quarter of an hour they are at work. What is the best way in MySQL to "convert" this table to another table with a column for "person", a column for "start" and one for "stop"? So the result is
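
This is a classic gaps-and-islands problem. A minimal sketch, assuming MySQL 8+ and an assumed table name attendance; it also assumes "stop" should be the end of the last quarter hour (last start plus 15 minutes), which the cut-off question does not confirm:

```sql
-- Rows exactly 15 minutes after the previous row for the same person continue
-- the current period; any other gap starts a new island.
SELECT person,
       MIN(act_time) AS `start`,
       MAX(act_time) + INTERVAL 15 MINUTE AS `stop`
FROM (
    SELECT person, act_time,
           SUM(CASE WHEN prev_time = act_time - INTERVAL 15 MINUTE
                    THEN 0 ELSE 1 END)
               OVER (PARTITION BY person ORDER BY act_time) AS island
    FROM (
        SELECT person, act_time,
               LAG(act_time) OVER (PARTITION BY person ORDER BY act_time) AS prev_time
        FROM attendance
    ) t
) g
GROUP BY person, island
ORDER BY person, `start`;
```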

PostgreSQL: Identifying return visitors based on date - joins or window functions?

纵然是瞬间 submitted on 2021-01-28 05:11:46
Question: I am looking to identify return visitors to a website within a 7-day window. A data sample and an attempt at solving are included below.

Columns: visitor_id (integer), session_id (integer), event_sequence (integer), d_date (date)

Sample raw data:

+-----------+------------+----------------+------------+
| visitor_id| session_id | event_sequence | d_date     |
+-----------+------------+----------------+------------+
| 1         | 1          | 1              | 2017-01-01 |
| 1         | 1          | 2              | 2017-01-01 |
| 1         | 1          | 3              | 2017-01-01 |
| 1         | 2          | 1              |
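
A window-function approach is to reduce events to one row per session, then compare each session's date with the visitor's previous session using LAG. A minimal sketch, assuming PostgreSQL, an assumed table name events, and that a "return visit" means a session starting 1 to 7 days after the visitor's previous session (the question is cut off, so the exact definition may differ):

```sql
WITH sessions AS (
    -- One row per visitor/session, using the earliest date in the session.
    SELECT visitor_id, session_id, MIN(d_date) AS session_date
    FROM events
    GROUP BY visitor_id, session_id
)
SELECT visitor_id,
       session_id,
       session_date,
       -- Days since the visitor's previous session; in PostgreSQL,
       -- date - date yields an integer number of days.
       COALESCE(
           (session_date
            - LAG(session_date) OVER (PARTITION BY visitor_id
                                      ORDER BY session_date, session_id)
           ) BETWEEN 1 AND 7,
           false
       ) AS is_return_visit
FROM sessions
ORDER BY visitor_id, session_date, session_id;
```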

Fill column with last value from partition in PostgreSQL

喜你入骨 submitted on 2021-01-28 00:26:47
Question: I'm trying to return the last value of a partition and apply it to the rest of the column. For example, if I have the below...

| ID | Date     | Status |
|----|----------|--------|
| 1  | 20150101 |        |
| 1  | 20150201 |        |
| 1  | 20150301 |        |
| 1  | 20150401 | void   |
| 2  | 20150101 |        |
| 2  | 20150201 |        |
| 2  | 20150301 |        |

I want to return this:

| ID | Date     | Status |
|----|----------|--------|
| 1  | 20150101 | void   |
| 1  | 20150201 | void   |
| 1  | 20150301 | void   |
| 1  | 20150401 | void   |
| 2  | 20150101 |        |
| 2  | 20150201 |        |
| 2  | 20150301 |        |

I've been playing around with the below and similar to no avail.

select ID, date, case when status is null then last_value(status ignore
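
PostgreSQL's window functions do not accept the IGNORE NULLS modifier the snippet is reaching for. Since the only non-NULL status in each partition sits on the latest row, the status of the last row per ID gives the same answer. A minimal sketch, assuming PostgreSQL and an assumed table name t:

```sql
-- FIRST_VALUE over a descending date order is the status of the partition's
-- latest row, repeated on every row of that ID.
SELECT id,
       date,
       FIRST_VALUE(status) OVER (PARTITION BY id ORDER BY date DESC) AS status
FROM t
ORDER BY id, date;
```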

SELECT fixed number of rows by evenly skipping rows

 ̄綄美尐妖づ submitted on 2021-01-27 18:41:06
Question: I am trying to write a query which returns an arbitrarily sized representative sample of data. I would like to do this by selecting only every nth row, where n is chosen so that the entire result set is as close as possible to an arbitrary size. I want this to work in cases where the result set would normally be smaller than that size; in such a case, the entire result set should be returned. I found this question which shows how to select every nth row. Here is what I have so far:

SELECT * FROM (
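
One way to pick the step size automatically is to count the rows in the same pass with COUNT(*) OVER (). A minimal sketch, assuming PostgreSQL, a hypothetical table named samples with a sortable id column, and a target of roughly 100 rows:

```sql
-- The step is about total/100; the modulo filter keeps every step-th row.
-- With 100 rows or fewer the step is clamped to 1, so every row is returned.
SELECT *
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (ORDER BY id) AS rn,
           COUNT(*)     OVER ()            AS total
    FROM samples s
) t
WHERE rn % GREATEST(1, ROUND(total / 100.0)::int) = 0
ORDER BY id;
```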

How to pivot row data into specific columns db2

空扰寡人 submitted on 2021-01-07 06:24:01
Question: I would like to pivot results from a table into a new structure, so that it maps all the children to the parent product.

Current result:

Parent_Prod_Num|Child_Prod_Num|Child_Prod_Code|Child_Prod_Name
1|11|a123|a
1|12|b123|ab
1|13|c123|abc

Expected result:

Parent_Prod_Num|Child_Prod_Num_1|Child_Prod_Code_1|Child_Prod_Name_1|Child_Prod_Num_2|Child_Prod_Code_2|Child_Prod_Name_2|Child_Prod_Num_3|Child_Prod_Code_3|Child_Prod_Name_3
1|11|a123|a|12|b123|ab|13|c123|abc

Answer 1: For a fixed maximum
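
The cut-off answer points at the usual fixed-maximum pivot: number the children per parent with ROW_NUMBER, then fold each child number into its own set of columns with MAX(CASE ...). A minimal sketch, assuming at most three children per parent and an assumed table name child_products (the real table name is not given):

```sql
SELECT parent_prod_num,
       MAX(CASE WHEN rn = 1 THEN child_prod_num  END) AS child_prod_num_1,
       MAX(CASE WHEN rn = 1 THEN child_prod_code END) AS child_prod_code_1,
       MAX(CASE WHEN rn = 1 THEN child_prod_name END) AS child_prod_name_1,
       MAX(CASE WHEN rn = 2 THEN child_prod_num  END) AS child_prod_num_2,
       MAX(CASE WHEN rn = 2 THEN child_prod_code END) AS child_prod_code_2,
       MAX(CASE WHEN rn = 2 THEN child_prod_name END) AS child_prod_name_2,
       MAX(CASE WHEN rn = 3 THEN child_prod_num  END) AS child_prod_num_3,
       MAX(CASE WHEN rn = 3 THEN child_prod_code END) AS child_prod_code_3,
       MAX(CASE WHEN rn = 3 THEN child_prod_name END) AS child_prod_name_3
FROM (
    -- Number the children within each parent so each maps to one column set.
    SELECT p.*,
           ROW_NUMBER() OVER (PARTITION BY parent_prod_num
                              ORDER BY child_prod_num) AS rn
    FROM child_products p
) t
GROUP BY parent_prod_num;
```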

Hive Window Function ROW_NUMBER without Partition BY Clause on a large (50 GB) dataset is very slow. Is there a better way to optimize?

可紊 submitted on 2021-01-04 07:25:26
Question: I have an HDFS file with 50 million records and a raw file size of 50 GB. I am trying to load it into a Hive table and create a unique id for all rows while loading, using the below. I am using Hive 1.1.0-cdh5.16.1.

row_number() over(order by event_id, user_id, timestamp) as id

While executing, I see that 40 reducers are assigned in the reduce step. The average time for 39 of the reducers is about 2 minutes, whereas the last reducer takes about 25 minutes, which clearly makes me believe that most of the data is
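
A global ORDER BY inside the window forces every row through a single reducer. One common workaround, sketched below under stated assumptions (Hive, an assumed source table named events, and that the id only needs to be unique and gap-free, not strictly aligned with a single global sort), is to number rows inside hash buckets and add each bucket's starting offset:

```sql
WITH bucketed AS (
    -- Spread rows over 64 hash buckets and number them inside each bucket,
    -- so no single reducer has to sort all 50M rows.
    SELECT t.*,
           row_number() OVER (PARTITION BY bucket
                              ORDER BY event_id, user_id, `timestamp`) AS rn
    FROM (
        SELECT e.*,
               pmod(hash(event_id, user_id, `timestamp`), 64) AS bucket
        FROM events e
    ) t
),
bucket_sizes AS (
    SELECT bucket, count(*) AS cnt
    FROM bucketed
    GROUP BY bucket
),
offsets AS (
    -- Running total of the preceding buckets' sizes = starting offset.
    SELECT bucket,
           coalesce(sum(cnt) OVER (ORDER BY bucket
                                   ROWS BETWEEN UNBOUNDED PRECEDING
                                            AND 1 PRECEDING), 0) AS start_offset
    FROM bucket_sizes
)
SELECT b.*,
       b.rn + o.start_offset AS id
FROM bucketed b
JOIN offsets o
  ON b.bucket = o.bucket;
```

The resulting ids are unique and contiguous, but their numeric order only follows the (event_id, user_id, timestamp) sort within each bucket, which is why the "unique only" assumption matters.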