window-functions

Count(Distinct [value]) OVER (Partition By) in SQL Server 2008

纵饮孤独 submitted on 2020-01-02 03:26:06

Question: I have written this and successfully executed it in Oracle:

    COUNT(DISTINCT APEC.COURSE_CODE) OVER (
        PARTITION BY s.REGISTRATION_NUMBER,
                     APEC.APE_ID,
                     COV.ACADEMIC_SESSION
    ) APE_COURSES_PER_ACADEMIC_YEAR

I'm trying to achieve the same result in SQL Server (our source database uses Oracle but our warehouse uses SQL Server). I know DISTINCT isn't supported with window functions in SQL Server 2008 - can anyone suggest an alternative?

Answer 1: Here's what I recently came across. I got it from this…
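The answer is truncated above, but a widely used workaround (my assumption, not necessarily the one the answer goes on to cite) is to add two DENSE_RANK passes over the same partition, one ascending and one descending: their sum minus one equals the number of distinct values. A minimal self-contained sketch, using a hypothetical table courses(student_id, course_code) in place of the question's joins:

    SELECT student_id,
           course_code,
           DENSE_RANK() OVER (PARTITION BY student_id ORDER BY course_code ASC)
         + DENSE_RANK() OVER (PARTITION BY student_id ORDER BY course_code DESC)
         - 1 AS distinct_courses   -- same value COUNT(DISTINCT course_code) OVER (...) would give
    FROM courses;

One caveat: if course_code can be NULL, the result is off by one for partitions containing NULLs, so those rows need to be filtered or handled separately.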

What is the difference between rowsBetween and rangeBetween?

别来无恙 submitted on 2020-01-01 03:58:14

Question: From the PySpark docs for rangeBetween:

    rangeBetween(start, end)

    Defines the frame boundaries, from start (inclusive) to end (inclusive).
    Both start and end are relative from the current row. For example, “0” means
    “current row”, while “-1” means one off before the current row, and “5” means
    the five off after the current row.

    Parameters:
        start – boundary start, inclusive. The frame is unbounded if this is
                -sys.maxsize (or lower).
        end – boundary end, inclusive. The frame is unbounded if this is…
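In short: rowsBetween counts physical rows on either side of the current row, while rangeBetween is measured in units of the ORDER BY value, so every row whose ordering value falls within the offsets is included, however many there are. The same distinction exists in standard SQL frames, which makes it easy to see side by side; a sketch, where the table t(v) holding the values 1, 1, 2 is made up:

    SELECT v,
           SUM(v) OVER (ORDER BY v ROWS  BETWEEN 1 PRECEDING AND CURRENT ROW) AS sum_rows,
           SUM(v) OVER (ORDER BY v RANGE BETWEEN 1 PRECEDING AND CURRENT ROW) AS sum_range
    FROM t;
    -- sum_rows:  1, 2, 3   (each frame holds at most the previous row plus the current one)
    -- sum_range: 2, 2, 4   (for v = 2 the frame is every row with v in [1, 2]: 1 + 1 + 2)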

How to use lag and rangeBetween functions on timestamp values?

孤者浪人 submitted on 2019-12-31 23:08:32

Question: I have data that looks like this:

    userid,eventtime,location_point
    4e191908,2017-06-04 03:00:00,18685891
    4e191908,2017-06-04 03:04:00,18685891
    3136afcb,2017-06-04 03:03:00,18382821
    661212dd,2017-06-04 03:06:00,80831484
    40e8a7c3,2017-06-04 03:12:00,18825769

I would like to add a new boolean column that marks true if there are 2 or more userid values within a 5-minute window in the same location_point. I had an idea of using the lag function to look up over a window partitioned by the userid and with the…
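The question is truncated above, but counting within a value-based frame may be simpler than lag here: order each location's events by epoch seconds and count rows in a trailing 300-second RANGE frame. A PostgreSQL-flavored sketch (the table name events and the column has_companion are made up, and this counts events rather than distinct users, since COUNT(DISTINCT ...) is not allowed in a window):

    SELECT userid,
           eventtime,
           location_point,
           COUNT(*) OVER (
               PARTITION BY location_point
               ORDER BY EXTRACT(EPOCH FROM eventtime)       -- numeric key, so RANGE is in seconds
               RANGE BETWEEN 300 PRECEDING AND CURRENT ROW  -- events in the last 5 minutes here
           ) >= 2 AS has_companion
    FROM events;

For a window that looks both ways, the frame becomes RANGE BETWEEN 300 PRECEDING AND 300 FOLLOWING.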

How to “reset” running SUM after it reaches a threshold?

狂风中的少年 submitted on 2019-12-30 07:22:33

Question: I wrote a query that creates two columns: the_day, and the amount_raised on that day. Here is what I have: … And I would like to add a column that has a running sum of amount_raised: … Ultimately, I would like the sum column to reset after it reaches 1 million. The recursive approach is above my pay grade, so if anyone knows a way to reset the sum without creating an entirely new table, please comment (maybe with a RESET function?). Thank you.

Answer 1: I'd like to thank Juan Carlos Oropeza for…
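The accepted answer is truncated above, but the underlying difficulty is real: a plain window SUM cannot carry the "reset" state from one row to the next, so the reset has to be computed row by row. A recursive-CTE sketch of the idea (the table name daily is made up; the 1,000,000 threshold comes from the question):

    WITH RECURSIVE seq AS (
        SELECT the_day, amount_raised,
               ROW_NUMBER() OVER (ORDER BY the_day) AS rn
        FROM daily
    ),
    running AS (
        SELECT rn, the_day, amount_raised,
               amount_raised AS running_sum
        FROM seq
        WHERE rn = 1
        UNION ALL
        SELECT s.rn, s.the_day, s.amount_raised,
               CASE WHEN r.running_sum >= 1000000
                    THEN s.amount_raised                   -- threshold reached: restart the sum
                    ELSE r.running_sum + s.amount_raised
               END
        FROM running r
        JOIN seq s ON s.rn = r.rn + 1
    )
    SELECT the_day, amount_raised, running_sum
    FROM running
    ORDER BY rn;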

pyspark: rolling average using timeseries data

烈酒焚心 submitted on 2019-12-29 03:15:27

Question: I have a dataset consisting of a timestamp column and a dollars column. I would like to find the average number of dollars per week ending at the timestamp of each row. I was initially looking at the pyspark.sql.functions.window function, but that bins the data by week. Here's an example:

    %pyspark
    import datetime
    from pyspark.sql import functions as F

    df1 = sc.parallelize([(17, "2017-03-11T15:27:18+00:00"),
                          (13, "2017-03-11T12:27:18+00:00"),
                          (21, "2017-03-17T11:27:18+00:00")]).toDF(["dollars", …
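What the asker wants is a trailing frame rather than fixed weekly bins; in window-frame terms, a RANGE of seven days ending at the current row (in pyspark that is a Window ordered by the timestamp cast to seconds, with rangeBetween(-7*86400, 0)). A plain-SQL sketch of the same frame, with a made-up table payments(ts, dollars):

    SELECT ts,
           dollars,
           AVG(dollars) OVER (
               ORDER BY EXTRACT(EPOCH FROM ts)                  -- numeric sort key in seconds
               RANGE BETWEEN 604800 PRECEDING AND CURRENT ROW   -- 604800 s = 7 days
           ) AS avg_trailing_week
    FROM payments;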

Oracle “Partition By” Keyword

安稳与你 submitted on 2019-12-28 01:38:30

Question: Can someone please explain what the PARTITION BY keyword does and give a simple example of it in action, as well as why one would want to use it? I have a SQL query written by someone else and I'm trying to figure out what it does. An example of PARTITION BY:

    SELECT empno, deptno, COUNT(*) OVER (PARTITION BY deptno) DEPT_COUNT
    FROM emp

The examples I've seen online seem a bit too in-depth.

Answer 1: The PARTITION BY clause sets the range of records that will be used for each "GROUP" within the…
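To make the contrast concrete, here is the same count written both ways against the standard emp demo table: GROUP BY collapses the rows, while PARTITION BY leaves every row in place and attaches the aggregate to it.

    -- GROUP BY: one row per department
    SELECT deptno, COUNT(*) AS dept_count
    FROM emp
    GROUP BY deptno;

    -- PARTITION BY: every employee row survives, each carrying its department's count
    SELECT empno, deptno,
           COUNT(*) OVER (PARTITION BY deptno) AS dept_count
    FROM emp;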

String Aggregation in ORACLE 10g with three columns

丶灬走出姿态 submitted on 2019-12-25 12:48:11

Question: This is a sample table:

    Date | Fruit | Number
    -----------------------
    1    | Apple | 1
    1    | Apple | 2
    1    | Apple | 3
    1    | Kiwi  | 6
    1    | Kiwi  | 10
    2    | Apple | 4
    2    | Apple | 5
    2    | Apple | 6
    2    | Kiwi  | 4
    2    | Kiwi  | 7

I am trying to concatenate the column values to get the following:

    Date | Fruit | Number
    -----------------------
    1    | Apple | 1-2-3
    1    | Kiwi  | 6-10
    2    | Apple | 4-5-6
    2    | Kiwi  | 4-7

Code that I use:

    SELECT fruit,
           LTRIM(MAX(SYS_CONNECT_BY_PATH(number, ','))
                 KEEP (DENSE_RANK LAST ORDER BY…
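The query above is truncated, but the usual complete 10g pattern (LISTAGG only arrives in 11gR2) numbers the rows per group and then walks them with CONNECT BY. A sketch, assuming a table named fruits with columns dte, fruit and num, since DATE and NUMBER are reserved words:

    SELECT dte, fruit,
           LTRIM(MAX(SYS_CONNECT_BY_PATH(num, '-'))
                 KEEP (DENSE_RANK LAST ORDER BY rn), '-') AS nums
    FROM (SELECT dte, fruit, num,
                 ROW_NUMBER() OVER (PARTITION BY dte, fruit ORDER BY num) AS rn
          FROM fruits)
    START WITH rn = 1
    CONNECT BY PRIOR rn = rn - 1
           AND PRIOR dte = dte
           AND PRIOR fruit = fruit
    GROUP BY dte, fruit
    ORDER BY dte, fruit;

The inner ROW_NUMBER keys each (dte, fruit) group; CONNECT BY chains those keys in order, SYS_CONNECT_BY_PATH builds the '-'-separated string, and KEEP (DENSE_RANK LAST) picks the complete (longest) path per group.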

Get column values from multiple rows as array

冷暖自知 submitted on 2019-12-25 07:48:26

Question: I am trying to fetch column values as an array in order to use them in the function array_agg_transfn() to calculate the median value as defined in the Postgres Wiki. I fetch the values of a particular column relative to the current row, for example the 13 rows below the current row. I tried using the following query:

    select a."Week_value",
           array_agg(a."Week_value") over (order by prod_name, week_date desc
                                           rows between 0 preceding and 12 following)
    from vin_temp_table

But got this error…
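The error text is truncated above, but one visible problem is that the query references the alias a without ever declaring it on vin_temp_table. A cleaned-up sketch (my assumption about the fix; CURRENT ROW also reads better than the equivalent 0 PRECEDING):

    SELECT a."Week_value",
           array_agg(a."Week_value") OVER (
               ORDER BY a.prod_name, a.week_date DESC
               ROWS BETWEEN CURRENT ROW AND 12 FOLLOWING   -- this row plus the next 12
           ) AS next_13_values
    FROM vin_temp_table a;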

Create partition based on the difference between subsequent row indices in SQL Server 2012

大兔子大兔子 submitted on 2019-12-25 07:48:26

Question: I am using SQL Server 2012. I want to create a row number based on whether the index in subsequent rows increases by 1 or by more than 1. For example, say I have a table that looks like:

    event   row_index
    1       24
    2       25
    3       26
    4       30
    5       31
    6       42
    7       43
    8       44
    9       45

Then what I want to do is create a column at the end, called seq_id:

    event   row_index   seq_id
    1       24          1
    2       25          1
    3       26          1
    4       30          2
    5       31          2
    6       42          3
    7       43          3
    8       44          3
    9       45          3

Basically, the seq_id only changes if the difference between subsequent row indexes is > 1. I have…
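This is the classic gaps-and-islands pattern, which SQL Server 2012's LAG supports directly: flag each row where the gap from the previous row_index exceeds 1, then take a running sum of the flags. A sketch, with the table name t assumed:

    WITH flagged AS (
        SELECT event, row_index,
               CASE WHEN row_index - LAG(row_index) OVER (ORDER BY row_index) > 1
                    THEN 1 ELSE 0 END AS new_group   -- 1 wherever a new island starts
        FROM t
    )
    SELECT event, row_index,
           1 + SUM(new_group) OVER (ORDER BY row_index
                                    ROWS UNBOUNDED PRECEDING) AS seq_id
    FROM flagged;

On the sample data the flags are 0,0,0,1,0,1,0,0,0, so the running sum plus one yields exactly the seq_id column shown (1,1,1,2,2,3,3,3,3).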

sumProduct in SQL

天涯浪子 submitted on 2019-12-25 05:48:07

Question: I'm trying to implement SUMPRODUCT (from Excel) against my table on the server.

    select * into #myTable2 from #myTable1

    select a, b, c, d, e,
           ( select (c * e)/100*3423)
             from #myTable1 t1
             inner join #myTable t2 on t1.b = t2.b
             where b like 'axr%'
           ) as sumProduct
    from #myTable1

but this doesn't quite work. Can't spot the error; maybe I'm just tired or missing it.

Edit: sample data and desired results (only the important columns are shown):

    c | e | b    | a          | sumProduct
    2 | 4 | axr1 | 2012.03.01 | 2*4 + 3*8
    3 | 8 | axr3 | 2012…
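Whatever the stray parenthesis in the subquery was meant to do, Excel's SUMPRODUCT over a group is simply the SUM of the row-wise products. A sketch matching the desired output, where every row of a given date a carries that date's 2*4 + 3*8 (the table name t is made up):

    SELECT a, b, c, e,
           SUM(c * e) OVER (PARTITION BY a) AS sumProduct  -- row-wise products, summed per date
    FROM t
    WHERE b LIKE 'axr%';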