window-functions

Count(Distinct [value]) OVER (Partition By) in SQL Server 2008

纵饮孤独 submitted on 2020-01-02 03:26:06

Question: I have written this and successfully executed it in Oracle:

    COUNT(DISTINCT APEC.COURSE_CODE) OVER (
        PARTITION BY s.REGISTRATION_NUMBER,
                     APEC.APE_ID,
                     COV.ACADEMIC_SESSION
    ) APE_COURSES_PER_ACADEMIC_YEAR

I'm trying to achieve the same result in SQL Server (our source database uses Oracle but our warehouse uses SQL Server). I know DISTINCT isn't supported with window functions in SQL Server 2008 - can anyone suggest an alternative?

Answer 1: Here's what I recently came across. I got it from this…
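The answer is truncated above, but a widely used workaround (my assumption, not necessarily the one the answer goes on to cite) is to add two DENSE_RANK passes over the same partition, one ascending and one descending: their sum minus one equals the number of distinct values. A minimal self-contained sketch, using a hypothetical table courses(student_id, course_code) in place of the question's joins:

    SELECT student_id,
           course_code,
           DENSE_RANK() OVER (PARTITION BY student_id ORDER BY course_code ASC)
         + DENSE_RANK() OVER (PARTITION BY student_id ORDER BY course_code DESC)
         - 1 AS distinct_courses   -- same value COUNT(DISTINCT course_code) OVER (...) would give
    FROM courses;

One caveat: if course_code can be NULL, the result is off by one for partitions containing NULLs, so those rows need to be filtered or handled separately.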

What is the difference between rowsBetween and rangeBetween?

别来无恙 submitted on 2020-01-01 03:58:14

Question: From the PySpark docs for rangeBetween:

    rangeBetween(start, end)

    Defines the frame boundaries, from start (inclusive) to end (inclusive).
    Both start and end are relative from the current row. For example, “0” means
    “current row”, while “-1” means one off before the current row, and “5” means
    the five off after the current row.

    Parameters:
        start – boundary start, inclusive. The frame is unbounded if this is
                -sys.maxsize (or lower).
        end – boundary end, inclusive. The frame is unbounded if this is…
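In short: rowsBetween counts physical rows on either side of the current row, while rangeBetween is measured in units of the ORDER BY value, so every row whose ordering value falls within the offsets is included, however many there are. The same distinction exists in standard SQL frames, which makes it easy to see side by side; a sketch, where the table t(v) holding the values 1, 1, 2 is made up:

    SELECT v,
           SUM(v) OVER (ORDER BY v ROWS  BETWEEN 1 PRECEDING AND CURRENT ROW) AS sum_rows,
           SUM(v) OVER (ORDER BY v RANGE BETWEEN 1 PRECEDING AND CURRENT ROW) AS sum_range
    FROM t;
    -- sum_rows:  1, 2, 3   (each frame holds at most the previous row plus the current one)
    -- sum_range: 2, 2, 4   (for v = 2 the frame is every row with v in [1, 2]: 1 + 1 + 2)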

How to use lag and rangeBetween functions on timestamp values?

孤者浪人 submitted on 2019-12-31 23:08:32

Question: I have data that looks like this:

    userid,eventtime,location_point
    4e191908,2017-06-04 03:00:00,18685891
    4e191908,2017-06-04 03:04:00,18685891
    3136afcb,2017-06-04 03:03:00,18382821
    661212dd,2017-06-04 03:06:00,80831484
    40e8a7c3,2017-06-04 03:12:00,18825769

I would like to add a new boolean column that marks true if there are 2 or more userid values within a 5-minute window in the same location_point. I had an idea of using the lag function to look up over a window partitioned by the userid and with the…
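The question is truncated above, but counting within a value-based frame may be simpler than lag here: order each location's events by epoch seconds and count rows in a trailing 300-second RANGE frame. A PostgreSQL-flavored sketch (the table name events and the column has_companion are made up, and this counts events rather than distinct users, since COUNT(DISTINCT ...) is not allowed in a window):

    SELECT userid,
           eventtime,
           location_point,
           COUNT(*) OVER (
               PARTITION BY location_point
               ORDER BY EXTRACT(EPOCH FROM eventtime)       -- numeric key, so RANGE is in seconds
               RANGE BETWEEN 300 PRECEDING AND CURRENT ROW  -- events in the last 5 minutes here
           ) >= 2 AS has_companion
    FROM events;

For a window that looks both ways, the frame becomes RANGE BETWEEN 300 PRECEDING AND 300 FOLLOWING.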

How to “reset” running SUM after it reaches a threshold?

狂风中的少年 submitted on 2019-12-30 07:22:33

Question: I wrote a query that creates two columns: the_day, and the amount_raised on that day. Here is what I have: … And I would like to add a column that has a running sum of amount_raised: … Ultimately, I would like the sum column to reset after it reaches 1 million. The recursive approach is above my pay grade, so if anyone knows a way to reset the sum without creating an entirely new table, please comment (maybe with a RESET function?). Thank you.

Answer 1: I'd like to thank Juan Carlos Oropeza for…
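The accepted answer is truncated above, but the underlying difficulty is real: a plain window SUM cannot carry the "reset" state from one row to the next, so the reset has to be computed row by row. A recursive-CTE sketch of the idea (the table name daily is made up; the 1,000,000 threshold comes from the question):

    WITH RECURSIVE seq AS (
        SELECT the_day, amount_raised,
               ROW_NUMBER() OVER (ORDER BY the_day) AS rn
        FROM daily
    ),
    running AS (
        SELECT rn, the_day, amount_raised,
               amount_raised AS running_sum
        FROM seq
        WHERE rn = 1
        UNION ALL
        SELECT s.rn, s.the_day, s.amount_raised,
               CASE WHEN r.running_sum >= 1000000
                    THEN s.amount_raised                   -- threshold reached: restart the sum
                    ELSE r.running_sum + s.amount_raised
               END
        FROM running r
        JOIN seq s ON s.rn = r.rn + 1
    )
    SELECT the_day, amount_raised, running_sum
    FROM running
    ORDER BY rn;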

pyspark: rolling average using timeseries data

烈酒焚心 submitted on 2019-12-29 03:15:27

Question: I have a dataset consisting of a timestamp column and a dollars column. I would like to find the average number of dollars per week ending at the timestamp of each row. I was initially looking at the pyspark.sql.functions.window function, but that bins the data by week. Here's an example:

    %pyspark
    import datetime
    from pyspark.sql import functions as F

    df1 = sc.parallelize([(17, "2017-03-11T15:27:18+00:00"),
                          (13, "2017-03-11T12:27:18+00:00"),
                          (21, "2017-03-17T11:27:18+00:00")]).toDF(["dollars", …
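What the asker wants is a trailing frame rather than fixed weekly bins; in window-frame terms, a RANGE of seven days ending at the current row (in pyspark that is a Window ordered by the timestamp cast to seconds, with rangeBetween(-7*86400, 0)). A plain-SQL sketch of the same frame, with a made-up table payments(ts, dollars):

    SELECT ts,
           dollars,
           AVG(dollars) OVER (
               ORDER BY EXTRACT(EPOCH FROM ts)                  -- numeric sort key in seconds
               RANGE BETWEEN 604800 PRECEDING AND CURRENT ROW   -- 604800 s = 7 days
           ) AS avg_trailing_week
    FROM payments;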

Oracle “Partition By” Keyword

安稳与你 submitted on 2019-12-28 01:38:30

Question: Can someone please explain what the PARTITION BY keyword does and give a simple example of it in action, as well as why one would want to use it? I have a SQL query written by someone else and I'm trying to figure out what it does. An example of PARTITION BY:

    SELECT empno, deptno, COUNT(*) OVER (PARTITION BY deptno) DEPT_COUNT
    FROM emp

The examples I've seen online seem a bit too in-depth.

Answer 1: The PARTITION BY clause sets the range of records that will be used for each "GROUP" within the…
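To make the contrast concrete, here is the same count written both ways against the standard emp demo table: GROUP BY collapses the rows, while PARTITION BY leaves every row in place and attaches the aggregate to it.

    -- GROUP BY: one row per department
    SELECT deptno, COUNT(*) AS dept_count
    FROM emp
    GROUP BY deptno;

    -- PARTITION BY: every employee row survives, each carrying its department's count
    SELECT empno, deptno,
           COUNT(*) OVER (PARTITION BY deptno) AS dept_count
    FROM emp;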

String Aggregation in ORACLE 10g with three columns

丶灬走出姿态 submitted on 2019-12-25 12:48:11

Question: This is a sample table:

    Date | Fruit | Number
    -----------------------
    1    | Apple | 1
    1    | Apple | 2
    1    | Apple | 3
    1    | Kiwi  | 6
    1    | Kiwi  | 10
    2    | Apple | 4
    2    | Apple | 5
    2    | Apple | 6
    2    | Kiwi  | 4
    2    | Kiwi  | 7

I am trying to concatenate the column values to get the following:

    Date | Fruit | Number
    -----------------------
    1    | Apple | 1-2-3
    1    | Kiwi  | 6-10
    2    | Apple | 4-5-6
    2    | Kiwi  | 4-7

Code that I use:

    SELECT fruit,
           LTRIM(MAX(SYS_CONNECT_BY_PATH(number, ','))
                 KEEP (DENSE_RANK LAST ORDER BY…
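The query above is truncated, but the usual complete 10g pattern (LISTAGG only arrives in 11gR2) numbers the rows per group and then walks them with CONNECT BY. A sketch, assuming a table named fruits with columns dte, fruit and num, since DATE and NUMBER are reserved words:

    SELECT dte, fruit,
           LTRIM(MAX(SYS_CONNECT_BY_PATH(num, '-'))
                 KEEP (DENSE_RANK LAST ORDER BY rn), '-') AS nums
    FROM (SELECT dte, fruit, num,
                 ROW_NUMBER() OVER (PARTITION BY dte, fruit ORDER BY num) AS rn
          FROM fruits)
    START WITH rn = 1
    CONNECT BY PRIOR rn = rn - 1
           AND PRIOR dte = dte
           AND PRIOR fruit = fruit
    GROUP BY dte, fruit
    ORDER BY dte, fruit;

The inner ROW_NUMBER keys each (dte, fruit) group; CONNECT BY chains those keys in order, SYS_CONNECT_BY_PATH builds the '-'-separated string, and KEEP (DENSE_RANK LAST) picks the complete (longest) path per group.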

Get column values from multiple rows as array

冷暖自知 submitted on 2019-12-25 07:48:26

Question: I am trying to fetch column values as an array in order to use them in the function array_agg_transfn() to calculate the median value as defined in the Postgres Wiki. I fetch the values of a particular column relative to the current row, for example the 13 rows below the current row. I tried using the following query:

    select a."Week_value",
           array_agg(a."Week_value") over (order by prod_name, week_date desc
                                           rows between 0 preceding and 12 following)
    from vin_temp_table

But got this error…
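The error text is truncated above, but one visible problem is that the query references the alias a without ever declaring it on vin_temp_table. A cleaned-up sketch (my assumption about the fix; CURRENT ROW also reads better than the equivalent 0 PRECEDING):

    SELECT a."Week_value",
           array_agg(a."Week_value") OVER (
               ORDER BY a.prod_name, a.week_date DESC
               ROWS BETWEEN CURRENT ROW AND 12 FOLLOWING   -- this row plus the next 12
           ) AS next_13_values
    FROM vin_temp_table a;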

Create partition based on the difference between subsequent row indices in SQL Server 2012

大兔子大兔子 submitted on 2019-12-25 07:48:26

Question: I am using SQL Server 2012. I want to create a row number based on whether the index in subsequent rows increases by 1 or by more than 1. For example, say I have a table that looks like:

    event   row_index
    1       24
    2       25
    3       26
    4       30
    5       31
    6       42
    7       43
    8       44
    9       45

Then what I want to do is create a column at the end, called seq_id:

    event   row_index   seq_id
    1       24          1
    2       25          1
    3       26          1
    4       30          2
    5       31          2
    6       42          3
    7       43          3
    8       44          3
    9       45          3

Basically, the seq_id only changes if the difference between subsequent row indexes is > 1. I have…
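This is the classic gaps-and-islands pattern, which SQL Server 2012's LAG supports directly: flag each row where the gap from the previous row_index exceeds 1, then take a running sum of the flags. A sketch, with the table name t assumed:

    WITH flagged AS (
        SELECT event, row_index,
               CASE WHEN row_index - LAG(row_index) OVER (ORDER BY row_index) > 1
                    THEN 1 ELSE 0 END AS new_group   -- 1 wherever a new island starts
        FROM t
    )
    SELECT event, row_index,
           1 + SUM(new_group) OVER (ORDER BY row_index
                                    ROWS UNBOUNDED PRECEDING) AS seq_id
    FROM flagged;

On the sample data the flags are 0,0,0,1,0,1,0,0,0, so the running sum plus one yields exactly the seq_id column shown (1,1,1,2,2,3,3,3,3).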

sumProduct in SQL

天涯浪子 submitted on 2019-12-25 05:48:07

Question: I'm trying to implement SUMPRODUCT (from Excel) against my table on the server.

    select * into #myTable2 from #myTable1

    select a, b, c, d, e,
           ( select (c * e)/100*3423)
             from #myTable1 t1
             inner join #myTable t2 on t1.b = t2.b
             where b like 'axr%'
           ) as sumProduct
    from #myTable1

but this doesn't quite work. Can't spot the error; maybe I'm just tired or missing it.

Edit: sample data and desired results (only the important columns are shown):

    c | e | b    | a          | sumProduct
    2 | 4 | axr1 | 2012.03.01 | 2*4 + 3*8
    3 | 8 | axr3 | 2012…
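Whatever the stray parenthesis in the subquery was meant to do, Excel's SUMPRODUCT over a group is simply the SUM of the row-wise products. A sketch matching the desired output, where every row of a given date a carries that date's 2*4 + 3*8 (the table name t is made up):

    SELECT a, b, c, e,
           SUM(c * e) OVER (PARTITION BY a) AS sumProduct  -- row-wise products, summed per date
    FROM t
    WHERE b LIKE 'axr%';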