window-functions

How to apply lag function on streaming dataframe?

Submitted by 二次信任 on 2019-12-10 19:36:11
Question: I have a streaming dataframe with three columns: time, col1, col2. I have to apply the lag function to col2. I have tried the following query:

    val w = org.apache.spark.sql.expressions.Window.orderBy("time")
    df.select(col("time"), col("col1"), lag("col2", 1).over(w))

But it gives the following exception:

    org.apache.spark.sql.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets

How can I achieve this? Thanks in advance.

Source: https://stackoverflow.com/questions/46036845
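The error is the engine's way of saying that Structured Streaming only supports time-based windows, not row-ordered analytic windows such as lag. For contrast, a short Spark SQL sketch (assuming the stream is registered as a temporary view named events; the 10-minute bucket is only illustrative):

    -- A time-based window: this form of windowing is supported on a stream
    SELECT window(time, '10 minutes') AS time_window,
           avg(col2)                  AS avg_col2
    FROM   events
    GROUP  BY window(time, '10 minutes');

    -- A row-ordered analytic window: works on a static DataFrame, but on a
    -- stream it raises the AnalysisException quoted above
    SELECT time, col1,
           lag(col2, 1) OVER (ORDER BY time) AS prev_col2
    FROM   events;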

select nearest neighbours

Submitted by 可紊 on 2019-12-10 18:27:01
Question: Consider the following data:

    category | index | value
    -------------------------
    cat 1    | 1     | 2
    cat 1    | 2     | 3
    cat 1    | 3     |
    cat 1    | 4     | 1
    cat 2    | 1     | 5
    cat 2    | 2     |
    cat 2    | 3     |
    cat 2    | 4     | 6
    cat 3    | 1     |
    cat 3    | 2     |
    cat 3    | 3     | 2
    cat 3    | 4     | 1

I am trying to fill in the holes, so that hole = avg(value) of the 2 nearest neighbours with non-null values within a category:

    category | index | value
    -------------------------
    cat 1    | 1     | 2
    cat 1    | 2     | 3
    cat 1    | 3     | 2*
    cat 1    | 4     | 1
    cat 2    | 1     | 5
    cat 2    | 2     | 5
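The expected output is cut off, but the two filled values that are visible (2* and 5) match a simple reading: average the immediately adjacent rows within the category, letting avg() ignore NULL neighbours. A sketch under that assumption, with the sample data inlined (swap in your real table); note it only looks one row in each direction, so reaching further for the nearest non-null values would need a gaps-and-islands style query instead:

    WITH data (category, idx, value) AS (
      VALUES ('cat 1', 1, 2), ('cat 1', 2, 3), ('cat 1', 3, NULL), ('cat 1', 4, 1),
             ('cat 2', 1, 5), ('cat 2', 2, NULL), ('cat 2', 3, NULL), ('cat 2', 4, 6),
             ('cat 3', 1, NULL), ('cat 3', 2, NULL), ('cat 3', 3, 2), ('cat 3', 4, 1)
    )
    SELECT category, idx,
           -- average of the previous and next row; NULLs are ignored by avg()
           COALESCE(value,
                    avg(value) OVER (PARTITION BY category ORDER BY idx
                                     ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)) AS value
    FROM data
    ORDER BY category, idx;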

Find movies with highest number of awards in certain year - code duplication

Submitted by ♀尐吖头ヾ on 2019-12-10 17:14:55
Question: I am trying to write a query (PostgreSQL) to get "movies with the highest number of awards in the year 2012". I have the following tables:

    CREATE TABLE Award(
        ID_AWARD bigserial CONSTRAINT Award_pk PRIMARY KEY,
        award_name VARCHAR(90),
        category VARCHAR(90),
        award_year integer,
        CONSTRAINT award_unique UNIQUE (award_name, category, award_year));

    CREATE TABLE AwardWinner(
        ID_AWARD integer,
        ID_ACTOR integer,
        ID_MOVIE integer,
        CONSTRAINT AwardWinner_pk PRIMARY KEY (ID_AWARD));

And I have written the following query,
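The original query is cut off above, but a common way to avoid duplicating the aggregation (for example repeating it inside a HAVING subquery) is to rank the per-movie counts in a single pass. A sketch under that assumption, using the tables defined above:

    -- Rank movies by their 2012 award count, then keep the top rank
    -- (assumes ID_MOVIE in AwardWinner identifies the awarded movie).
    SELECT ID_MOVIE, awards_count
    FROM (
        SELECT aw.ID_MOVIE,
               count(*) AS awards_count,
               rank() OVER (ORDER BY count(*) DESC) AS rnk
        FROM   AwardWinner aw
        JOIN   Award a ON a.ID_AWARD = aw.ID_AWARD
        WHERE  a.award_year = 2012
        GROUP  BY aw.ID_MOVIE
    ) ranked
    WHERE rnk = 1;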

Dynamic row range when calculating moving sum/average using window functions (SQL Server)

Submitted by 十年热恋 on 2019-12-10 15:13:05
Question: I'm currently working on a sample script which allows me to calculate the sum of the previous two rows and the current row. However, I would like to make the number '2' a variable. I've tried declaring a variable and casting directly in the query, yet a syntax error always pops up. Is there a possible solution?

    DECLARE @myTable TABLE (myValue INT)
    INSERT INTO @myTable ( myValue ) VALUES ( 5)
    INSERT INTO @myTable ( myValue ) VALUES ( 6)
    INSERT INTO @myTable ( myValue ) VALUES ( 7)
    INSERT
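The frame bound in a ROWS clause has to be a literal in T-SQL, which is why a declared variable there raises a syntax error. A sketch of one workaround, building the statement dynamically (it assumes a temp table #vals with an ordering column id, since a table variable is not visible inside sp_executesql):

    CREATE TABLE #vals (id int IDENTITY(1,1), myValue int);
    INSERT INTO #vals (myValue) VALUES (5), (6), (7), (8);

    DECLARE @n int = 2;  -- number of preceding rows to include in the moving sum
    DECLARE @sql nvarchar(max) = N'
        SELECT myValue,
               SUM(myValue) OVER (ORDER BY id
                                  ROWS BETWEEN ' + CAST(@n AS nvarchar(10)) + N' PRECEDING
                                  AND CURRENT ROW) AS moving_sum
        FROM #vals;';
    EXEC sp_executesql @sql;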

Why do you need to include a field in GROUP BY when using OVER (PARTITION BY x)?

Submitted by 流过昼夜 on 2019-12-10 15:05:25
Question: I have a table for which I want to do a simple sum of a field, grouped by two columns. I then want the total of all values for each year_num. See example: http://rextester.com/QSLRS68794

This query is throwing: "42803: column "foo.num_cust" must appear in the GROUP BY clause or be used in an aggregate function", and I cannot figure out why. Why would an aggregate function using OVER (PARTITION BY x) require the summed field to be in GROUP BY?

    select year_num
          ,age_bucket
          ,sum(num_cust) -
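The window function is evaluated after GROUP BY, so anything it references must itself be grouped or aggregated; wrapping the aggregate inside the window call is the usual fix. A sketch with the column names taken from the error message and a table name foo assumed:

    SELECT year_num,
           age_bucket,
           SUM(num_cust)                                   AS bucket_total,
           -- aggregate first, then window over the aggregated result
           SUM(SUM(num_cust)) OVER (PARTITION BY year_num) AS year_total
    FROM   foo
    GROUP  BY year_num, age_bucket;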

Postgresql: Grouping with limit on group size using window functions

Submitted by 淺唱寂寞╮ on 2019-12-10 14:53:22
Question: Is there a way in PostgreSQL to write a query which groups rows based on a column, with a limit on the group size, without discarding the additional rows? Say I've got a table with three columns id, color, score and the following rows:

    1  red    10.0
    2  red     7.0
    3  red     3.0
    4  blue    5.0
    5  green   4.0
    6  blue    2.0
    7  blue    1.0

I can get a grouping based on color with window functions using the following query:

    SELECT *
    FROM (
        SELECT id, color, score,
               rank() OVER (PARTITION BY color ORDER BY score DESC)
        FROM grouping_test
    ) AS foo
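The question is cut off above, but if the goal is to cap each colour group at N rows and let the overflow start a new sub-group, one sketch is to derive the sub-group from row_number() with integer division (N = 2 here, table name as in the query above):

    SELECT id, color, score,
           -- 0-based sub-group number within each color, at most 2 rows per sub-group
           (row_number() OVER (PARTITION BY color ORDER BY score DESC) - 1) / 2 AS subgroup
    FROM   grouping_test
    ORDER  BY color, subgroup, score DESC;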

PySpark: retrieve mean and the count of values around the mean for groups within a dataframe

Submitted by 自闭症网瘾萝莉.ら on 2019-12-10 13:33:41
Question: My raw data comes in a tabular format. It contains observations of different variables; each observation carries the variable name, the timestamp and the value at that time:

    Variable [string], Time [datetime], Value [float]

The data is stored as Parquet in HDFS and loaded into a Spark DataFrame (df). From that dataframe I now want to calculate default statistics like mean, standard deviation and others for each variable. Afterwards, once the mean has been retrieved, I want to filter/count
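A sketch of one way to do both steps with Spark SQL window functions (it assumes df has been registered as a temporary view named obs, and "around the mean" is read as within one sample standard deviation):

    SELECT Variable, mean_val, stddev_val,
           -- count the observations that fall within one standard deviation of the mean
           count(CASE WHEN abs(Value - mean_val) <= stddev_val THEN 1 END) AS n_within_1_sd
    FROM (
        SELECT Variable, Value,
               avg(Value)         OVER (PARTITION BY Variable) AS mean_val,
               stddev_samp(Value) OVER (PARTITION BY Variable) AS stddev_val
        FROM   obs
    ) t
    GROUP BY Variable, mean_val, stddev_val;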

Increment column value on certain condition in SQL query on Postgresql

Submitted by 无人久伴 on 2019-12-10 12:08:27
Question: I want to aggregate my walks with animals by week, starting a new group whenever the break between weeks is greater than 2 weeks. I have my table:

    Create table test.walk (animal text, week integer)

with a row for each walk I want to group:

    insert into test.walk values ('DOG', 2)
    insert into test.walk values ('DOG', 3)
    insert into test.walk values ('DOG', 4)
    insert into test.walk values ('CAT', 1)
    insert into test.walk values ('CAT', 1)
    insert into test.walk values ('CAT', 11)
    insert into test.walk
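A sketch of the usual gaps-and-islands approach for this kind of grouping (one reading of the truncated question): compare each week to the previous one with lag(), flag breaks larger than 2 weeks, and turn the flags into a group number with a running sum:

    SELECT animal, week,
           sum(new_group) OVER (PARTITION BY animal ORDER BY week) AS walk_group
    FROM (
        SELECT animal, week,
               CASE WHEN week - lag(week) OVER (PARTITION BY animal ORDER BY week) > 2
                    THEN 1 ELSE 0 END AS new_group   -- 1 marks the start of a new group
        FROM test.walk
    ) t
    ORDER BY animal, week;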

Limit the number of rows per ID

Submitted by 隐身守侯 on 2019-12-10 11:18:51
Question: I am trying to limit the number of rows per case to only 5 rows. Some cases have only 1 or 2 rows, but some have 15 or more. This is an example of a stored procedure that I am using to count the number of rows per case:

    SELECT ROW_NUMBER() OVER(PARTITION BY rce.ReportRunCaseId ORDER BY rce.ReportRunCaseId) AS Row,
           rce.ReportRunCaseId AS CaseId,
           YEAR(rce.EcoDate) AS EcoYear
    FROM PhdRpt.ReportCaseList AS rcl
    INNER JOIN PhdRpt.RptCaseEco AS rce
        ON rce.ReportId = rcl.ReportId
        AND rce
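A sketch of the usual pattern for capping rows per case: number the rows in a derived table and keep at most five per case. The join is abbreviated because the original query is cut off, and ordering by EcoDate is an assumption about which five rows should be kept:

    SELECT CaseId, EcoYear
    FROM (
        SELECT ROW_NUMBER() OVER (PARTITION BY rce.ReportRunCaseId
                                  ORDER BY rce.EcoDate) AS RowNum,
               rce.ReportRunCaseId AS CaseId,
               YEAR(rce.EcoDate)   AS EcoYear
        FROM PhdRpt.ReportCaseList AS rcl
        INNER JOIN PhdRpt.RptCaseEco AS rce
            ON rce.ReportId = rcl.ReportId
    ) AS numbered
    WHERE RowNum <= 5;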

Efficiently calculate top-k elements in spark

Submitted by 戏子无情 on 2019-12-10 10:14:15
Question: I have a dataframe similar to:

    +---+-----+-----+
    |key|thing|value|
    +---+-----+-----+
    | u1|  foo|    1|
    | u1|  foo|    2|
    | u1|  bar|   10|
    | u2|  foo|   10|
    | u2|  foo|    2|
    | u2|  bar|   10|
    +---+-----+-----+

and want to get a result of:

    +---+-----+---------+----+
    |key|thing|sum_value|rank|
    +---+-----+---------+----+
    | u1|  bar|       10|   1|
    | u1|  foo|        3|   2|
    | u2|  foo|       12|   1|
    | u2|  bar|       10|   2|
    +---+-----+---------+----+

Currently, there is code similar to:

    val df = Seq(("u1", "foo", 1), ("u1", "foo", 2), ("u1",
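A sketch of the same result in Spark SQL (assuming the dataframe is registered as a temporary view named t): aggregate first, then rank within each key; filtering on the rank keeps only the top-k rows if that is the end goal (k = 2 shown):

    SELECT key, thing, sum_value, rank
    FROM (
        SELECT key, thing, sum_value,
               rank() OVER (PARTITION BY key ORDER BY sum_value DESC) AS rank
        FROM (
            SELECT key, thing, sum(value) AS sum_value
            FROM   t
            GROUP  BY key, thing
        ) agg
    ) ranked
    WHERE rank <= 2       -- keep only the top-k (k = 2) things per key
    ORDER BY key, rank;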