window-functions

How to apply lag function on streaming dataframe?

Submitted by 二次信任 on 2019-12-10 19:36:11
Question: I have a streaming dataframe with three columns: time, col1, col2. I have to apply the lag function to col2. I have tried the following query:

    val w = org.apache.spark.sql.expressions.Window.orderBy("time")
    df.select(col("time"), col("col1"), lag("col2", 1).over(w))

But it gives the following exception:

    org.apache.spark.sql.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets

How can I achieve this? Thanks in advance.

Source: https://stackoverflow.com/questions/46036845
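The error is the engine's way of saying that Structured Streaming only supports time-based windows, not row-ordered analytic windows such as lag. For contrast, a short Spark SQL sketch (assuming the stream is registered as a temporary view named events; the 10-minute bucket is only illustrative):

    -- A time-based window: this form of windowing is supported on a stream
    SELECT window(time, '10 minutes') AS time_window,
           avg(col2)                  AS avg_col2
    FROM   events
    GROUP  BY window(time, '10 minutes');

    -- A row-ordered analytic window: works on a static DataFrame, but on a
    -- stream it raises the AnalysisException quoted above
    SELECT time, col1,
           lag(col2, 1) OVER (ORDER BY time) AS prev_col2
    FROM   events;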

select nearest neighbours

Submitted by 可紊 on 2019-12-10 18:27:01
Question: Consider the following data:

    category | index | value
    -------------------------
    cat 1    | 1     | 2
    cat 1    | 2     | 3
    cat 1    | 3     |
    cat 1    | 4     | 1
    cat 2    | 1     | 5
    cat 2    | 2     |
    cat 2    | 3     |
    cat 2    | 4     | 6
    cat 3    | 1     |
    cat 3    | 2     |
    cat 3    | 3     | 2
    cat 3    | 4     | 1

I am trying to fill in the holes, so that hole = avg(value) of the 2 nearest neighbours with non-null values within a category:

    category | index | value
    -------------------------
    cat 1    | 1     | 2
    cat 1    | 2     | 3
    cat 1    | 3     | 2*
    cat 1    | 4     | 1
    cat 2    | 1     | 5
    cat 2    | 2     | 5
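The expected output is cut off, but the two filled values that are visible (2* and 5) match a simple reading: average the immediately adjacent rows within the category, letting avg() ignore NULL neighbours. A sketch under that assumption, with the sample data inlined (swap in your real table); note it only looks one row in each direction, so reaching further for the nearest non-null values would need a gaps-and-islands style query instead:

    WITH data (category, idx, value) AS (
      VALUES ('cat 1', 1, 2), ('cat 1', 2, 3), ('cat 1', 3, NULL), ('cat 1', 4, 1),
             ('cat 2', 1, 5), ('cat 2', 2, NULL), ('cat 2', 3, NULL), ('cat 2', 4, 6),
             ('cat 3', 1, NULL), ('cat 3', 2, NULL), ('cat 3', 3, 2), ('cat 3', 4, 1)
    )
    SELECT category, idx,
           -- average of the previous and next row; NULLs are ignored by avg()
           COALESCE(value,
                    avg(value) OVER (PARTITION BY category ORDER BY idx
                                     ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)) AS value
    FROM data
    ORDER BY category, idx;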

Find movies with highest number of awards in certain year - code duplication

Submitted by ♀尐吖头ヾ on 2019-12-10 17:14:55
Question: I am trying to write a query (PostgreSQL) to get "movies with the highest number of awards in the year 2012". I have the following tables:

    CREATE TABLE Award(
        ID_AWARD bigserial CONSTRAINT Award_pk PRIMARY KEY,
        award_name VARCHAR(90),
        category VARCHAR(90),
        award_year integer,
        CONSTRAINT award_unique UNIQUE (award_name, category, award_year));

    CREATE TABLE AwardWinner(
        ID_AWARD integer,
        ID_ACTOR integer,
        ID_MOVIE integer,
        CONSTRAINT AwardWinner_pk PRIMARY KEY (ID_AWARD));

And I have written the following query,
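The original query is cut off above, but a common way to avoid duplicating the aggregation (for example repeating it inside a HAVING subquery) is to rank the per-movie counts in a single pass. A sketch under that assumption, using the tables defined above:

    -- Rank movies by their 2012 award count, then keep the top rank
    -- (assumes ID_MOVIE in AwardWinner identifies the awarded movie).
    SELECT ID_MOVIE, awards_count
    FROM (
        SELECT aw.ID_MOVIE,
               count(*) AS awards_count,
               rank() OVER (ORDER BY count(*) DESC) AS rnk
        FROM   AwardWinner aw
        JOIN   Award a ON a.ID_AWARD = aw.ID_AWARD
        WHERE  a.award_year = 2012
        GROUP  BY aw.ID_MOVIE
    ) ranked
    WHERE rnk = 1;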

Dynamic row range when calculating moving sum/average using window functions (SQL Server)

Submitted by 十年热恋 on 2019-12-10 15:13:05
Question: I'm currently working on a sample script which allows me to calculate the sum of the previous two rows and the current row. However, I would like to make the number '2' a variable. I've tried declaring a variable and casting directly in the query, yet a syntax error always pops up. Is there a possible solution?

    DECLARE @myTable TABLE (myValue INT)
    INSERT INTO @myTable ( myValue ) VALUES ( 5)
    INSERT INTO @myTable ( myValue ) VALUES ( 6)
    INSERT INTO @myTable ( myValue ) VALUES ( 7)
    INSERT
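The frame bound in a ROWS clause has to be a literal in T-SQL, which is why a declared variable there raises a syntax error. A sketch of one workaround, building the statement dynamically (it assumes a temp table #vals with an ordering column id, since a table variable is not visible inside sp_executesql):

    CREATE TABLE #vals (id int IDENTITY(1,1), myValue int);
    INSERT INTO #vals (myValue) VALUES (5), (6), (7), (8);

    DECLARE @n int = 2;  -- number of preceding rows to include in the moving sum
    DECLARE @sql nvarchar(max) = N'
        SELECT myValue,
               SUM(myValue) OVER (ORDER BY id
                                  ROWS BETWEEN ' + CAST(@n AS nvarchar(10)) + N' PRECEDING
                                  AND CURRENT ROW) AS moving_sum
        FROM #vals;';
    EXEC sp_executesql @sql;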

Why do you need to include a field in GROUP BY when using OVER (PARTITION BY x)?

Submitted by 流过昼夜 on 2019-12-10 15:05:25
Question: I have a table for which I want to do a simple sum of a field, grouped by two columns. I then want the total of all values for each year_num. See example: http://rextester.com/QSLRS68794

This query is throwing: "42803: column "foo.num_cust" must appear in the GROUP BY clause or be used in an aggregate function", and I cannot figure out why. Why would an aggregate function using OVER (PARTITION BY x) require the summed field to be in GROUP BY?

    select year_num
          ,age_bucket
          ,sum(num_cust) -
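The window function is evaluated after GROUP BY, so anything it references must itself be grouped or aggregated; wrapping the aggregate inside the window call is the usual fix. A sketch with the column names taken from the error message and a table name foo assumed:

    SELECT year_num,
           age_bucket,
           SUM(num_cust)                                   AS bucket_total,
           -- aggregate first, then window over the aggregated result
           SUM(SUM(num_cust)) OVER (PARTITION BY year_num) AS year_total
    FROM   foo
    GROUP  BY year_num, age_bucket;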

Postgresql: Grouping with limit on group size using window functions

Submitted by 淺唱寂寞╮ on 2019-12-10 14:53:22
Question: Is there a way in PostgreSQL to write a query which groups rows based on a column, with a limit on the group size, without discarding the additional rows? Say I've got a table with three columns id, color, score and the following rows:

    1  red    10.0
    2  red     7.0
    3  red     3.0
    4  blue    5.0
    5  green   4.0
    6  blue    2.0
    7  blue    1.0

I can get a grouping based on color with window functions using the following query:

    SELECT *
    FROM (
        SELECT id, color, score,
               rank() OVER (PARTITION BY color ORDER BY score DESC)
        FROM grouping_test
    ) AS foo
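The question is cut off above, but if the goal is to cap each colour group at N rows and let the overflow start a new sub-group, one sketch is to derive the sub-group from row_number() with integer division (N = 2 here, table name as in the query above):

    SELECT id, color, score,
           -- 0-based sub-group number within each color, at most 2 rows per sub-group
           (row_number() OVER (PARTITION BY color ORDER BY score DESC) - 1) / 2 AS subgroup
    FROM   grouping_test
    ORDER  BY color, subgroup, score DESC;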

PySpark: retrieve mean and the count of values around the mean for groups within a dataframe

Submitted by 自闭症网瘾萝莉.ら on 2019-12-10 13:33:41
Question: My raw data comes in a tabular format. It contains observations of different variables; each observation carries the variable name, the timestamp and the value at that time:

    Variable [string], Time [datetime], Value [float]

The data is stored as Parquet in HDFS and loaded into a Spark DataFrame (df). From that dataframe I now want to calculate default statistics like mean, standard deviation and others for each variable. Afterwards, once the mean has been retrieved, I want to filter/count
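A sketch of one way to do both steps with Spark SQL window functions (it assumes df has been registered as a temporary view named obs, and "around the mean" is read as within one sample standard deviation):

    SELECT Variable, mean_val, stddev_val,
           -- count the observations that fall within one standard deviation of the mean
           count(CASE WHEN abs(Value - mean_val) <= stddev_val THEN 1 END) AS n_within_1_sd
    FROM (
        SELECT Variable, Value,
               avg(Value)         OVER (PARTITION BY Variable) AS mean_val,
               stddev_samp(Value) OVER (PARTITION BY Variable) AS stddev_val
        FROM   obs
    ) t
    GROUP BY Variable, mean_val, stddev_val;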

Increment column value on certain condition in SQL query on Postgresql

Submitted by 无人久伴 on 2019-12-10 12:08:27
Question: I want to aggregate my walks with animals by week, starting a new group whenever the break between weeks is greater than 2 weeks. I have my table:

    Create table test.walk (animal text, week integer)

with a row for each walk I want to group:

    insert into test.walk values ('DOG', 2)
    insert into test.walk values ('DOG', 3)
    insert into test.walk values ('DOG', 4)
    insert into test.walk values ('CAT', 1)
    insert into test.walk values ('CAT', 1)
    insert into test.walk values ('CAT', 11)
    insert into test.walk
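A sketch of the usual gaps-and-islands approach for this kind of grouping (one reading of the truncated question): compare each week to the previous one with lag(), flag breaks larger than 2 weeks, and turn the flags into a group number with a running sum:

    SELECT animal, week,
           sum(new_group) OVER (PARTITION BY animal ORDER BY week) AS walk_group
    FROM (
        SELECT animal, week,
               CASE WHEN week - lag(week) OVER (PARTITION BY animal ORDER BY week) > 2
                    THEN 1 ELSE 0 END AS new_group   -- 1 marks the start of a new group
        FROM test.walk
    ) t
    ORDER BY animal, week;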

Limit the number of rows per ID

Submitted by 隐身守侯 on 2019-12-10 11:18:51
Question: I am trying to limit the number of rows per case to only 5 rows. Some cases have only 1 or 2 rows, but some have 15 or more. This is an example of a stored procedure that I am using to count the number of rows per case:

    SELECT ROW_NUMBER() OVER(PARTITION BY rce.ReportRunCaseId ORDER BY rce.ReportRunCaseId) AS Row,
           rce.ReportRunCaseId AS CaseId,
           YEAR(rce.EcoDate) AS EcoYear
    FROM PhdRpt.ReportCaseList AS rcl
    INNER JOIN PhdRpt.RptCaseEco AS rce
        ON rce.ReportId = rcl.ReportId
        AND rce
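A sketch of the usual pattern for capping rows per case: number the rows in a derived table and keep at most five per case. The join is abbreviated because the original query is cut off, and ordering by EcoDate is an assumption about which five rows should be kept:

    SELECT CaseId, EcoYear
    FROM (
        SELECT ROW_NUMBER() OVER (PARTITION BY rce.ReportRunCaseId
                                  ORDER BY rce.EcoDate) AS RowNum,
               rce.ReportRunCaseId AS CaseId,
               YEAR(rce.EcoDate)   AS EcoYear
        FROM PhdRpt.ReportCaseList AS rcl
        INNER JOIN PhdRpt.RptCaseEco AS rce
            ON rce.ReportId = rcl.ReportId
    ) AS numbered
    WHERE RowNum <= 5;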

Efficiently calculate top-k elements in spark

Submitted by 戏子无情 on 2019-12-10 10:14:15
Question: I have a dataframe similar to:

    +---+-----+-----+
    |key|thing|value|
    +---+-----+-----+
    | u1|  foo|    1|
    | u1|  foo|    2|
    | u1|  bar|   10|
    | u2|  foo|   10|
    | u2|  foo|    2|
    | u2|  bar|   10|
    +---+-----+-----+

and want to get a result of:

    +---+-----+---------+----+
    |key|thing|sum_value|rank|
    +---+-----+---------+----+
    | u1|  bar|       10|   1|
    | u1|  foo|        3|   2|
    | u2|  foo|       12|   1|
    | u2|  bar|       10|   2|
    +---+-----+---------+----+

Currently, there is code similar to:

    val df = Seq(("u1", "foo", 1), ("u1", "foo", 2), ("u1",
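A sketch of the same result in Spark SQL (assuming the dataframe is registered as a temporary view named t): aggregate first, then rank within each key; filtering on the rank keeps only the top-k rows if that is the end goal (k = 2 shown):

    SELECT key, thing, sum_value, rank
    FROM (
        SELECT key, thing, sum_value,
               rank() OVER (PARTITION BY key ORDER BY sum_value DESC) AS rank
        FROM (
            SELECT key, thing, sum(value) AS sum_value
            FROM   t
            GROUP  BY key, thing
        ) agg
    ) ranked
    WHERE rank <= 2       -- keep only the top-k (k = 2) things per key
    ORDER BY key, rank;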