window-functions

Create a group id over a window in Spark Dataframe

一曲冷凌霜 submitted on 2019-11-30 20:36:14
Question: I have a dataframe in which I want to assign an id to each window partition. For example, given

    id | col
    ---+-----
     1 | a
     2 | a
     3 | b
     4 | c
     5 | c

I want (based on grouping by column col):

    id | group
    ---+-------
     1 | 1
     2 | 1
     3 | 2
     4 | 3
     5 | 3

I want to use a window function, but I cannot find any way to assign an id to each window. I need something like:

    w = Window().partitionBy('col')
    df = df.withColumn("group", id().over(w))

Is there any way to achieve something like that? (I cannot simply use
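A common approach (not part of the original excerpt, so treat it as a sketch) is dense_rank ordered by col, which gives every row with the same col value the same number; the same window function is available through Spark's SQL interface, assuming the dataframe has been registered as a temporary view named df:

    SELECT id, col,
           -- no PARTITION BY: the rank numbers the distinct values of col,
           -- so all rows sharing a col value get the same group id
           DENSE_RANK() OVER (ORDER BY col) AS `group`
    FROM df

Note that a window without PARTITION BY pulls all rows into a single partition, which Spark warns about; for large data, joining against a small lookup of distinct col values may scale better.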

Select data for 15 minute windows - PostgreSQL

孤者浪人 submitted on 2019-11-30 18:14:43
Question: Right, so I have a table such as this in PostgreSQL:

    timestamp            duration
    2013-04-03 15:44:58  4
    2013-04-03 15:56:12  2
    2013-04-03 16:13:17  9
    2013-04-03 16:16:30  3
    2013-04-03 16:29:52  1
    2013-04-03 16:38:25  1
    2013-04-03 16:41:37  9
    2013-04-03 16:44:49  1
    2013-04-03 17:01:07  9
    2013-04-03 17:07:48  1
    2013-04-03 17:11:00  2
    2013-04-03 17:11:16  2
    2013-04-03 17:15:17  1
    2013-04-03 17:16:53  4
    2013-04-03 17:20:37  9
    2013-04-03 17:20:53  3
    2013-04-03 17:25:48  3
    2013-04-03 17:29:26  1
    2013-04-03 17:32:38  9
    2013-04
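A common PostgreSQL sketch for 15-minute buckets (not taken from the original post; it assumes the goal is a total duration per bucket and uses a placeholder table name, sessions): truncate the epoch to 900-second steps and group on that.

    SELECT to_timestamp(floor(extract(epoch FROM "timestamp") / 900) * 900) AS bucket_start,
           sum(duration) AS total_duration
    FROM sessions          -- placeholder table name
    GROUP BY bucket_start
    ORDER BY bucket_start;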

How to add a running count to rows in a 'streak' of consecutive days

倖福魔咒の submitted on 2019-11-30 17:16:00
Question: Thanks to Mike for the suggestion to add the create/insert statements.

    create table test (
      pid integer not null,
      date date not null,
      primary key (pid, date)
    );

    insert into test values
      (1, '2014-10-1'), (1, '2014-10-2'), (1, '2014-10-3'), (1, '2014-10-5'), (1, '2014-10-7'),
      (2, '2014-10-1'), (2, '2014-10-2'), (2, '2014-10-3'), (2, '2014-10-5'), (2, '2014-10-7');

I want to add a new column that is 'days in current streak', so the result would look like:

    pid | date       | in_streak
    ----+------------+-----------
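A common gaps-and-islands sketch for this (not part of the original excerpt): subtract a per-pid row number from the date so that consecutive days collapse to the same constant, then number the rows inside each such island.

    SELECT pid, date,
           row_number() OVER (PARTITION BY pid, grp ORDER BY date) AS in_streak
    FROM (
      SELECT pid, date,
             -- consecutive dates minus an increasing counter yield the same value,
             -- so grp identifies one streak per pid
             date - (row_number() OVER (PARTITION BY pid ORDER BY date))::int AS grp
      FROM test
    ) s
    ORDER BY pid, date;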

Average stock history table

☆樱花仙子☆ submitted on 2019-11-30 16:23:16
I have a table that tracks changes in stock through time for some stores and products. The value is the absolute stock, but we only insert a new row when the stock changes. This design was chosen to keep the table small, because it is expected to grow rapidly. This is an example schema and some test data:

    CREATE TABLE stocks (
      id serial NOT NULL,
      store_id integer NOT NULL,
      product_id integer NOT NULL,
      date date NOT NULL,
      value integer NOT NULL,
      CONSTRAINT stocks_pkey PRIMARY KEY (id),
      CONSTRAINT stocks_store_id_product_id_date_key UNIQUE (store_id, product_id, date)
    );

    insert into stocks
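Because a row is only written when the stock changes, the stock of a store/product on an arbitrary day is the most recent value on or before that day. A minimal sketch of that carry-forward lookup (the date literal is only an example value):

    SELECT DISTINCT ON (store_id, product_id)
           store_id, product_id, value AS stock_on_day
    FROM stocks
    WHERE date <= DATE '2014-02-01'   -- the day of interest (example)
    ORDER BY store_id, product_id, date DESC;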

Last_value window function doesn't work properly

不打扰是莪最后的温柔 submitted on 2019-11-30 13:41:13
The LAST_VALUE window function doesn't seem to work properly.

    CREATE TABLE EXAMP2 (
      CUSTOMER_ID NUMBER(38) NOT NULL,
      VALID_FROM  DATE       NOT NULL
    );

    Customer_id   Valid_from
    -------------------------------------
    9775          06.04.2013 01:34:16
    9775          06.04.2013 20:34:00
    9775          12.04.2013 11:07:01
    -------------------------------------

    select DISTINCT LAST_VALUE(VALID_FROM) OVER (partition by customer_id ORDER BY VALID_FROM ASC) rn
    from examp1;

When I use LAST_VALUE I get the following rows:

    06.04.2013 20:34:00
    06.04.2013 01:34:16
    12.04.2013 11:07:01

When I use FIRST_VALUE I get the following rows: select DISTINCT FIRST
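The usual explanation (a sketch, not quoted from the original answers) is the default window frame: with an ORDER BY and no explicit frame, the frame ends at the current row, so LAST_VALUE simply returns each row's own value. Extending the frame to the whole partition yields one value per customer:

    select DISTINCT
           LAST_VALUE(VALID_FROM) OVER (
             partition by customer_id
             ORDER BY VALID_FROM ASC
             -- make the frame span the whole partition instead of ending at the current row
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) rn
    from examp1;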

Bad optimization/planning on Postgres window-based queries (partition by(, group by?)) - 1000x speedup

偶尔善良 submitted on 2019-11-30 07:39:21
We are running Postgres 9.3.5 (07/2014). We have quite a complex data warehouse/reporting setup in place (ETL, materialized views, indexing, aggregations, analytical functions, ...). What I have just discovered may be difficult to implement in the optimizer (?), but it makes a huge difference in performance (the sample code below closely resembles our query, stripped of unnecessary complexity):

    create view foo as
    select
      sum(s.plan)   over w_pyl as pyl_plan,    -- money planned to spend in this pot/loc/year
      sum(s.booked) over w_pyl as pyl_booked,  -- money already booked in this pot/loc/year
      -- money

Partitioning by multiple columns in Spark SQL

纵饮孤独 submitted on 2019-11-30 07:36:58
With Spark SQL's window functions, I need to partition by multiple columns to run my data queries, as follows:

    val w = Window.partitionBy($"a").partitionBy($"b").rangeBetween(-100, 0)

I currently do not have a test environment (I'm working on setting this up), but as a quick question: is this currently supported as part of Spark SQL's window functions, or will this not work?

This won't work. The second partitionBy will overwrite the first one. Both partition columns have to be specified in the same call:

    val w = Window.partitionBy($"a", $"b").rangeBetween(-100, 0)

Source: https://stackoverflow.com

Speed of paged queries in Oracle

断了今生、忘了曾经 submitted on 2019-11-30 06:42:21
Question: This is a never-ending topic for me, and I'm wondering if I might be overlooking something. Essentially, I use two types of SQL statements in an application:

- Regular queries with a "fallback" limit
- Sorted and paged queries

Now, we're talking about queries against tables with several million records, joined to five more tables, each with several million records. Clearly, we hardly want to fetch all of them; that's why we have the above two methods to limit user queries. Case 1 is really simple. We
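For the sorted-and-paged case, the classic Oracle pattern (a sketch with placeholder table, column, and bind-variable names, not taken from the truncated post) wraps the sorted query in ROW_NUMBER() and filters on the computed row numbers:

    SELECT *
    FROM (
      SELECT t.*,
             ROW_NUMBER() OVER (ORDER BY t.created_at DESC) AS rn   -- created_at is a placeholder sort column
      FROM orders t                                                 -- orders is a placeholder table
    )
    WHERE rn BETWEEN :first_row AND :last_row
    ORDER BY rn;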

Calculating the Weighted Average Cost of products stock

家住魔仙堡 submitted on 2019-11-30 04:05:02
I have to calculate my products' stock cost, so for every product, after each purchase, I have to recalculate the Weighted Average Cost. I have a view that gives me the current product stock after each in/out movement:

    document_type  document_date  product_id  qty_out  qty_in  price    row_num  stock_balance
    SI             01/01/2014     52          0        600     1037.28  1        600
    SI             01/01/2014     53          0        300     1357.38  2        300
    LC             03/02/2014     53          100      0       1354.16  3        200
    LC             03/02/2014     53          150      0       1355.25  4        50
    LC             03/02/2014     52          100      0       1035.26  5        500
    LC             03/02/2014     52          200      0       1035.04  6        300
    LF             03/02/2014     53          0        1040    1356.44  7        1090
    LF             03/02/2014     52          0        1560    1045     8        1860
    LC             04/02/2014     52
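One hedged sketch of the calculation (not from the original post; it assumes the view is exposed as stock_view, a placeholder name, and uses Postgres-style WITH RECURSIVE): a purchase blends the previous average cost with the new price, weighted by quantities, while a sale leaves the average unchanged, so the movements must be processed in order.

    WITH RECURSIVE ordered AS (
      SELECT product_id, qty_in, qty_out, price, stock_balance,
             row_number() OVER (PARTITION BY product_id ORDER BY row_num) AS rn
      FROM stock_view
    ),
    wac AS (
      -- the first movement per product is assumed to be a purchase
      SELECT product_id, rn, stock_balance, price AS avg_cost
      FROM ordered
      WHERE rn = 1
      UNION ALL
      SELECT o.product_id, o.rn, o.stock_balance,
             CASE
               WHEN o.qty_in > 0   -- purchase: blend the previous average with the new price
                 THEN (w.stock_balance * w.avg_cost + o.qty_in * o.price)
                      / NULLIF(w.stock_balance + o.qty_in, 0)
               ELSE w.avg_cost     -- sale: the weighted average cost does not change
             END AS avg_cost
      FROM wac w
      JOIN ordered o ON o.product_id = w.product_id AND o.rn = w.rn + 1
    )
    SELECT product_id, rn, stock_balance, avg_cost
    FROM wac
    ORDER BY product_id, rn;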

pyspark: rolling average using timeseries data

给你一囗甜甜゛ submitted on 2019-11-29 22:35:36
I have a dataset consisting of a timestamp column and a dollars column. I would like to find the average number of dollars per week ending at the timestamp of each row. I was initially looking at the pyspark.sql.functions.window function, but that bins the data by week. Here's an example:

    %pyspark
    import datetime
    from pyspark.sql import functions as F

    df1 = sc.parallelize([(17, "2017-03-11T15:27:18+00:00"),
                          (13, "2017-03-11T12:27:18+00:00"),
                          (21, "2017-03-17T11:27:18+00:00")]).toDF(["dollars", "datestring"])
    df2 = df1.withColumn('timestampGMT', df1.datestring.cast('timestamp'))

    w = df2.groupBy(F
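A common window-function sketch for this (not from the original post; it is shown through Spark's SQL interface and assumes df2 has been registered as a temporary view named df2): order by the timestamp cast to epoch seconds and use a range frame covering the preceding seven days.

    SELECT dollars, timestampGMT,
           AVG(dollars) OVER (
             ORDER BY CAST(timestampGMT AS long)
             RANGE BETWEEN 604800 PRECEDING AND CURRENT ROW   -- 7 * 86400 seconds
           ) AS rolling_avg_7d
    FROM df2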