window-functions

pyspark: count distinct over a window

牧云@^-^@ submitted on 2019-12-03 05:28:17
I just tried doing a countDistinct over a window and got this error: AnalysisException: u'Distinct window functions are not supported: count(distinct color#1926)'. Is there a way to do a distinct count over a window in PySpark? Here's some example code: from pyspark.sql.window import Window from pyspark.sql import functions as F # function to calculate number of seconds from number of days days = lambda i: i * 86400 df = spark.createDataFrame([(17, "2017-03-10T15:27:18+00:00", "orange"), (13, "2017-03-15T12:27:18+00:00", "red"), (25, "2017-03-18T11:27:18+00:00", "red")], ["dollars", "timestampGMT…
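A commonly suggested PySpark workaround is to collect the distinct values into a set and take its size, e.g. `F.size(F.collect_set("color").over(w))`. The restriction is not Spark-specific: SQLite also rejects DISTINCT inside window functions, so the sketch below shows a pure-SQL alternative using DENSE_RANK (table and column names are made up to mirror the question's data):

```python
import sqlite3

# Hypothetical table mirroring the question's rows. sqlite3 (>= 3.25)
# likewise rejects COUNT(DISTINCT ...) OVER (...), so we emulate a
# distinct count per window with DENSE_RANK.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fruit (dollars INT, ts TEXT, color TEXT)")
con.executemany("INSERT INTO fruit VALUES (?, ?, ?)", [
    (17, "2017-03-10T15:27:18+00:00", "orange"),
    (13, "2017-03-15T12:27:18+00:00", "red"),
    (25, "2017-03-18T11:27:18+00:00", "red"),
])

# DENSE_RANK the column inside the window, then take the MAX rank:
# the maximum dense rank equals the number of distinct values.
rows = con.execute("""
    SELECT color, dollars,
           MAX(dr) OVER () AS distinct_colors
    FROM (SELECT color, dollars,
                 DENSE_RANK() OVER (ORDER BY color) AS dr
          FROM fruit)
""").fetchall()
```

Every row ends up carrying the distinct color count (2 here); partitioning both the DENSE_RANK and the MAX by the same keys scopes the count to a partition instead of the whole frame.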

How to use window functions in PySpark?

徘徊边缘 submitted on 2019-12-03 05:16:48
I'm trying to use some window functions (ntile and percentRank) on a data frame, but I don't know how to use them. Can anyone help me with this, please? There are no examples of them in the Python API documentation. Specifically, I'm trying to get quantiles of a numeric field in my data frame. I'm using Spark 1.4.0. To be able to use a window function you have to create a window first. The definition is pretty much the same as for normal SQL: you can define either order, partition, or both. First, let's create some dummy data: import numpy as np np.random.seed(1) keys = ["foo"] * 10 + ["bar"…
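In PySpark this is `F.ntile(4).over(Window.orderBy("v"))` and `F.percent_rank().over(...)` after building a window spec. The underlying semantics are standard SQL, so a minimal sketch with sqlite3 (hypothetical table and column names) shows what the two functions return:

```python
import sqlite3

# Eight ordered values: NTILE(4) splits them into quartile buckets,
# PERCENT_RANK() gives (rank - 1) / (rows - 1) in [0, 1].
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE vals (k TEXT, v REAL)")
con.executemany("INSERT INTO vals VALUES (?, ?)",
                [("foo", float(v)) for v in [1, 2, 3, 4, 5, 6, 7, 8]])

rows = con.execute("""
    SELECT v,
           NTILE(4)       OVER (ORDER BY v) AS quartile,
           PERCENT_RANK() OVER (ORDER BY v) AS pct_rank
    FROM vals
    ORDER BY v
""").fetchall()
```

The quartile column comes out as 1,1,2,2,3,3,4,4 and pct_rank runs from 0.0 for the smallest value to 1.0 for the largest; adding PARTITION BY k would compute the quantiles per key.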

Applying Multiple Window Functions On Same Partition

本小妞迷上赌 submitted on 2019-12-03 05:09:49
Question: Is it possible to apply multiple window functions to the same partition? (Correct me if I'm not using the right vocabulary.) For example, you can do SELECT name, first_value() over (partition by name order by date) from table1 But is there a way to do something like: SELECT name, (first_value() as f, last_value() as l (partition by name order by date)) from table1 where we are applying two functions over the same window? Reference: http://postgresql.ro/docs/8.4/static/tutorial-window.html Answer 1: …
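Yes: each window function takes its own OVER clause, and a named WINDOW clause lets several of them share one definition without repeating it. A sketch with sqlite3 (table contents are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table1 (name TEXT, date TEXT, score INT)")
con.executemany("INSERT INTO table1 VALUES (?, ?, ?)", [
    ("ann", "2016-01-01", 10),
    ("ann", "2016-01-02", 20),
    ("bob", "2016-01-01", 30),
])

# Two window functions over one shared, named window. The explicit
# frame matters: with the default frame, LAST_VALUE() would stop at
# the current row and just echo each row's own score.
rows = con.execute("""
    SELECT name,
           FIRST_VALUE(score) OVER w AS f,
           LAST_VALUE(score)  OVER w AS l
    FROM table1
    WINDOW w AS (PARTITION BY name ORDER BY date
                 ROWS BETWEEN UNBOUNDED PRECEDING
                          AND UNBOUNDED FOLLOWING)
    ORDER BY name, date
""").fetchall()
```

Both "ann" rows report f=10, l=20; the single "bob" row reports 30 for both.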

Filtering by window function result in Postgresql

守給你的承諾、 submitted on 2019-12-03 04:38:56
OK, initially this was just a joke we had with a friend of mine, but it turned into an interesting technical question :) I have the following stuff table: CREATE TABLE stuff ( id serial PRIMARY KEY, volume integer NOT NULL DEFAULT 0, priority smallint NOT NULL DEFAULT 0 ); The table contains the records for all of my stuff, with respective volume and priority (how much I need it). I have a bag with a specified volume, say 1000. I want to select from the table all the stuff I can put into the bag, packing the most important stuff first. This seems like a case for window functions, so here is the…
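A window function cannot appear directly in WHERE, so the usual pattern is to compute the running total in a subquery and filter on it outside. A sketch with sqlite3 (the volumes and priorities are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE stuff (
    id INTEGER PRIMARY KEY,
    volume INTEGER NOT NULL DEFAULT 0,
    priority INTEGER NOT NULL DEFAULT 0)""")
con.executemany("INSERT INTO stuff (volume, priority) VALUES (?, ?)",
                [(500, 3), (400, 2), (300, 1), (200, 5)])

# Running total of volume in priority order, computed in a subquery so
# the outer query can filter on it; items are kept while the cumulative
# volume still fits in the bag.
rows = con.execute("""
    SELECT id, volume, priority
    FROM (SELECT *,
                 SUM(volume) OVER (ORDER BY priority DESC, id) AS running
          FROM stuff)
    WHERE running <= 1000
    ORDER BY id
""").fetchall()
```

With a bag of 1000 this keeps the priority-5 and priority-3 items (200 + 500 = 700) and stops at the priority-2 item, which would push the total to 1100. Note the greedy running total can skip a smaller item that would still have fit; an exact best packing is a knapsack problem, beyond what a window function alone can do.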

What is the Hamming window for?

六月ゝ 毕业季﹏ submitted on 2019-12-03 03:55:10
Question: I'm working with some code that does a Fourier transform (to calculate the cepstrum of an audio sample). Before it computes the Fourier transform, it applies a Hamming window to the sample: for (int i = 0; i < SEGMENTATION_LENGTH; i++) { timeDomain[i] = (float) ((0.53836 - (0.46164 * Math.cos(TWOPI * (double) i / (double) (SEGMENTATION_LENGTH - 1)))) * frameBuffer[i]); } Why is it doing this? I can't find any reason for it to do this in the code, or online. Answer 1: Whenever you do a finite…
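The coefficients in that loop are the Hamming window, w[i] = 0.53836 − 0.46164·cos(2πi/(N−1)). Multiplying the frame by it tapers the segment toward zero at its edges, suppressing the discontinuity a finite-length FFT would otherwise see at the frame boundaries (spectral leakage). A small sketch of the window itself:

```python
import math

def hamming(n):
    """Hamming window of length n, using the exact constants from the question."""
    return [0.53836 - 0.46164 * math.cos(2 * math.pi * i / (n - 1))
            for i in range(n)]

w = hamming(11)
# The edges taper to ~0.0767 while the centre sample is exactly 1.0,
# so w * frame smoothly attenuates the ends of the segment.
```

The slightly unusual constants (0.53836/0.46164 rather than the textbook 0.54/0.46) are an optimized variant of the same window; either pair behaves essentially identically here.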

How to use lag and rangeBetween functions on timestamp values?

拥有回忆 submitted on 2019-12-03 03:44:27
I have data that looks like this:

userid,eventtime,location_point
4e191908,2017-06-04 03:00:00,18685891
4e191908,2017-06-04 03:04:00,18685891
3136afcb,2017-06-04 03:03:00,18382821
661212dd,2017-06-04 03:06:00,80831484
40e8a7c3,2017-06-04 03:12:00,18825769

I would like to add a new boolean column that marks true if there are 2 or more userid within a 5-minute window at the same location_point. I had the idea of using the lag function to look over a window partitioned by the userid, with the range between the current timestamp and the next 5 minutes: from pyspark.sql import functions as F from…
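In PySpark the usual approach is to order the window by the timestamp cast to long seconds and use Window.rangeBetween(-300, 0), so the frame means "the last five minutes". The same frame semantics can be sketched with sqlite3 over epoch seconds (the epoch values below are the question's timestamps converted to UTC seconds):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (userid TEXT, epoch INT, location INT)")
con.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("4e191908", 1496545200, 18685891),  # 2017-06-04 03:00:00
    ("4e191908", 1496545440, 18685891),  # 2017-06-04 03:04:00
    ("3136afcb", 1496545380, 18382821),  # 2017-06-04 03:03:00
    ("661212dd", 1496545560, 80831484),  # 2017-06-04 03:06:00
    ("40e8a7c3", 1496545920, 18825769),  # 2017-06-04 03:12:00
])

# RANGE frame measured in seconds: count rows at the same location
# within the preceding 5 minutes, and flag locations with 2+ events.
rows = con.execute("""
    SELECT userid, epoch, location,
           COUNT(*) OVER (PARTITION BY location ORDER BY epoch
                          RANGE BETWEEN 300 PRECEDING AND CURRENT ROW) >= 2
             AS crowded
    FROM events
    ORDER BY epoch
""").fetchall()
```

Only the 03:04 row at location 18685891 is flagged, because this frame looks backwards; to mark the earlier of the two rows as well, widen the frame to 300 FOLLOWING too. (Numeric RANGE frames need SQLite >= 3.28.)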

How to use a SQL window function to calculate a percentage of an aggregate

☆樱花仙子☆ submitted on 2019-12-02 19:30:26
I need to calculate percentages of various dimensions in a table. I'd like to simplify things by using window functions to calculate the denominator; however, I am having an issue because the numerator has to be an aggregate as well. As a simple example, take the following table: create temp table test (d1 text, d2 text, v numeric); insert into test values ('a','x',5), ('a','y',5), ('a','y',10), ('b','x',20); If I just want to calculate the share of each individual row out of d1, then window functions work fine: select d1, d2, v/sum(v) over (partition by d1) from test; "b";"x";1.00 "a";"x";0…
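Window functions are evaluated after GROUP BY, so in Postgres you can nest the aggregate directly in the window: sum(sum(v)) over (partition by d1). An equivalent, more portable form does the grouping in a subquery and applies the window outside, sketched here with sqlite3 on the question's own data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (d1 TEXT, d2 TEXT, v NUMERIC)")
con.executemany("INSERT INTO test VALUES (?, ?, ?)",
                [("a", "x", 5), ("a", "y", 5), ("a", "y", 10), ("b", "x", 20)])

# Aggregate per (d1, d2) in the subquery, then divide each subtotal by
# the windowed total of its d1 group to get the share.
rows = con.execute("""
    SELECT d1, d2,
           1.0 * sv / SUM(sv) OVER (PARTITION BY d1) AS share
    FROM (SELECT d1, d2, SUM(v) AS sv FROM test GROUP BY d1, d2)
    ORDER BY d1, d2
""").fetchall()
```

For this data the shares come out as ('a','x') = 0.25, ('a','y') = 0.75, ('b','x') = 1.0.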

Running total over repeating group by items based on time in Oracle SQL

ⅰ亾dé卋堺 submitted on 2019-12-02 19:01:27
Question: My first post, so bear with me. I want to sum based upon a value that is broken by dates, but I only want the sum for the dates, not for the group-by item in total. I have been working on this for days, trying to avoid using a cursor, but may have to. Here's an example of the data I'm looking at (this is in Oracle 11g):

Key    Time             Amt
------ ---------------- ------
Null   1-1-2016 00:00   50
Null   1-1-2016 02:00   50
Key1   1-1-2016 04:00   30
Null   1-1-2016 06:00   30
Null   1-1-2016 08:00   30
Key2   1-1…
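This is the classic "gaps and islands" pattern: flag each row where the key changes from the previous row in time order, take a running sum of the flags to number the contiguous islands, then sum Amt per island. A sketch in sqlite3 (same idea works in Oracle 11g, which has LAG and windowed SUM; the data is a simplified slice of the question's table):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (k TEXT, t TEXT, amt INT)")
con.executemany("INSERT INTO readings VALUES (?, ?, ?)", [
    (None, "2016-01-01 00:00", 50),
    (None, "2016-01-01 02:00", 50),
    ("Key1", "2016-01-01 04:00", 30),
    (None, "2016-01-01 06:00", 30),
    (None, "2016-01-01 08:00", 30),
])

# 1) LAG fetches the previous row's key; `IS` treats two NULLs as equal.
# 2) A running SUM of the change flags numbers each contiguous island.
# 3) SUM(amt) per island is the per-run total, independent of the key
#    appearing again later.
rows = con.execute("""
    WITH flagged AS (
        SELECT *, CASE WHEN k IS LAG(k) OVER (ORDER BY t)
                       THEN 0 ELSE 1 END AS chg
        FROM readings),
    islands AS (
        SELECT *, SUM(chg) OVER (ORDER BY t) AS grp FROM flagged)
    SELECT k, t, SUM(amt) OVER (PARTITION BY grp) AS island_total
    FROM islands ORDER BY t
""").fetchall()
```

The two leading Null rows form one island totalling 100, Key1 is its own island of 30, and the following Null rows form a new island of 60 rather than folding back into the first one.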

Using window functions in an update statement

我怕爱的太早我们不能终老 submitted on 2019-12-02 18:52:38
I have a large PostgreSQL table which I access through Django. Because Django's ORM does not support window functions, I need to bake the result of a window function into the table as a regular column. I want to do something like this: UPDATE table_name SET col1 = ROW_NUMBER() OVER ( PARTITION BY col2 ORDER BY col3 ); But I get: ERROR: cannot use window function in UPDATE Can anyone suggest an alternative approach? Passing the window-function SQL through Django's .raw() method is not suitable, as it returns a RawQuerySet, which does not support further ORM features such as .filter(), which…
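The standard Postgres workaround is to compute the window function in a subquery and join it back: UPDATE table_name SET col1 = sub.rn FROM (SELECT id, ROW_NUMBER() OVER (PARTITION BY col2 ORDER BY col3) AS rn FROM table_name) sub WHERE table_name.id = sub.id. A portable variant using a correlated scalar subquery, sketched with sqlite3 (the column values are stand-ins):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE table_name (
    id INTEGER PRIMARY KEY, col1 INT, col2 TEXT, col3 INT)""")
con.executemany(
    "INSERT INTO table_name (col1, col2, col3) VALUES (?, ?, ?)",
    [(0, "a", 30), (0, "a", 10), (0, "b", 20)])

# Window functions are rejected at the top level of UPDATE ... SET, but
# are fine inside a subquery: rank all rows once, keyed by id, and pull
# each row's rank back in with a correlated lookup.
con.execute("""
    UPDATE table_name
    SET col1 = (SELECT rn
                FROM (SELECT id, ROW_NUMBER() OVER
                        (PARTITION BY col2 ORDER BY col3) AS rn
                      FROM table_name) ranked
                WHERE ranked.id = table_name.id)
""")
rows = con.execute("SELECT id, col1 FROM table_name ORDER BY id").fetchall()
```

After the update, the row with col3 = 10 ranks first within partition "a", the col3 = 30 row ranks second, and the lone "b" row ranks first. The UPDATE ... FROM join form is generally faster on large tables than the correlated lookup shown here.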

Applying Multiple Window Functions On Same Partition

狂风中的少年 submitted on 2019-12-02 18:26:14
Is it possible to apply multiple window functions to the same partition? (Correct me if I'm not using the right vocabulary.) For example, you can do SELECT name, first_value() over (partition by name order by date) from table1 But is there a way to do something like: SELECT name, (first_value() as f, last_value() as l (partition by name order by date)) from table1 where we are applying two functions over the same window? Reference: http://postgresql.ro/docs/8.4/static/tutorial-window.html Answer (Adriaan Stander): Can you not just use a window per selection? Something like SELECT name, first_value() OVER…