window-functions

Group by repeating attribute

Submitted by 旧街凉风 on 2019-11-28 14:21:28
Basically I have a table messages, with a user_id field that identifies the user who created each message. When I display a conversation (a set of messages) between two users, I want to be able to group the messages by user_id, but in a tricky way. Let's say there are some messages (sorted by created_at desc):

    id: 1, user_id: 1
    id: 2, user_id: 1
    id: 3, user_id: 2
    id: 4, user_id: 2
    id: 5, user_id: 1

I want to get 3 message groups, in this order: [1,2], [3,4], [5]. It should group by user_id until it sees a different one, and then group by that one. I'm using PostgreSQL and would be happy to
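This is the classic "gaps and islands" pattern. A minimal PostgreSQL sketch, assuming a messages(id, user_id, created_at) table as described above: a run boundary is detected with lag(), turned into a group number by a running sum, and the groups are then aggregated.

    SELECT user_id,
           array_agg(id ORDER BY created_at DESC) AS message_ids
    FROM (
        SELECT id, user_id, created_at,
               -- running count of author changes = group number
               SUM(CASE WHEN user_id IS DISTINCT FROM prev_user_id
                        THEN 1 ELSE 0 END)
                   OVER (ORDER BY created_at DESC) AS grp
        FROM (
            SELECT id, user_id, created_at,
                   lag(user_id) OVER (ORDER BY created_at DESC) AS prev_user_id
            FROM messages
        ) AS with_prev
    ) AS marked
    GROUP BY grp, user_id
    ORDER BY grp;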

Window functions: PARTITION BY one column after ORDER BY another

Submitted by 家住魔仙堡 on 2019-11-28 12:46:13
Disclaimer: The problem shown here is much more general than I first expected. The example below is taken from a solution to another question, but since then I have been using this sample to solve many more problems, mostly related to time series (have a look at the "Linked" section in the sidebar). So let me first explain the problem more generally: I am using PostgreSQL, but I am sure this problem exists in other DBMSs that support window functions (MS SQL Server, Oracle, ...) as well. Window functions can be used to group certain values together by a common attribute or value. For example you can
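The body is cut off above, but the general pattern it describes usually boils down to deriving a grouping column from one ordering and then partitioning by it. A hedged sketch against a hypothetical readings(ts, status) table:

    SELECT ts, status,
           first_value(ts) OVER (PARTITION BY grp ORDER BY ts) AS group_start
    FROM (
        SELECT ts, status,
               -- running count of status changes = derived partition key
               SUM(CASE WHEN status IS DISTINCT FROM prev_status
                        THEN 1 ELSE 0 END) OVER (ORDER BY ts) AS grp
        FROM (
            SELECT ts, status,
                   lag(status) OVER (ORDER BY ts) AS prev_status
            FROM readings
        ) AS with_prev
    ) AS marked
    ORDER BY ts;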

Time difference in hours and seconds over a partition window in Teradata (Sessionizing Records)

Submitted by 青春壹個敷衍的年華 on 2019-11-28 12:11:36
Question: Given a table like this:

    cust_id  time
    123      2015-01-01 12:15:05
    123      2015-01-01 12:17:06
    123      2015-01-02 13:15:08
    123      2015-01-02 15:15:10
    456      2015-01-01 10:15:05
    456      2015-01-01 12:15:07
    456      2015-01-01 14:11:10

I would like to calculate the time difference between each record and the preceding one (think lag function), per cust_id. My desired output:

    cust_id  time                 diff_hours  diff_seconds
    123      2015-01-01 12:15:05  NULL        NULL
    123      2015-01-01 12:17:06  0.00        121
    123      2015-01-02 13:15:08  1.04        89882
    123      2015-01-02 15:15:10
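A hedged sketch of the windowing part: newer Teradata releases support LAG directly, and older ones can emulate it with MIN() over a one-row frame. Converting the timestamp difference into hours and seconds is dialect-specific and only indicated here; the table and column names are illustrative.

    SELECT cust_id,
           event_time,
           -- emulation of LAG(event_time) for releases without LAG/LEAD
           MIN(event_time) OVER (PARTITION BY cust_id
                                 ORDER BY event_time
                                 ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
               AS prev_time
    FROM cust_events;
    -- diff_seconds / diff_hours would then be derived from
    -- (event_time - prev_time), e.g. via an interval cast in Teradata.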

Select all threads and order by the latest one

Submitted by 大城市里の小女人 on 2019-11-28 11:12:26
Question: Now that I got the "Select all forums and get latest post too.. how?" question answered, I am trying to write a query that selects all threads in one particular forum and orders them by the date of the latest post (column updated_at). This is my structure again:

    forums                    forum_threads    forum_posts
    ----------                -------------    -----------
    id                        id               id
    parent_forum (NULLABLE)   forum_id         content
    name                      user_id          thread_id
    description               title            user_id
    icon                      views            updated_at
    created_at                created_at
                              updated_at
                              last_post_id
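A hedged sketch of one common way to do this, joining each thread to the timestamp of its latest post; the WHERE value is illustrative and this is not necessarily the accepted answer:

    SELECT t.*, p.last_post_at
    FROM forum_threads t
    JOIN (
        SELECT thread_id, MAX(updated_at) AS last_post_at
        FROM forum_posts
        GROUP BY thread_id
    ) p ON p.thread_id = t.id
    WHERE t.forum_id = 1   -- the forum being displayed (illustrative)
    ORDER BY p.last_post_at DESC;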

Calculating SQL Server ROW_NUMBER() OVER() for a derived table

Submitted by 倖福魔咒の on 2019-11-28 11:00:52
In some other databases (e.g. DB2, or Oracle with ROWNUM), I can omit the ORDER BY clause in a ranking function's OVER() clause. For instance:

    ROW_NUMBER() OVER()

This is particularly useful when used with ordered derived tables, such as:

    SELECT t.*, ROW_NUMBER() OVER()
    FROM (
        SELECT ... ORDER BY
    ) t

How can this be emulated in SQL Server? I've found people using this trick, but that's wrong, as it will behave non-deterministically with respect to the order from the derived table:

    -- This order here ---------------------vvvvvvvv
    SELECT t.*, ROW_NUMBER() OVER(ORDER BY (SELECT 1))
    FROM (
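A hedged sketch of the usual deterministic workaround: expose the derived table's ordering key and repeat it in OVER(ORDER BY ...); the table and column names are illustrative.

    SELECT t.*,
           ROW_NUMBER() OVER (ORDER BY t.sort_key) AS rn
    FROM (
        SELECT o.id, o.created_at AS sort_key
        FROM orders o
    ) t
    ORDER BY t.sort_key;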

Select random row for each group

Submitted by *爱你&永不变心* on 2019-11-28 10:16:40
I have a table like this:

    ID  ATTRIBUTE
    1   A
    1   A
    1   B
    1   C
    2   B
    2   C
    2   C
    3   A
    3   B
    3   C

I'd like to select just one random attribute for each ID. The result could therefore look like this (although this is just one of many options):

    ATTRIBUTE
    B
    C
    C

This is my attempt at the problem:

    SELECT "ATTRIBUTE"
    FROM (
        SELECT "ID", "ATTRIBUTE",
               row_number() OVER (PARTITION BY "ID" ORDER BY random()) rownum
        FROM table
    ) shuffled
    WHERE rownum = 1

However, I don't know if this is a good solution, as I need to introduce row numbers, which is a bit cumbersome. Do you have a better one?

select distinct on (id) id,
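The truncated line above appears to begin PostgreSQL's DISTINCT ON idiom; a hedged sketch of how that approach typically continues (not necessarily the original answer's exact code):

    SELECT DISTINCT ON (id) id, attribute
    FROM tbl                 -- illustrative table name
    ORDER BY id, random();   -- random() decides which row survives per id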

Referencing current row in FILTER clause of window function

Submitted by 為{幸葍}努か on 2019-11-28 09:09:27
In PostgreSQL 9.4 the window functions have the new option of a FILTER clause to select a subset of the window frame for processing. The documentation mentions it but provides no sample. An online search yields some samples, including from 2ndQuadrant, but all that I found were rather trivial examples with constant expressions. What I am looking for is a filter expression that includes the value of the current row. Assume I have a table with a bunch of columns, one of which is of date type:

    col1 | col2 | dt
    -----+------+-----------
       1 | a    | 2015-07-01
       2 | b    | 2015-07-03
       3 | c    | 2015-07-10
       4 | d    |
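For context: the FILTER condition is evaluated against each row entering the aggregate, so it cannot reference the current row of the window. A hedged sketch of the usual LATERAL workaround, counting rows in an illustrative 4-day trailing window:

    SELECT t.col1, t.col2, t.dt, w.cnt
    FROM tbl t
    CROSS JOIN LATERAL (
        SELECT count(*) AS cnt
        FROM tbl t2
        WHERE t2.dt BETWEEN t.dt - interval '4 days' AND t.dt
    ) w
    ORDER BY t.dt;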

PostgreSQL window function: partition by comparison

Submitted by 故事扮演 on 2019-11-28 08:49:26
I'm trying to find a way to compare against the current row in the PARTITION BY clause of a window function in a PostgreSQL query. Imagine I have the short list produced by the following query, with 5 elements (in the real case, I have thousands or even millions of rows). For each row, I am trying to get the id of the next different element (event column), and the id of the previous different element.

    WITH events AS (
        SELECT 1 as id, 12 as event, '2014-03-19 08:00:00'::timestamp as date
        UNION
        SELECT 2 as id, 12 as event, '2014-03-19 08:30:00'::timestamp as date
        UNION
        SELECT 3 as id, 13 as
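A hedged sketch of a gaps-and-islands approach, assuming the events CTE is completed as in the question and that ids ascend with date: number the runs of equal events, aggregate each run, then use lag()/lead() across runs to find the neighbouring different elements.

    WITH marked AS (
        SELECT id, event, date,
               SUM(CASE WHEN event IS DISTINCT FROM prev_event
                        THEN 1 ELSE 0 END) OVER (ORDER BY date) AS grp
        FROM (
            SELECT id, event, date,
                   lag(event) OVER (ORDER BY date) AS prev_event
            FROM events
        ) AS with_prev
    ),
    runs AS (
        SELECT grp,
               lag(max(id))  OVER (ORDER BY grp) AS prev_diff_id, -- last id of previous run
               lead(min(id)) OVER (ORDER BY grp) AS next_diff_id  -- first id of next run
        FROM marked
        GROUP BY grp
    )
    SELECT m.id, m.event, r.prev_diff_id, r.next_diff_id
    FROM marked m
    JOIN runs r USING (grp)
    ORDER BY m.date;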

Does Spark know the partitioning key of a DataFrame?

Submitted by 做~自己de王妃 on 2019-11-28 06:33:33
I want to know whether Spark knows the partitioning key of a parquet file and uses this information to avoid shuffles. Context: Running Spark 2.0.1 with a local SparkSession. I have a csv dataset that I am saving as a parquet file on my disk like so:

    val df0 = spark
      .read
      .format("csv")
      .option("header", true)
      .option("delimiter", ";")
      .option("inferSchema", false)
      .load("SomeFile.csv")

    val df = df0.repartition(partitionExprs = col("numerocarte"), numPartitions = 42)

    df.write
      .mode(SaveMode.Overwrite)
      .format("parquet")
      .option("inferSchema", false)
      .save("SomeFile.parquet")

I am creating 42
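A hedged sketch for checking this empirically: read the file back and inspect the physical plan of a query that groups by the same column. Plain parquet does not record the writer's partitioner, so an Exchange typically appears.

    import org.apache.spark.sql.functions.col

    val df2 = spark.read.parquet("SomeFile.parquet")
    df2.groupBy(col("numerocarte")).count().explain()
    // An "Exchange hashpartitioning(numerocarte, ...)" node in the plan
    // indicates a shuffle: Spark cannot reuse the original partitioning.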

Applying a Window function to calculate differences in pySpark

Submitted by 自作多情 on 2019-11-28 06:28:54
I am using pySpark, and have set up my dataframe with two columns representing a daily asset price as follows:

    ind = sc.parallelize(range(1, 5))
    prices = sc.parallelize([33.3, 31.1, 51.2, 21.3])
    data = ind.zip(prices)
    df = sqlCtx.createDataFrame(data, ["day", "price"])

Upon applying df.show() I get:

    +---+-----+
    |day|price|
    +---+-----+
    |  1| 33.3|
    |  2| 31.1|
    |  3| 51.2|
    |  4| 21.3|
    +---+-----+

Which is fine and all. I would like to have another column that contains the day-to-day returns of the price column, i.e. something like (price(day2) - price(day1)) / price(day1). After much research, I am told
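A hedged sketch of the standard lag()-over-a-Window approach; the imports and the ordering column are assumptions consistent with the setup above:

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # Ordering the whole frame by "day"; without a partitionBy this pulls
    # all rows into one partition, which is fine for a small example.
    w = Window.orderBy("day")

    result = df.withColumn(
        "return",
        (F.col("price") - F.lag("price").over(w)) / F.lag("price").over(w),
    )
    result.show()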