window-functions

Group by repeating attribute

Submitted by 旧街凉风 on 2019-11-28 14:21:28
Basically I have a table messages, with a user_id field that identifies the user who created each message. When I display a conversation (a set of messages) between two users, I want to be able to group the messages by user_id, but in a tricky way. Let's say there are some messages (sorted by created_at desc):

    id: 1, user_id: 1
    id: 2, user_id: 1
    id: 3, user_id: 2
    id: 4, user_id: 2
    id: 5, user_id: 1

I want to get 3 message groups, in this order: [1,2], [3,4], [5]. It should group by user_id until it sees a different one, and then group by that one. I'm using PostgreSQL and would be happy to
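This is the classic "gaps and islands" pattern. A minimal PostgreSQL sketch, assuming a messages(id, user_id, created_at) table as described above: a run boundary is detected with lag(), turned into a group number by a running sum, and the groups are then aggregated.

    SELECT user_id,
           array_agg(id ORDER BY created_at DESC) AS message_ids
    FROM (
        SELECT id, user_id, created_at,
               -- running count of author changes = group number
               SUM(CASE WHEN user_id IS DISTINCT FROM prev_user_id
                        THEN 1 ELSE 0 END)
                   OVER (ORDER BY created_at DESC) AS grp
        FROM (
            SELECT id, user_id, created_at,
                   lag(user_id) OVER (ORDER BY created_at DESC) AS prev_user_id
            FROM messages
        ) AS with_prev
    ) AS marked
    GROUP BY grp, user_id
    ORDER BY grp;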

Window functions: PARTITION BY one column after ORDER BY another

Submitted by 家住魔仙堡 on 2019-11-28 12:46:13
Disclaimer: The problem shown here is much more general than I first expected. The example below is taken from a solution to another question, but since then I have been using this sample to solve many more problems, mostly related to time series (have a look at the "Linked" section in the sidebar). So let me first explain the problem more generally: I am using PostgreSQL, but I am sure this problem exists in other DBMSs that support window functions (MS SQL Server, Oracle, ...) as well. Window functions can be used to group certain values together by a common attribute or value. For example you can
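The body is cut off above, but the general pattern it describes usually boils down to deriving a grouping column from one ordering and then partitioning by it. A hedged sketch against a hypothetical readings(ts, status) table:

    SELECT ts, status,
           first_value(ts) OVER (PARTITION BY grp ORDER BY ts) AS group_start
    FROM (
        SELECT ts, status,
               -- running count of status changes = derived partition key
               SUM(CASE WHEN status IS DISTINCT FROM prev_status
                        THEN 1 ELSE 0 END) OVER (ORDER BY ts) AS grp
        FROM (
            SELECT ts, status,
                   lag(status) OVER (ORDER BY ts) AS prev_status
            FROM readings
        ) AS with_prev
    ) AS marked
    ORDER BY ts;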

Time difference in hours and seconds over a partition window in Teradata (Sessionizing Records)

Submitted by 青春壹個敷衍的年華 on 2019-11-28 12:11:36
Question: Given a table like this:

    cust_id  time
    123      2015-01-01 12:15:05
    123      2015-01-01 12:17:06
    123      2015-01-02 13:15:08
    123      2015-01-02 15:15:10
    456      2015-01-01 10:15:05
    456      2015-01-01 12:15:07
    456      2015-01-01 14:11:10

I would like to calculate the time difference between each record and the preceding one (think lag function), per cust_id. My desired output:

    cust_id  time                 diff_hours  diff_seconds
    123      2015-01-01 12:15:05  NULL        NULL
    123      2015-01-01 12:17:06  0.00        121
    123      2015-01-02 13:15:08  1.04        89882
    123      2015-01-02 15:15:10
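A hedged sketch of the windowing part: newer Teradata releases support LAG directly, and older ones can emulate it with MIN() over a one-row frame. Converting the timestamp difference into hours and seconds is dialect-specific and only indicated here; the table and column names are illustrative.

    SELECT cust_id,
           event_time,
           -- emulation of LAG(event_time) for releases without LAG/LEAD
           MIN(event_time) OVER (PARTITION BY cust_id
                                 ORDER BY event_time
                                 ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
               AS prev_time
    FROM cust_events;
    -- diff_seconds / diff_hours would then be derived from
    -- (event_time - prev_time), e.g. via an interval cast in Teradata.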

Select all threads and order by the latest one

Submitted by 大城市里の小女人 on 2019-11-28 11:12:26
Question: Now that I got the "Select all forums and get latest post too.. how?" question answered, I am trying to write a query that selects all threads in one particular forum and orders them by the date of the latest post (column updated_at). This is my structure again:

    forums                    forum_threads    forum_posts
    ----------                -------------    -----------
    id                        id               id
    parent_forum (NULLABLE)   forum_id         content
    name                      user_id          thread_id
    description               title            user_id
    icon                      views            updated_at
    created_at                created_at
                              updated_at
                              last_post_id
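A hedged sketch of one common way to do this, joining each thread to the timestamp of its latest post; the WHERE value is illustrative and this is not necessarily the accepted answer:

    SELECT t.*, p.last_post_at
    FROM forum_threads t
    JOIN (
        SELECT thread_id, MAX(updated_at) AS last_post_at
        FROM forum_posts
        GROUP BY thread_id
    ) p ON p.thread_id = t.id
    WHERE t.forum_id = 1   -- the forum being displayed (illustrative)
    ORDER BY p.last_post_at DESC;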

Calculating SQL Server ROW_NUMBER() OVER() for a derived table

Submitted by 倖福魔咒の on 2019-11-28 11:00:52
In some other databases (e.g. DB2, or Oracle with ROWNUM), I can omit the ORDER BY clause in a ranking function's OVER() clause. For instance:

    ROW_NUMBER() OVER()

This is particularly useful when used with ordered derived tables, such as:

    SELECT t.*, ROW_NUMBER() OVER()
    FROM (
        SELECT ... ORDER BY
    ) t

How can this be emulated in SQL Server? I've found people using this trick, but that's wrong, as it will behave non-deterministically with respect to the order from the derived table:

    -- This order here ---------------------vvvvvvvv
    SELECT t.*, ROW_NUMBER() OVER(ORDER BY (SELECT 1))
    FROM (
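A hedged sketch of the usual deterministic workaround: expose the derived table's ordering key and repeat it in OVER(ORDER BY ...); the table and column names are illustrative.

    SELECT t.*,
           ROW_NUMBER() OVER (ORDER BY t.sort_key) AS rn
    FROM (
        SELECT o.id, o.created_at AS sort_key
        FROM orders o
    ) t
    ORDER BY t.sort_key;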

Select random row for each group

Submitted by *爱你&永不变心* on 2019-11-28 10:16:40
I have a table like this:

    ID  ATTRIBUTE
    1   A
    1   A
    1   B
    1   C
    2   B
    2   C
    2   C
    3   A
    3   B
    3   C

I'd like to select just one random attribute for each ID. The result could therefore look like this (although this is just one of many options):

    ATTRIBUTE
    B
    C
    C

This is my attempt at the problem:

    SELECT "ATTRIBUTE"
    FROM (
        SELECT "ID", "ATTRIBUTE",
               row_number() OVER (PARTITION BY "ID" ORDER BY random()) rownum
        FROM table
    ) shuffled
    WHERE rownum = 1

However, I don't know if this is a good solution, as I need to introduce row numbers, which is a bit cumbersome. Do you have a better one?

select distinct on (id) id,
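The truncated line above appears to begin PostgreSQL's DISTINCT ON idiom; a hedged sketch of how that approach typically continues (not necessarily the original answer's exact code):

    SELECT DISTINCT ON (id) id, attribute
    FROM tbl                 -- illustrative table name
    ORDER BY id, random();   -- random() decides which row survives per id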

Referencing current row in FILTER clause of window function

Submitted by 為{幸葍}努か on 2019-11-28 09:09:27
In PostgreSQL 9.4 the window functions have the new option of a FILTER clause to select a subset of the window frame for processing. The documentation mentions it but provides no sample. An online search yields some samples, including from 2ndQuadrant, but all that I found were rather trivial examples with constant expressions. What I am looking for is a filter expression that includes the value of the current row. Assume I have a table with a bunch of columns, one of which is of date type:

    col1 | col2 | dt
    -----+------+-----------
       1 | a    | 2015-07-01
       2 | b    | 2015-07-03
       3 | c    | 2015-07-10
       4 | d    |
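For context: the FILTER condition is evaluated against each row entering the aggregate, so it cannot reference the current row of the window. A hedged sketch of the usual LATERAL workaround, counting rows in an illustrative 4-day trailing window:

    SELECT t.col1, t.col2, t.dt, w.cnt
    FROM tbl t
    CROSS JOIN LATERAL (
        SELECT count(*) AS cnt
        FROM tbl t2
        WHERE t2.dt BETWEEN t.dt - interval '4 days' AND t.dt
    ) w
    ORDER BY t.dt;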

PostgreSQL window function: partition by comparison

Submitted by 故事扮演 on 2019-11-28 08:49:26
I'm trying to find a way to compare against the current row in the PARTITION BY clause of a window function in a PostgreSQL query. Imagine I have the short list produced by the following query, with 5 elements (in the real case, I have thousands or even millions of rows). For each row, I am trying to get the id of the next different element (event column), and the id of the previous different element.

    WITH events AS (
        SELECT 1 as id, 12 as event, '2014-03-19 08:00:00'::timestamp as date
        UNION
        SELECT 2 as id, 12 as event, '2014-03-19 08:30:00'::timestamp as date
        UNION
        SELECT 3 as id, 13 as
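A hedged sketch of a gaps-and-islands approach, assuming the events CTE is completed as in the question and that ids ascend with date: number the runs of equal events, aggregate each run, then use lag()/lead() across runs to find the neighbouring different elements.

    WITH marked AS (
        SELECT id, event, date,
               SUM(CASE WHEN event IS DISTINCT FROM prev_event
                        THEN 1 ELSE 0 END) OVER (ORDER BY date) AS grp
        FROM (
            SELECT id, event, date,
                   lag(event) OVER (ORDER BY date) AS prev_event
            FROM events
        ) AS with_prev
    ),
    runs AS (
        SELECT grp,
               lag(max(id))  OVER (ORDER BY grp) AS prev_diff_id, -- last id of previous run
               lead(min(id)) OVER (ORDER BY grp) AS next_diff_id  -- first id of next run
        FROM marked
        GROUP BY grp
    )
    SELECT m.id, m.event, r.prev_diff_id, r.next_diff_id
    FROM marked m
    JOIN runs r USING (grp)
    ORDER BY m.date;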

Does Spark know the partitioning key of a DataFrame?

Submitted by 做~自己de王妃 on 2019-11-28 06:33:33
I want to know whether Spark knows the partitioning key of a parquet file and uses this information to avoid shuffles. Context: Running Spark 2.0.1 with a local SparkSession. I have a csv dataset that I am saving as a parquet file on my disk like so:

    val df0 = spark
      .read
      .format("csv")
      .option("header", true)
      .option("delimiter", ";")
      .option("inferSchema", false)
      .load("SomeFile.csv")

    val df = df0.repartition(partitionExprs = col("numerocarte"), numPartitions = 42)

    df.write
      .mode(SaveMode.Overwrite)
      .format("parquet")
      .option("inferSchema", false)
      .save("SomeFile.parquet")

I am creating 42
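A hedged sketch for checking this empirically: read the file back and inspect the physical plan of a query that groups by the same column. Plain parquet does not record the writer's partitioner, so an Exchange typically appears.

    import org.apache.spark.sql.functions.col

    val df2 = spark.read.parquet("SomeFile.parquet")
    df2.groupBy(col("numerocarte")).count().explain()
    // An "Exchange hashpartitioning(numerocarte, ...)" node in the plan
    // indicates a shuffle: Spark cannot reuse the original partitioning.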

Applying a Window function to calculate differences in pySpark

Submitted by 自作多情 on 2019-11-28 06:28:54
I am using pySpark, and have set up my dataframe with two columns representing a daily asset price as follows:

    ind = sc.parallelize(range(1, 5))
    prices = sc.parallelize([33.3, 31.1, 51.2, 21.3])
    data = ind.zip(prices)
    df = sqlCtx.createDataFrame(data, ["day", "price"])

Upon applying df.show() I get:

    +---+-----+
    |day|price|
    +---+-----+
    |  1| 33.3|
    |  2| 31.1|
    |  3| 51.2|
    |  4| 21.3|
    +---+-----+

Which is fine and all. I would like to have another column that contains the day-to-day returns of the price column, i.e. something like (price(day2) - price(day1)) / price(day1). After much research, I am told
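A hedged sketch of the standard lag()-over-a-Window approach; the imports and the ordering column are assumptions consistent with the setup above:

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # Ordering the whole frame by "day"; without a partitionBy this pulls
    # all rows into one partition, which is fine for a small example.
    w = Window.orderBy("day")

    result = df.withColumn(
        "return",
        (F.col("price") - F.lag("price").over(w)) / F.lag("price").over(w),
    )
    result.show()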