window-functions

Select all threads and order by the latest one

ぐ巨炮叔叔 submitted on 2019-11-29 16:56:03
Now that I got my earlier question "Select all forums and get latest post too.. how?" answered, I am trying to write a query to select all threads in one particular forum and order them by the date of the latest post (column "updated_at"). This is my structure again:

    forums                      forum_threads              forum_posts
    ----------                  -------------              -----------
    id                          id                         id
    parent_forum (NULLABLE)     forum_id                   content
    name                        user_id                    thread_id
    description                 title                      user_id
    icon                        views                      updated_at
    created_at                  created_at
                                updated_at
                                last_post_id (NULLABLE)

I tried writing this query, and it works.. but not as expected: it doesn't order the threads
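One way to approach this, sketched below under the assumption that the schema above is the whole picture, is to pick each thread's newest post with a window function and sort by that post's updated_at; the forum id (1) is just a placeholder value:

```sql
-- Sketch: threads of one forum ordered by the updated_at of their newest post.
-- The forum id (1) is a placeholder; table and column names come from the schema above.
SELECT id, title, last_post_at
FROM (
  SELECT t.id,
         t.title,
         p.updated_at AS last_post_at,
         ROW_NUMBER() OVER (PARTITION BY t.id ORDER BY p.updated_at DESC) AS rn
  FROM   forum_threads t
  JOIN   forum_posts   p ON p.thread_id = t.id
  WHERE  t.forum_id = 1
) newest
WHERE rn = 1
ORDER BY last_post_at DESC;
```

Threads without any posts would be dropped by the inner join; a LEFT JOIN plus a COALESCE on the thread's own updated_at would keep them, if that matters.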

SparkSQL - Lag function?

浪子不回头ぞ submitted on 2019-11-29 11:28:14
I see in this DataBricks post, there is support for window functions in Spark SQL; in particular I'm trying to use the lag() window function. I have rows of credit card transactions, and I've sorted them; now I want to iterate over the rows, and for each row display the amount of the transaction and the difference between the current row's amount and the preceding row's amount. Following the DataBricks post, I've come up with this query, but it's throwing an exception at me and I can't quite understand why.. This is in PySpark.. tx is my dataframe, already created and registered as a temp table. test
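For reference, the kind of query being described would look roughly like the sketch below in Spark SQL, run against the registered tx temp table; the column names (cc_num, trans_time, amount) are assumptions, not taken from the original post:

```sql
-- Sketch: each transaction's amount plus its difference from the previous
-- transaction on the same card. cc_num, trans_time and amount are assumed columns.
SELECT
  cc_num,
  trans_time,
  amount,
  amount - LAG(amount, 1) OVER (PARTITION BY cc_num ORDER BY trans_time) AS diff_from_prev
FROM tx
```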

User defined function to be applied to Window in PySpark?

夙愿已清 submitted on 2019-11-29 11:17:44
I am trying to apply a user defined function to a Window in PySpark. I have read that a UDAF might be the way to go, but I was not able to find anything concrete. To give an example (taken from here: Xinh's Tech Blog and modified for PySpark):

    from pyspark import SparkConf
    from pyspark.sql import SparkSession
    from pyspark.sql.window import Window
    from pyspark.sql.functions import avg

    spark = SparkSession.builder.master("local").config(conf=SparkConf()).getOrCreate()

    a = spark.createDataFrame([[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]], ['ind', "state"])
    customers = spark.createDataFrame(
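As a point of comparison, the built-in aggregate from that example can be applied over a window without any UDAF; a minimal Spark SQL sketch, assuming a registered temp view named customers with columns name, date and amountSpent (all of these names are assumptions here):

```sql
-- Sketch: built-in avg over a sliding window in Spark SQL.
-- "customers", "name", "date" and "amountSpent" are assumed names.
SELECT
  name,
  date,
  amountSpent,
  AVG(amountSpent) OVER (
    PARTITION BY name
    ORDER BY date
    ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
  ) AS movingAvg
FROM customers
```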

Bad optimization/planning on Postgres window-based queries (partition by(, group by?)) - 1000x speedup

假装没事ソ submitted on 2019-11-29 10:17:38
Question: We are running Postgres 9.3.5 (07/2014). We have quite a complex data-warehouse/reporting setup in place (ETL, materialized views, indexing, aggregations, analytical functions, ...). What I discovered just now may be difficult to implement in the optimizer (?), but it makes a huge difference in performance (only sample code, very similar to our query, to reduce unnecessary complexity):

    create view foo as
    select
      sum(s.plan) over w_pyl as pyl_plan, -- money planned to spend in this pot
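The truncated view above follows the usual shape for this kind of question: window aggregates computed over a named window inside a view, with a filter applied from outside. A minimal sketch of that pattern only, with the table and partition columns as assumed placeholders rather than the original schema:

```sql
-- Sketch of the pattern only: window aggregate in a view, filtered from outside.
-- "spendings", "pot_id" and "year" are assumed placeholder names.
create view foo as
select
  s.pot_id,
  s.year,
  sum(s.plan) over w_pyl as pyl_plan   -- money planned to spend in this pot/year
from spendings s
window w_pyl as (partition by s.pot_id, s.year);

-- The outer query then restricts to one partition:
select * from foo where pot_id = 42 and year = 2014;
```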

Partitioning by multiple columns in Spark SQL

别来无恙 submitted on 2019-11-29 09:34:26
Question: With Spark SQL's window functions, I need to partition by multiple columns to run my data queries, as follows:

    val w = Window.partitionBy($"a").partitionBy($"b").rangeBetween(-100, 0)

I currently do not have a test environment (I am working on setting this up), but as a quick question: is this currently supported as part of Spark SQL's window functions, or will this not work?

Answer 1: This won't work. The second partitionBy will overwrite the first one. Both partition columns have to be specified
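Expressed in Spark SQL rather than the DataFrame API, the intended window, with both columns in a single PARTITION BY clause and the same -100..0 range frame, would look like the sketch below; the table name, ordering column and aggregated column are placeholders:

```sql
-- Sketch: one window partitioned by both columns with a range frame.
-- "t", "ts" and "value" are assumed placeholder names.
SELECT
  a,
  b,
  SUM(value) OVER (
    PARTITION BY a, b
    ORDER BY ts
    RANGE BETWEEN 100 PRECEDING AND CURRENT ROW
  ) AS running_sum
FROM t
```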

spark sql window function lag

戏子无情 submitted on 2019-11-29 01:40:57
I am looking at the window slide function for a Spark DataFrame in Spark SQL, Scala. I have a dataframe with columns Col1, Col2, Col3, date and volume.

    date     volume   new_col
    201601   100.5
    201602   120.6    100.5
    201603   450.2    120.6
    201604   200.7    450.2
    201605   121.4    200.7

Now I want to add a new column named new_col with the volume slid down by one row, as shown above. I tried the option below using the window function:

    val windSldBrdrxNrx_df = df.withColumn("Prev_brand_rx", lag("Prev_brand_rx", 1))

Can anyone please help me with how to do this?

You are doing it correctly; all you missed is over(window expression) on
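Once the dataframe is registered as a temp view, the same lag-with-window can be written directly in Spark SQL; a minimal sketch, where the view name df_view is a placeholder:

```sql
-- Sketch: previous row's volume, ordered by date.
-- "df_view" is an assumed temp view name for the dataframe shown above.
SELECT
  date,
  volume,
  LAG(volume, 1) OVER (ORDER BY date) AS new_col
FROM df_view
```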

Speed of paged queries in Oracle

好久不见. submitted on 2019-11-28 20:50:39
This is a never-ending topic for me and I'm wondering if I might be overlooking something. Essentially I use two types of SQL statements in an application:

    1. Regular queries with a "fallback" limit
    2. Sorted and paged queries

Now, we're talking about queries against tables with several million records, joined to 5 more tables with several million records. Clearly, we hardly want to fetch all of them; that's why we have the above two methods to limit user queries. Case 1 is really simple. We just add an additional ROWNUM filter:

    WHERE ... AND ROWNUM < ?

That's quite fast, as Oracle's CBO will
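Case 2 (sorted and paged queries) is not shown above, but the classic Oracle approach uses a window function for it; a minimal sketch, with the table, sort columns and page bounds all as placeholders:

```sql
-- Sketch: sorted, paged query in Oracle using ROW_NUMBER().
-- "orders", "order_date", "customer_id" and the bind variables are assumed placeholders.
SELECT *
FROM (
  SELECT o.*,
         ROW_NUMBER() OVER (ORDER BY o.order_date DESC, o.id DESC) AS rn
  FROM   orders o
  WHERE  o.customer_id = :customer_id
)
WHERE rn BETWEEN :first_row AND :last_row
ORDER BY rn
```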

Select Row number in postgres

狂风中的少年 submitted on 2019-11-28 16:55:56
How do I select a row number in Postgres? I tried this:

    select row_number() over (ORDER BY cgcode_odc_mapping_id) as rownum,
           cgcode_odc_mapping_id
    from access_odc.access_odc_mapping_tb
    order by cgcode_odc_mapping_id

and got this error:

    ERROR: syntax error at or near "over"
    LINE 1: select row_number() over (ORDER BY cgcode_odc_mapping_id)as

I have checked these pages: How to show row numbers in PostgreSQL query?

This is my query:

    select row_number() over (ORDER BY cgcode_odc_mapping_id) as rownum,
           cgcode_odc_mapping_id
    from access_odc.access_odc_mapping_tb
    order by cgcode_odc_mapping_id

this is the

What is ROWS UNBOUNDED PRECEDING used for in Teradata?

微笑、不失礼 submitted on 2019-11-28 16:42:14
I am just starting on Teradata and I have come across an Ordered Analytical Function called "Rows unbounded preceding". I tried several sites to learn about the function, but all of them use a complicated example to explain it. Could you please provide me with a simple example so that I can get the basics clear? It's the "frame" or "range" clause of window functions, which are part of the SQL standard and implemented in many databases, including Teradata. A simple example would be to calculate the average amount in a frame of three days. I'm using PostgreSQL syntax for the
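A minimal sketch of two frame clauses in standard SQL follows; the table and column names (payments, payment_date, amount) are placeholders. ROWS UNBOUNDED PRECEDING is simply shorthand for ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, i.e. the frame starts at the first row of the partition and ends at the current row:

```sql
-- Sketch: two frame clauses on ordered windows.
-- "payments", "payment_date" and "amount" are assumed placeholder names.
SELECT
  payment_date,
  amount,
  -- average over the current row and the two preceding rows (a frame of three rows)
  AVG(amount) OVER (ORDER BY payment_date
                    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS avg_3_rows,
  -- running total: everything from the first row up to the current row
  SUM(amount) OVER (ORDER BY payment_date
                    ROWS UNBOUNDED PRECEDING) AS running_total
FROM payments
```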

Ordered count of consecutive repeats / duplicates

时光毁灭记忆、已成空白 submitted on 2019-11-28 14:31:07
I highly doubt I'm doing this in the most efficient manner, which is why I tagged plpgsql on here. I need to run this on 2 billion rows for a thousand measurement systems. You have measurement systems that often report the previous value when they lose connectivity, and they lose connectivity in spurts often, but sometimes for a long time. You need to aggregate, but when you do so, you need to look at how long the value was repeating and apply various filters based on that information. Say you are measuring mpg on a car but it's stuck at 20 mpg for an hour, then moves around to 20.1 and so on. You'll
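One common way to get an ordered count of consecutive repeats is the row_number-difference ("gaps and islands") approach; a minimal sketch below, where the table and column names (readings, system_id, read_at, mpg) are placeholders rather than the poster's actual schema:

```sql
-- Sketch: length of each unbroken run of identical readings per system.
-- "readings", "system_id", "read_at" and "mpg" are assumed placeholder names.
SELECT
  system_id,
  mpg,
  MIN(read_at) AS run_started,
  COUNT(*)     AS consecutive_repeats
FROM (
  SELECT
    system_id,
    read_at,
    mpg,
    -- rows belonging to the same unbroken run of one value share the same group key
    ROW_NUMBER() OVER (PARTITION BY system_id ORDER BY read_at)
      - ROW_NUMBER() OVER (PARTITION BY system_id, mpg ORDER BY read_at) AS grp
  FROM readings
) runs
GROUP BY system_id, mpg, grp
ORDER BY system_id, run_started;
```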