window-functions

row_number() over partition in hql

人盡茶涼 Submitted on 2019-12-04 07:54:29
What is the equivalent of row_number() over partition in HQL? I have the following query in HQL:

    select s.Companyname, p.Productname,
           sum(od.Unitprice * od.Quantity - od.Discount) as SalesAmount
    FROM OrderDetails as od
    inner join od.Orders as o
    inner join od.Products as p
    inner join p.Suppliers as s
    where o.Orderdate between '2010/01/01' and '2014/01/01'
    GROUP BY s.Companyname, p.Productname

I want to partition by s.Companyname where RowNumber <= n. As far as I know, you cannot use row_number() in either HQL or JPQL. I propose to use a native SQL query in this case:
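A minimal sketch of what such a native query could look like. The join keys (OrderId, ProductId, SupplierId) and the ordering within each partition (by sales amount) are assumptions, since neither appears in the HQL above:

    SELECT CompanyName, ProductName, SalesAmount
    FROM (
        SELECT s.CompanyName,
               p.ProductName,
               SUM(od.UnitPrice * od.Quantity - od.Discount) AS SalesAmount,
               ROW_NUMBER() OVER (PARTITION BY s.CompanyName
                                  ORDER BY SUM(od.UnitPrice * od.Quantity - od.Discount) DESC) AS rn
        FROM OrderDetails od
        JOIN Orders    o ON o.OrderId    = od.OrderId
        JOIN Products  p ON p.ProductId  = od.ProductId
        JOIN Suppliers s ON s.SupplierId = p.SupplierId
        WHERE o.OrderDate BETWEEN '2010-01-01' AND '2014-01-01'
        GROUP BY s.CompanyName, p.ProductName
    ) ranked
    WHERE rn <= 3;   -- keep the top n products per company (n = 3 here, for illustration)

The window function is evaluated after the GROUP BY, so it can rank on the aggregated sales amount directly.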

Selecting positive aggregate value and ignoring negative in Postgres SQL

微笑、不失礼 Submitted on 2019-12-04 07:51:16
I must apply a certain transformation fn(argument). Here argument is normally equal to value, except when value is negative: once a negative value appears, you "wait" until it has been summed with the following values and that running sum becomes positive again; only then do you apply fn(argument). This is the result I want:

    value  argument
    ---------------
        2         2
        3         3
      -10         0
        4         0
        3         0
       10         7
        1         1

I could have summed all values and applied fn to the total, but fn can differ from row to row, so it is essential to know the row number in order to choose the concrete fn. Since I want a Postgres SQL solution, window functions look like a fit,
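Carrying a negative balance forward is inherently sequential, so plain window functions alone are awkward here; one option is a recursive CTE that walks the rows in order. A rough sketch, assuming a hypothetical table t(id, value) where id defines the row order:

    WITH RECURSIVE ordered AS (
        SELECT value, row_number() OVER (ORDER BY id) AS rn
        FROM t
    ), walk AS (
        -- first row: a negative value immediately starts a "debt", otherwise it is emitted as-is
        SELECT rn, value,
               GREATEST(value, 0) AS argument,
               LEAST(value, 0)    AS debt
        FROM ordered WHERE rn = 1
        UNION ALL
        -- later rows: add the value to the carried debt; emit the sum once it turns non-negative
        SELECT o.rn, o.value,
               GREATEST(w.debt + o.value, 0),
               LEAST(w.debt + o.value, 0)
        FROM walk w
        JOIN ordered o ON o.rn = w.rn + 1
    )
    SELECT rn, value, argument FROM walk ORDER BY rn;

On the sample data this yields arguments 2, 3, 0, 0, 0, 7, 1, and rn is available for picking the concrete fn per row.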

Spark Scala : Getting Cumulative Sum (Running Total) Using Analytical Functions

末鹿安然 Submitted on 2019-12-04 05:02:24
I am implementing a cumulative sum (running total) in Spark using a window function, but the input order of the records is not maintained when the window partition function is applied. Input data:

    val base = List(
        List("10", "MILLER", "1300", "2017-11-03"),
        List("10", "Clark",  "2450", "2017-12-9"),
        List("10", "King",   "5000", "2018-01-28"),
        List("30", "James",  "950",  "2017-10-18"),
        List("30", "Martin", "1250", "2017-11-21"),
        List("30", "Ward",   "1250", "2018-02-05"))
      .map(row => (row(0), row(1), row(2), row(3)))

    val DS1 = base.toDF("dept_no", "emp_name", "sal", "date")
    DS1.show()
    +-------+--------+----+----------+
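A sketch of the intended running total expressed in Spark SQL. It assumes DS1 has been registered as a temp view named emp; the point is that the window's ORDER BY (here the date column cast to a date), not the input order of the records, determines how rows are accumulated:

    -- Hypothetical: DS1.createOrReplaceTempView("emp")
    SELECT dept_no, emp_name, sal, date,
           SUM(CAST(sal AS DOUBLE)) OVER (
               PARTITION BY dept_no
               ORDER BY TO_DATE(date)
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total
    FROM emp
    ORDER BY dept_no, TO_DATE(date);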

Fill in missing rows when aggregating over multiple fields in Postgres

戏子无情 Submitted on 2019-12-04 03:57:04
Question: I am aggregating sales for a set of products per day using Postgres and need to know not just when sales do happen, but also when they do not, for further processing.

    SELECT sd.date, COUNT(sd.sale_id) AS sales, sd.product
    FROM sales_data sd    -- sales per product, per day
    GROUP BY sd.product, sd.date
    ORDER BY sd.product, sd.date

This produces the following:

    date       | sales | product
    -----------+-------+---------
    2017-08-17 |    10 | soap
    2017-08-19 |     2 | soap
    2017-08-20 |     5 | soap
    2017-08
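A common way to fill the missing days is to build the full calendar with generate_series, cross join it with the distinct products, and LEFT JOIN the sales back in. A sketch, with the date range hard-coded here purely for illustration:

    SELECT d.day::date AS date,
           COUNT(sd.sale_id) AS sales,   -- 0 on days with no sales
           p.product
    FROM generate_series('2017-08-17'::date, '2017-08-20'::date, interval '1 day') AS d(day)
    CROSS JOIN (SELECT DISTINCT product FROM sales_data) AS p
    LEFT JOIN sales_data sd
           ON sd.date = d.day::date AND sd.product = p.product
    GROUP BY d.day, p.product
    ORDER BY p.product, d.day;

COUNT counts only non-NULL sale_id values, so days without a matching sale come out as 0 rather than being dropped.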

Spark Task not serializable with lag Window function

柔情痞子 Submitted on 2019-12-04 02:17:21
I've noticed that after I use a window function over a DataFrame, if I then call map() with a function, Spark throws a "Task not serializable" exception. This is my code:

    val hc: org.apache.spark.sql.hive.HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    import hc.implicits._
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    def f(): String = "test"
    case class P(name: String, surname: String)

    val lag_result: org.apache.spark.sql.Column =
      lag($"name", 1).over(Window.partitionBy($"surname"))
    val lista: List[P] = List(P("N1", "S1"), P("N2", "S2"), P("N2", "S2"))
    val
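This exception usually means the closure passed to map() captured something non-serializable from the driver, for example the HiveContext or the Column value held in lag_result. One way to sidestep it is to keep the window logic inside the query itself and only map over the plain result rows afterwards. A rough Spark SQL sketch, assuming lista has been turned into a DataFrame and registered as a temp view named people (lag also needs an ordered window, so an ORDER BY is added here):

    -- Hypothetical: lista.toDF().createOrReplaceTempView("people")
    SELECT name,
           surname,
           LAG(name, 1) OVER (PARTITION BY surname ORDER BY name) AS prev_name
    FROM people;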

Jump SQL gap over specific condition & proper lead() usage

岁酱吖の Submitted on 2019-12-04 02:03:59
Question: (PostgreSQL 8.4) Continuing with my previous example, I wish to further my understanding of gaps-and-islands processing with window functions. Consider the following table and data:

    CREATE TABLE T1 (
        id     SERIAL PRIMARY KEY,
        val    INT,  -- some device
        status INT   -- 0=OFF, 1=ON
    );

    INSERT INTO T1 (val, status) VALUES (10, 0);
    INSERT INTO T1 (val, status) VALUES (11, 0);
    INSERT INTO T1 (val, status) VALUES (11, 1);
    INSERT INTO T1 (val, status) VALUES (10, 1);
    INSERT INTO T1 (val, status) VALUES (11,
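A hedged sketch of the two building blocks the title refers to, using only the columns defined above: lead() to peek at the next row's status, and the classic row_number() difference trick that gives every island of consecutive rows with the same status a constant label (both are available in PostgreSQL 8.4):

    SELECT id, val, status,
           LEAD(status) OVER (ORDER BY id) AS next_status,   -- NULL on the last row
           ROW_NUMBER() OVER (ORDER BY id)
             - ROW_NUMBER() OVER (PARTITION BY status ORDER BY id) AS island   -- constant within each run of equal status
    FROM T1
    ORDER BY id;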

first_value windowing function in pyspark

白昼怎懂夜的黑 Submitted on 2019-12-04 01:38:39
Question: I am using PySpark 1.5, getting my data from Hive tables, and trying to use windowing functions. According to this, there exists an analytic function called firstValue that will give me the first non-null value for a given window. I know this exists in Hive, but I cannot find it in PySpark anywhere. Is there a way to implement this, given that PySpark won't allow UserDefinedAggregateFunctions (UDAFs)?

Answer 1: Spark >= 2.0: first takes an optional ignorenulls argument which can mimic the behavior
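Since the data already comes from Hive, one route is to express the window in SQL, where first_value takes a second boolean argument meaning "skip NULLs". A sketch against a hypothetical table events(grp, ts, reading):

    SELECT grp, ts, reading,
           FIRST_VALUE(reading, TRUE) OVER (   -- TRUE = ignore NULLs
               PARTITION BY grp
               ORDER BY ts
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS first_non_null_so_far
    FROM events;

The same syntax is accepted by Spark SQL in newer versions, which is what the Spark >= 2.0 answer above alludes to with first(..., ignorenulls).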

Apache Spark Window function with nested column

点点圈 Submitted on 2019-12-03 21:15:19
I'm not sure whether this is a bug or just incorrect syntax. I searched around and didn't see it mentioned elsewhere, so I'm asking here before filing a bug report. I'm trying to use a window function partitioned on a nested column. I've created a small example below demonstrating the problem.

    import sqlContext.implicits._
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.expressions.Window

    val data = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num")
      .withColumn("Data", struct("A", "B", "C")).drop("A").drop("B").drop("C")
    val winSpec = Window.partitionBy("Data
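For comparison, the same partitioning expressed in Spark SQL, where the nested field is referenced with ordinary dot notation; a sketch assuming the DataFrame above has been registered as a temp view named tbl:

    -- Hypothetical: data.createOrReplaceTempView("tbl")
    SELECT Data, num,
           SUM(num) OVER (PARTITION BY Data.A) AS sum_by_nested_a   -- partition on the nested field Data.A
    FROM tbl;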

Average stock history table

北慕城南 Submitted on 2019-12-03 20:25:45
Question: I have a table that tracks stock changes through time for some stores and products. The value is the absolute stock, but we only insert a new row when the stock changes. This design keeps the table small, because it is expected to grow rapidly. This is an example schema and some test data:

    CREATE TABLE stocks (
        id         serial  NOT NULL,
        store_id   integer NOT NULL,
        product_id integer NOT NULL,
        date       date    NOT NULL,
        value      integer NOT NULL,
        CONSTRAINT stocks_pkey PRIMARY KEY (id),
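A sketch of one way to average such a change-log in Postgres: use lead() to find how long each recorded value stays valid, then weight each value by that number of days. The end of the reporting period is hard-coded here as an assumption:

    WITH ranges AS (
        SELECT store_id, product_id, value, date,
               COALESCE(LEAD(date) OVER (PARTITION BY store_id, product_id ORDER BY date),
                        DATE '2018-01-01') AS valid_until   -- assumed end of the period
        FROM stocks
    )
    SELECT store_id, product_id,
           SUM(value * (valid_until - date))::numeric
             / NULLIF(SUM(valid_until - date), 0) AS avg_stock   -- day-weighted average
    FROM ranges
    GROUP BY store_id, product_id;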

pyspark: count distinct over a window

时光毁灭记忆、已成空白 Submitted on 2019-12-03 17:07:33
Question: I just tried doing a countDistinct over a window and got this error:

    AnalysisException: u'Distinct window functions are not supported: count(distinct color#1926)

Is there a way to do a distinct count over a window in PySpark? Here's some example code:

    from pyspark.sql.window import Window
    from pyspark.sql import functions as F

    # function to calculate number of seconds from number of days
    days = lambda i: i * 86400

    df = spark.createDataFrame([(17, "2017-03-10T15:27:18+00:00", "orange"),
                                (13,
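The usual workaround is to collect the distinct values into a set inside the window and take its size, since collect_set is not a "distinct" aggregate. A Spark SQL sketch, assuming df has been registered as a temp view named sales with a timestamp string column ts and a column color; mirroring the days lambda above, the frame is expressed in seconds over the epoch-seconds order key:

    -- Hypothetical: df.createOrReplaceTempView("sales")
    SELECT ts, color,
           SIZE(COLLECT_SET(color) OVER (
               ORDER BY CAST(CAST(ts AS TIMESTAMP) AS BIGINT)     -- seconds since the epoch
               RANGE BETWEEN 604800 PRECEDING AND CURRENT ROW     -- 7 days * 86400 seconds
           )) AS distinct_colors_7d
    FROM sales;

approx_count_distinct over the same window is another option when an exact count is not required.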