How to use the lag/lead functions in a Spark streaming application?

Submitted by ☆樱花仙子☆ on 2020-01-10 05:26:18

Question


I am using spark-sql 2.4.x and the datastax-spark-cassandra-connector for Cassandra 3.x, along with Kafka.

I have a scenario where finance data arrives from a Kafka topic with the fields companyId, year, quarter, sales and prev_sales.

val kafkaDf = sc.parallelize(Seq((15, 2016, 4, 100.5, ""))).toDF("companyId", "year", "quarter", "sales", "prev_sales")

I need to populate prev_sales with the sales value for the same quarter of the previous year, taken from a Cassandra table that looks something like this:

val cassandraTabledf = sc.parallelize(Seq(
  (15, 2016, 3, 120.6, 320.6),
  (15, 2016, 2, 450.2, 650.2),
  (15, 2016, 1, 200.7, 700.7),
  (15, 2015, 4, 221.4, 400.0),
  (15, 2015, 3, 320.6, 300.0),
  (15, 2015, 2, 650.2, 200.0),
  (15, 2015, 1, 700.7, 100.0))).toDF("companyId", "year", "quarter", "sales", "prev_sales")

That is, for the incoming row (15, 2016, 4, 100.5, ""), prev_sales should come from year 2015, quarter 4, i.e. 221.4.

So the new row is:

(15,2016, 4, 100.5,221.4)

How can I achieve this? We could query Cassandra explicitly, but is there a way to use the "lag" function with a join on the Cassandra table?


Answer 1:


I don't think this requires the lag or lead functions. You can get the desired output with a join. See the code below for reference:

Note: I have added more rows to kafkaDf to make the example clearer.

scala> kafkaDf.show(false)
+---------+----+-------+-----+----------+
|companyId|year|quarter|sales|prev_sales|
+---------+----+-------+-----+----------+
|15       |2016|4      |100.5|          |
|15       |2016|1      |115.8|          |
|15       |2016|3      |150.1|          |
+---------+----+-------+-----+----------+


scala> cassandraTabledf.show
+---------+----+-------+-----+----------+
|companyId|year|quarter|sales|prev_sales|
+---------+----+-------+-----+----------+
|       15|2016|      3|120.6|     320.6|
|       15|2016|      2|450.2|     650.2|
|       15|2016|      1|200.7|     700.7|
|       15|2015|      4|221.4|       400|
|       15|2015|      3|320.6|       300|
|       15|2015|      2|650.2|       200|
|       15|2015|      1|700.7|       100|
+---------+----+-------+-----+----------+


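scala> import org.apache.spark.sql.functions.{col, when}

scala> kafkaDf.alias("k").join(
         cassandraTabledf.alias("c"),
         // match the same company and quarter in the previous year
         col("k.companyId") === col("c.companyId") &&
         col("k.quarter") === col("c.quarter") &&
         (col("k.year") - 1) === col("c.year"),
         "left"
       )
       .drop("prev_sales") // removes the prev_sales column of both sides
       .select(col("k.*"), col("c.sales").alias("prev_sales"))
       // fall back to the current sales when no previous-year row exists
       .withColumn("prev_sales", when(col("prev_sales").isNull, col("sales")).otherwise(col("prev_sales")))
       .show()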
+---------+----+-------+-----+----------+
|companyId|year|quarter|sales|prev_sales|
+---------+----+-------+-----+----------+
|       15|2016|      1|115.8|     700.7|
|       15|2016|      3|150.1|     320.6|
|       15|2016|      4|100.5|     221.4|
+---------+----+-------+-----+----------+
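For comparison, since the question asked about lag specifically, here is a minimal sketch of a lag-based variant on plain batch DataFrames. It is not part of the answer above. The isNew marker column, the local SparkSession and the trimmed-down history rows are assumptions made for illustration; in the real job the history would come from the Cassandra table. As far as I know, non-time-based window functions such as lag are not supported directly on a streaming DataFrame, so in a streaming job this would have to run per micro-batch (for example inside foreachBatch), which is one reason the join is the simpler option.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, lit}

// Local session for the sketch only; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("lag-sketch").getOrCreate()
import spark.implicits._

// Incoming row(s); isNew is a hypothetical marker so they can be picked out again later.
val incoming = Seq((15, 2016, 4, 100.5))
  .toDF("companyId", "year", "quarter", "sales")
  .withColumn("isNew", lit(true))

// Stand-in for the Cassandra history (same sample values as in the question).
val history = Seq(
  (15, 2016, 3, 120.6),
  (15, 2015, 4, 221.4),
  (15, 2015, 3, 320.6)
).toDF("companyId", "year", "quarter", "sales")
  .withColumn("isNew", lit(false))

// One window partition per (company, quarter), ordered by year,
// so lag(1) returns the closest earlier year for that quarter.
val byQuarter = Window.partitionBy("companyId", "quarter").orderBy("year")

val result = incoming.union(history)
  .withColumn("prev_sales", lag(col("sales"), 1).over(byQuarter))
  .filter(col("isNew"))
  .drop("isNew")

result.show()
// expected single row: 15, 2016, 4, 100.5, 221.4
```

Note that lag returns the previous row in the window, which is the previous year only when no year is missing for that company and quarter. The join on (year - 1) in the answer does not have that caveat.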


Source: https://stackoverflow.com/questions/59558469/how-to-use-lag-lead-function-in-spark-streaming-application
