Question
I am using spark-sql 2.4.x and the datastax-spark-cassandra-connector for Cassandra 3.x, along with Kafka.
I have finance data arriving from a Kafka topic with the fields companyId, year, quarter, sales and prev_sales.
val kafkaDf = sc.parallelize(Seq((15, 2016, 4, 100.5, ""))).toDF("companyId", "year", "quarter", "sales", "prev_sales")
I need to populate prev_sales with the sales value for the same quarter of the previous year, taken from a Cassandra table that looks something like this:
// prev_sales values are written as Doubles so every tuple has the same type
val cassandraTabledf = sc.parallelize(Seq(
  (15, 2016, 3, 120.6, 320.6),
  (15, 2016, 2, 450.2, 650.2),
  (15, 2016, 1, 200.7, 700.7),
  (15, 2015, 4, 221.4, 400.0),
  (15, 2015, 3, 320.6, 300.0),
  (15, 2015, 2, 650.2, 200.0),
  (15, 2015, 1, 700.7, 100.0))).toDF("companyId", "year", "quarter", "sales", "prev_sales")
That is, for the row (15, 2016, 4, 100.5, "") the value should come from year 2015, quarter 4, which is 221.4,
so the new row is
(15, 2016, 4, 100.5, 221.4)
How can I achieve this? We could query Cassandra explicitly, but is there a way to use the "lag" function with a join on the Cassandra table?
Answer 1:
I don't think this requires the lag or lead functions. You can get the desired output with a join. See the code below for reference:
Note: I have added more rows to kafkaDf to make the example clearer.
scala> kafkaDf.show(false)
+---------+----+-------+-----+----------+
|companyId|year|quarter|sales|prev_sales|
+---------+----+-------+-----+----------+
|15       |2016|4      |100.5|          |
|15       |2016|1      |115.8|          |
|15       |2016|3      |150.1|          |
+---------+----+-------+-----+----------+
scala> cassandraTabledf.show
+---------+----+-------+-----+----------+
|companyId|year|quarter|sales|prev_sales|
+---------+----+-------+-----+----------+
|       15|2016|      3|120.6|     320.6|
|       15|2016|      2|450.2|     650.2|
|       15|2016|      1|200.7|     700.7|
|       15|2015|      4|221.4|       400|
|       15|2015|      3|320.6|       300|
|       15|2015|      2|650.2|       200|
|       15|2015|      1|700.7|       100|
+---------+----+-------+-----+----------+
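scala> import org.apache.spark.sql.functions.{col, when}
scala> kafkaDf.alias("k").join(
  cassandraTabledf.alias("c"),
  // match the same company and quarter, one year earlier
  col("k.companyId") === col("c.companyId") &&
  col("k.quarter") === col("c.quarter") &&
  (col("k.year") - 1) === col("c.year"),
  "left"
)
.drop("prev_sales") // removes prev_sales from both sides of the join
.select(col("k.*"), col("c.sales").alias("prev_sales"))
// when no prior-year row exists, fall back to the current sales value
.withColumn("prev_sales", when(col("prev_sales").isNull, col("sales")).otherwise(col("prev_sales")))
.show()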
+---------+----+-------+-----+----------+
|companyId|year|quarter|sales|prev_sales|
+---------+----+-------+-----+----------+
|       15|2016|      1|115.8|     700.7|
|       15|2016|      3|150.1|     320.6|
|       15|2016|      4|100.5|     221.4|
+---------+----+-------+-----+----------+
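Since the question asks specifically about lag, here is a rough sketch of how that could look, assuming the incoming rows and the historical rows can be unioned into one DataFrame. The names incoming, history, isNew and byQuarter are made up for illustration, the prev_sales column is left out of the inputs for brevity, and a local SparkSession stands in for the real job. Note that lag returns the previous row in the window, which is the previous year only if no year is missing for that company and quarter, and the ordering is ambiguous if the same company, year and quarter appears on both sides, so the join above is probably the safer choice.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, lit}

// Local session for this sketch only (assumption); a real job would reuse its own session.
val spark = SparkSession.builder().master("local[*]").appName("lag-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the Kafka rows, without the prev_sales column.
val incoming = Seq((15, 2016, 4, 100.5)).toDF("companyId", "year", "quarter", "sales")

// Hypothetical stand-in for the Cassandra table, same columns.
val history = Seq(
  (15, 2016, 3, 120.6), (15, 2016, 2, 450.2), (15, 2016, 1, 200.7),
  (15, 2015, 4, 221.4), (15, 2015, 3, 320.6), (15, 2015, 2, 650.2), (15, 2015, 1, 700.7)
).toDF("companyId", "year", "quarter", "sales")

// Tag each side so the incoming rows can be picked out again after the window step.
val combined = incoming.withColumn("isNew", lit(true))
  .unionByName(history.withColumn("isNew", lit(false)))

// Within one company and quarter, order by year; lag(sales, 1) is the previous row's sales.
val byQuarter = Window.partitionBy("companyId", "quarter").orderBy("year")

val result = combined
  .withColumn("prev_sales", lag(col("sales"), 1).over(byQuarter))
  .filter(col("isNew")) // keep only the incoming rows
  .drop("isNew")

result.show()
```

With the sample rows, the incoming (15, 2016, 4, 100.5) row ends up with prev_sales = 221.4, the 2015 quarter 4 value.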
Source: https://stackoverflow.com/questions/59558469/how-to-use-lag-lead-function-in-spark-streaming-application