Question
I have a dataframe with start_date, end_date, and sales_target. I have added code to identify the number of quarters between the date range and, accordingly, I am able to split the sales_target across that number of quarters using a UDF.
df = sqlContext.createDataFrame([("2020-01-01","2020-12-31","15"),("2020-04-01","2020-12-31","11"),("2020-07-01","2020-12-31","3")], ["start_date","end_date","sales_target"])
+----------+----------+------------+
|start_date| end_date |sales_target|
+----------+----------+------------+
|2020-01-01|2020-12-31| 15|
|2020-04-01|2020-12-31| 11|
|2020-07-01|2020-12-31| 3|
+----------+----------+------------+
Following is the dataframe after calculating the number of quarters and splitting the sales_target using a UDF.
df.createOrReplaceTempView("df_temp")
spark.sql('select *, round(months_between(end_date, start_date)/3) as num_qtrs from df_temp').createOrReplaceTempView("df_temp")
spark.sql("select *, st_udf(cast(sales_target as integer), cast(num_qtrs as integer)) as sales_target_n from df_temp").createOrReplaceTempView("df_temp")
+----------+----------+--------+---------------+
|start_date| end_date |num_qtrs|sales_target_n |
+----------+----------+--------+---------------+
|2020-01-01|2020-12-31| 4| [4,4,4,3] |
|2020-04-01|2020-12-31| 3| [4,4,3] |
|2020-07-01|2020-12-31| 2| [2,1] |
+----------+----------+--------+---------------+
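For reference, the st_udf used above is not reproduced in the question; a minimal sketch of a UDF that would produce arrays like these (my assumption: split the target evenly across the quarters and give the remainder to the earliest quarters, e.g. 15 over 4 quarters -> [4, 4, 4, 3]) could look like this:
from pyspark.sql.types import ArrayType, IntegerType

def split_target(target, noq):
    # distribute `target` over `noq` quarters; the earliest quarters get the extra units
    base, rem = divmod(target, noq)
    return [base + 1 if i < rem else base for i in range(noq)]

spark.udf.register("st_udf", split_target, ArrayType(IntegerType()))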
After exploding the sales_target_n array, I am able to get the following results:
+----------+----------+--------+-------------+---------------+------------------+
|start_date| end_date |num_qtrs|sales_target |sales_target_n | sales_target_new |
+----------+----------+--------+-------------+---------------+------------------+
|2020-01-01|2020-12-31| 4| 15 | [4,4,4,3] | 4 |
|2020-01-01|2020-12-31| 4| 15 | [4,4,4,3] | 4 |
|2020-01-01|2020-12-31| 4| 15 | [4,4,4,3] | 4 |
|2020-01-01|2020-12-31| 4| 15 | [4,4,4,3] | 3 |
|2020-04-01|2020-12-31| 3| 11 | [4,4,3] | 4 |
|2020-04-01|2020-12-31| 3| 11 | [4,4,3] | 4 |
|2020-04-01|2020-12-31| 3| 11 | [4,4,3] | 3 |
|2020-07-01|2020-12-31| 2| 3 | [2,1] | 2 |
|2020-07-01|2020-12-31| 2| 3 | [2,1] | 1 |
+----------+----------+--------+-------------+---------------+------------------+
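The explode step itself is not shown; presumably it was something along these lines (the exact code may differ):
spark.sql("select *, explode(sales_target_n) as sales_target_new from df_temp").createOrReplaceTempView("df_temp")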
I need help adding different start/end dates to each row, depending upon the num_qtrs value, so that I get a dataframe like the one below.
+----------+----------+--------+-------------+------------------+--------------+--------------+
|start_date| end_date |num_qtrs|sales_target_n| sales_target_new |new_start_date| new_end_date |
+----------+----------+--------+-------------+------------------+--------------+--------------+
|2020-01-01|2020-12-31| 4| [4,4,4,3] | 4 |2020-01-01 |2020-03-31 |
|2020-01-01|2020-12-31| 4| [4,4,4,3] | 4 |2020-04-01 |2020-06-30 |
|2020-01-01|2020-12-31| 4| [4,4,4,3] | 4 |2020-07-01 |2020-09-30 |
|2020-01-01|2020-12-31| 4| [4,4,4,3] | 3 |2020-10-01 |2020-12-31 |
|2020-04-01|2020-12-31| 3| [4,4,3] | 4 |2020-04-01 |2020-06-30 |
|2020-04-01|2020-12-31| 3| [4,4,3] | 4 |2020-07-01 |2020-09-30 |
|2020-04-01|2020-12-31| 3| [4,4,3] | 3 |2020-10-01 |2020-12-31 |
|2020-07-01|2020-12-31| 2| [2,1] | 2 |2020-07-01 |2020-09-30 |
|2020-07-01|2020-12-31| 2| [2,1] | 1 |2020-10-01 |2020-12-31 |
+----------+----------+--------+-------------+------------------+--------------+--------------+
Can someone please help me with a PySpark code sample to achieve the desired results above?
UPDATE ON SEQUENCE ERROR: Thanks
Answer 1:
Consider the following to be your input dataframe after applying your UDF.
Input:
+----------+----------+--------+--------------+
|start_date| end_date|num_qtrs|sales_target_n|
+----------+----------+--------+--------------+
|2020-01-01|2020-12-31| 4| [4, 4, 4, 3]|
|2020-04-01|2020-12-31| 3| [4, 4, 3]|
|2020-07-01|2020-12-31| 2| [2, 1]|
+----------+----------+--------+--------------+
You can use a combination of row_number, add_months and date_add to obtain your desired output, as shown below:
from pyspark.sql.functions import explode, row_number, expr, desc
from pyspark.sql import Window

# number each exploded quarter within its start_date group
window = Window.partitionBy('start_date').orderBy(desc("sales_target_new"))

df.withColumn('sales_target_new', explode('sales_target_n')).\
    withColumn('row_num', row_number().over(window)).\
    withColumn('new_start_date', expr("add_months(start_date, (row_num-1) * 3)")).\
    withColumn('new_end_date', expr("add_months(date_add(start_date, -1), row_num * 3)")).\
    orderBy('start_date', 'row_num').show()
Output:
+----------+----------+--------+--------------+----------------+-------+--------------+------------+
|start_date| end_date|num_qtrs|sales_target_n|sales_target_new|row_num|new_start_date|new_end_date|
+----------+----------+--------+--------------+----------------+-------+--------------+------------+
|2020-01-01|2020-12-31| 4| [4, 4, 4, 3]| 4| 1| 2020-01-01| 2020-03-31|
|2020-01-01|2020-12-31| 4| [4, 4, 4, 3]| 4| 2| 2020-04-01| 2020-06-30|
|2020-01-01|2020-12-31| 4| [4, 4, 4, 3]| 4| 3| 2020-07-01| 2020-09-30|
|2020-01-01|2020-12-31| 4| [4, 4, 4, 3]| 3| 4| 2020-10-01| 2020-12-31|
|2020-04-01|2020-12-31| 3| [4, 4, 3]| 4| 1| 2020-04-01| 2020-06-30|
|2020-04-01|2020-12-31| 3| [4, 4, 3]| 4| 2| 2020-07-01| 2020-09-30|
|2020-04-01|2020-12-31| 3| [4, 4, 3]| 3| 3| 2020-10-01| 2020-12-31|
|2020-07-01|2020-12-31| 2| [2, 1]| 2| 1| 2020-07-01| 2020-09-30|
|2020-07-01|2020-12-31| 2| [2, 1]| 1| 2| 2020-10-01| 2020-12-31|
+----------+----------+--------+--------------+----------------+-------+--------------+------------+
You can modify the window based on your requirements.
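Note that ordering the window by sales_target_new only happens to work here because every array is non-increasing, so the descending sort preserves the quarter order. If that is not guaranteed, a variant (a sketch, not part of the original answer) can take the quarter index directly from posexplode instead of row_number:
from pyspark.sql.functions import posexplode, expr

df.select('*', posexplode('sales_target_n').alias('pos', 'sales_target_new')).\
    withColumn('new_start_date', expr("add_months(start_date, pos * 3)")).\
    withColumn('new_end_date', expr("add_months(date_add(start_date, -1), (pos + 1) * 3)")).\
    orderBy('start_date', 'pos').show()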
Answer 2:
Try this: you need only start_date and end_date to calculate new_start_date and new_end_date.
Load the test data provided
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val data =
"""
|start_date| end_date |sales_target
|2020-01-01|2020-12-31| 15
|2020-04-01|2020-12-31| 11
|2020-07-01|2020-12-31| 3
""".stripMargin
val stringDS = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS)
df.show(false)
df.printSchema()
/**
* +-------------------+-------------------+------------+
* |start_date |end_date |sales_target|
* +-------------------+-------------------+------------+
* |2020-01-01 00:00:00|2020-12-31 00:00:00|15 |
* |2020-04-01 00:00:00|2020-12-31 00:00:00|11 |
* |2020-07-01 00:00:00|2020-12-31 00:00:00|3 |
* +-------------------+-------------------+------------+
*
* root
* |-- start_date: timestamp (nullable = true)
* |-- end_date: timestamp (nullable = true)
* |-- sales_target: integer (nullable = true)
*/
Calculate new_start_date and new_end_date:
val processedDF = df.withColumn("new_start_date",
    explode(sequence(to_date($"start_date"), to_date($"end_date"), expr("interval 3 month"))))
  .withColumn("new_end_date",
    // one day before the next quarter's start; the last quarter falls back to end_date itself
    coalesce(
      date_sub(lead("new_start_date", 1)
        .over(Window.partitionBy("start_date").orderBy("new_start_date")), 1),
      to_date($"end_date"))
  )
processedDF.orderBy("start_date", "new_start_date").show(false)
processedDF.printSchema()
/**
* +-------------------+-------------------+------------+--------------+------------+
* |start_date |end_date |sales_target|new_start_date|new_end_date|
* +-------------------+-------------------+------------+--------------+------------+
* |2020-01-01 00:00:00|2020-12-31 00:00:00|15 |2020-01-01 |2020-03-31 |
* |2020-01-01 00:00:00|2020-12-31 00:00:00|15 |2020-04-01 |2020-06-30 |
* |2020-01-01 00:00:00|2020-12-31 00:00:00|15 |2020-07-01 |2020-09-30 |
* |2020-01-01 00:00:00|2020-12-31 00:00:00|15 |2020-10-01 |2020-12-31 |
* |2020-04-01 00:00:00|2020-12-31 00:00:00|11 |2020-04-01 |2020-06-30 |
* |2020-04-01 00:00:00|2020-12-31 00:00:00|11 |2020-07-01 |2020-09-30 |
* |2020-04-01 00:00:00|2020-12-31 00:00:00|11 |2020-10-01 |2020-12-31 |
* |2020-07-01 00:00:00|2020-12-31 00:00:00|3 |2020-07-01 |2020-09-30 |
* |2020-07-01 00:00:00|2020-12-31 00:00:00|3 |2020-10-01 |2020-12-31 |
* +-------------------+-------------------+------------+--------------+------------+
*
* root
* |-- start_date: timestamp (nullable = true)
* |-- end_date: timestamp (nullable = true)
* |-- sales_target: integer (nullable = true)
* |-- new_start_date: date (nullable = false)
* |-- new_end_date: date (nullable = true)
*/
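Since the question asked for PySpark, here is a rough PySpark equivalent of the same sequence/lead idea (an untested sketch; note that the SQL sequence function requires Spark 2.4+, which may be the cause of a sequence error on older versions):
from pyspark.sql import functions as F, Window

w = Window.partitionBy("start_date").orderBy("new_start_date")

processed = (df
    # one row per quarter start between start_date and end_date (Spark 2.4+)
    .withColumn("new_start_date",
                F.explode(F.expr("sequence(to_date(start_date), to_date(end_date), interval 3 month)")))
    # one day before the next quarter's start; the last quarter falls back to end_date
    .withColumn("new_end_date",
                F.coalesce(F.date_sub(F.lead("new_start_date", 1).over(w), 1),
                           F.to_date("end_date"))))

processed.orderBy("start_date", "new_start_date").show()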
Answer 3:
package spark
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object Qvartal extends App {
val spark = SparkSession.builder()
.master("local")
.appName("DataFrame-example")
.getOrCreate()
import spark.implicits._
val dataDF = Seq(
("2020-01-01", "2020-12-31", 4, List(4,4,4,3)),
("2020-04-01", "2020-12-31", 3, List(4,4,3)),
("2020-07-01", "2020-12-31", 2, List(2,1))
).toDF("start_date", "end_date", "num_qtrs", "sales_target_n")
val listStartEnd = udf((b: String, e: String) => {
List("2020-01-01", "2020-04-01", "2020-07-01", "2020-10-01").filter(i => i >= b && i <= e)
.zip(List("2020-03-31", "2020-06-30", "2020-09-30", "2020-12-31").filter(i => i >= b && i <= e))
})
val resDF = dataDF.withColumn("new_start_end_date", lit(listStartEnd('start_date, 'end_date)))
resDF.show(false)
// +----------+----------+--------+--------------+--------------------------------------------------------------------------------------------------------+
// |start_date|end_date |num_qtrs|sales_target_n|new_start_end_date |
// +----------+----------+--------+--------------+--------------------------------------------------------------------------------------------------------+
// |2020-01-01|2020-12-31|4 |[4, 4, 4, 3] |[[2020-01-01, 2020-03-31], [2020-04-01, 2020-06-30], [2020-07-01, 2020-09-30], [2020-10-01, 2020-12-31]]|
// |2020-04-01|2020-12-31|3 |[4, 4, 3] |[[2020-04-01, 2020-06-30], [2020-07-01, 2020-09-30], [2020-10-01, 2020-12-31]] |
// |2020-07-01|2020-12-31|2 |[2, 1] |[[2020-07-01, 2020-09-30], [2020-10-01, 2020-12-31]] |
// +----------+----------+--------+--------------+--------------------------------------------------------------------------------------------------------+
val r11 = resDF
.withColumn("sales_target_n1", explode('sales_target_n))
.withColumn("monotonically_increasing_id", monotonically_increasing_id())
val r12 = r11.select(
'monotonically_increasing_id,
'start_date,
'end_date,
'num_qtrs,
'sales_target_n1
)
val r21 = resDF
.withColumn("new_start_end_date_1", explode('new_start_end_date))
.withColumn("monotonically_increasing_id", monotonically_increasing_id())
val r22 = r21.select(
'monotonically_increasing_id,
'start_date,
'end_date,
'num_qtrs,
'new_start_end_date_1
)
val resultDF = r12.join(r22,
r22.col("monotonically_increasing_id") === r12.col("monotonically_increasing_id"),
"inner")
.select(
r12.col("start_date"),
r12.col("end_date"),
r12.col("num_qtrs"),
r12.col("sales_target_n1").alias("sales_target_n"),
r22.col("new_start_end_date_1")
)
.withColumn("new_start_date", col("new_start_end_date_1").getItem("_1"))
.withColumn("new_end_date", col("new_start_end_date_1").getItem("_2"))
.drop("new_start_end_date_1")
resultDF.show(false)
// +----------+----------+--------+--------------+--------------+------------+
// |start_date|end_date |num_qtrs|sales_target_n|new_start_date|new_end_date|
// +----------+----------+--------+--------------+--------------+------------+
// |2020-01-01|2020-12-31|4 |4 |2020-01-01 |2020-03-31 |
// |2020-01-01|2020-12-31|4 |4 |2020-04-01 |2020-06-30 |
// |2020-01-01|2020-12-31|4 |4 |2020-07-01 |2020-09-30 |
// |2020-01-01|2020-12-31|4 |3 |2020-10-01 |2020-12-31 |
// |2020-04-01|2020-12-31|3 |4 |2020-04-01 |2020-06-30 |
// |2020-04-01|2020-12-31|3 |4 |2020-07-01 |2020-09-30 |
// |2020-04-01|2020-12-31|3 |3 |2020-10-01 |2020-12-31 |
// |2020-07-01|2020-12-31|2 |2 |2020-07-01 |2020-09-30 |
// |2020-07-01|2020-12-31|2 |1 |2020-10-01 |2020-12-31 |
// +----------+----------+--------+--------------+--------------+------------+
}
Source: https://stackoverflow.com/questions/62344243/pyspark-sql-add-different-qtr-start-date-end-date-for-exploded-rows