Quantity redistribution logic - MapGroups with external dataset

眉间皱痕 submitted on 2020-08-02 03:15:30

Question


I am working on some complex logic where I need to redistribute a quantity from one dataset to another.

In the example we have Owner and Invoice: we need to subtract the Invoice quantity from the exact Owner match (the same car at the same postal code). The subtracted quantity then needs to be redistributed to the other postal codes where the same car appears. The complication is that we should avoid redistributing to any postal code where the same car is also present in the Invoice table under another pcode.

Finally, if the subtraction or the redistribution would produce a negative value, we should skip the transformation for that Invoice.

Here is an example with numbers
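
Since the original figure is not available here, below is a small hand-worked sketch of the arithmetic for car A only (my own illustration based on the rule described above and the sample data further down, not the original example):

  // Invoices for car A: (A, 666, 15) and (A, 888, 12).
  // Step 1: subtract at the exact matches: (A, 666): 80 - 15 = 65 and (A, 888): 100 - 12 = 88.
  // Step 2: redistribute the subtracted 15 + 12 = 27 proportionally over A's remaining pcodes
  //         that do NOT appear in the Invoice table, i.e. 444 (qtty 50) and 222 (qtty 20).
  val toDistribute = 15.0 + 12.0                        // 27.0
  val base         = 50.0 + 20.0                        // 70.0
  val new444       = 50.0 + toDistribute * 50.0 / base  // ~ 69.29
  val new222       = 20.0 + toDistribute * 20.0 / base  // ~ 27.71

Because the split is proportional, applying the two invoices one after the other gives the same end result as summing them first.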

Below is my code version, but unfortunately it doesn't work as expected. In particular, I don't know how to skip the records that appear multiple times in the Invoice for a given car. In the first example (red), I don't know how to skip the record Owner(A, 888, 100).

package playground

import org.apache.spark.sql.SparkSession


object basic extends App {
  val spark = SparkSession
    .builder()
    .appName("Sample app")
    .master("local")
    .getOrCreate()

  import spark.implicits._

  final case class Owner(car: String, pcode: String, qtty: Double)
  final case class Invoice(car: String, pcode: String, qtty: Double)

  val sc = spark.sparkContext

  val data = Seq(
    Owner("A", "666", 80),
    Owner("B", "555", 20),
    Owner("A", "444", 50),
    Owner("A", "222", 20),
    Owner("C", "444", 20),
    Owner("C", "666", 80),
    Owner("C", "555", 120),
    Owner("A", "888", 100)
  )

  val fleet = Seq(
    Invoice("A", "666", 15),
    Invoice("C", "444", 10),
    Invoice("A", "888", 12),
    Invoice("B", "555", 200)
  )

  val owners = spark.createDataset(data)
  val invoices = spark.createDataset(fleet)

  val actual = owners
    .joinWith(invoices, owners("car") === invoices("car"), joinType = "right")
    .groupByKey(_._2)
    .flatMapGroups {
      case (invoice, group) =>
        val subOwner: Vector[Owner] = group.toVector.map(_._1)
        val householdToBeInvoiced: Vector[Owner] =
          subOwner.filter(_.pcode == invoice.pcode)
        val modifiedOwner: Vector[Owner] = if (householdToBeInvoiced.nonEmpty) {
          // negative compensation (remove the quantity from Invoice for the exact match)
          val neg: Owner = householdToBeInvoiced.head
          val calculatedNeg: Owner = neg.copy(qtty = neg.qtty - invoice.qtty)

          // positive compensation (redistribute the "removed" quantity proportionally, but not to pcodes
          // that already exist in the invoice for the same car)
          val pos = subOwner.filter(s => s.pcode != invoice.pcode)
          val totalQuantityOwner = pos.map(_.qtty).sum
          val calculatedPos: Vector[Owner] =
            pos.map(
              c =>
                c.copy(
                  qtty = c.qtty + invoice.qtty * c.qtty / (totalQuantityOwner - neg.qtty)
              )
            )

          (calculatedPos :+ calculatedNeg)
        } else {
          subOwner
        }

        modifiedOwner
    }

  actual.show()
}

This code produces:

+---+-----+------------------+
|car|pcode|              qtty|
+---+-----+------------------+
|  A|  888|116.66666666666667|
|  A|  222|23.333333333333332|
|  A|  444|58.333333333333336|
|  A|  666|              65.0|
|  C|  555|126.66666666666667|
|  C|  666| 84.44444444444444|
|  C|  444|              10.0|
|  B|  555|            -180.0|
|  A|  222|              24.8|
|  A|  444|              62.0|
|  A|  666|              99.2|
|  A|  888|              88.0|
+---+-----+------------------+

Any help would be much appreciated! Thanks.


After some more thought on this problem, I managed to improve the code, but I still cannot get the iterative approach in place (use the result of one computation as the input of the next, e.g. use the result of the red record to produce the blue record, etc.).

package playground

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{Dataset, KeyValueGroupedDataset, SparkSession}

object basic extends App {

  Logger.getLogger("org").setLevel(Level.OFF)
  Logger.getLogger("akka").setLevel(Level.OFF)

  val spark = SparkSession
    .builder()
    .appName("Spark Optimization Playground")
    .master("local")
    .getOrCreate()

  import spark.implicits._

  final case class Owner(car: String, pcode: String, qtty: Double)
  final case class Invoice(car: String, pcode: String, qtty: Double)

  val data = Seq(
    Owner("A", "666", 80),
    Owner("B", "555", 20),
    Owner("A", "444", 50),
    Owner("A", "222", 20),
    Owner("C", "444", 20),
    Owner("C", "666", 80),
    Owner("C", "555", 120),
    Owner("A", "888", 100)
  )

  val fleet = Seq(
    Invoice("A", "666", 15),
    Invoice("C", "444", 10),
    Invoice("A", "888", 12),
    Invoice("B", "555", 200)
  )

  val owners = spark.createDataset(data)
  val invoices = spark.createDataset(fleet)

  val secondFleets = invoices.map(identity)

  val fleetPerCar =
    invoices
      .joinWith(secondFleets, invoices("car") === secondFleets("car"), "inner")
      .groupByKey(_._1)
      .flatMapGroups {
        case (value, iter) => Iterator((value, iter.toArray))
      }

  val gb
    : KeyValueGroupedDataset[(Invoice, Array[(Invoice, Invoice)]),
                             (Owner, (Invoice, Array[(Invoice, Invoice)]))] =
    owners
      .joinWith(fleetPerCar, owners("car") === fleetPerCar("_1.car"), "right")
      .groupByKey(_._2)

  val x: Dataset[Owner] =
    gb.flatMapGroups {
      case (fleet, group) =>
        val subOwner: Vector[Owner] = group.toVector.map(_._1)
        val householdToBeInvoiced: Vector[Owner] =
          subOwner.filter(_.pcode == fleet._1.pcode)
        val modifiedOwner: Vector[Owner] = if (householdToBeInvoiced.nonEmpty) {
          // negative compensation (remove the quantity from Invoice for the exact match)
          val neg: Owner = householdToBeInvoiced.head
          val calculatedNeg: Owner = neg.copy(qtty = neg.qtty - fleet._1.qtty)

          // positive compensation (redistribute the "removed" quantity proportionally, but not to pcodes
          // that already exist in the invoice for the same car)
          val otherPCode =
            fleet._2.filter(_._2.pcode != fleet._1.pcode).map(_._2.pcode)

          val pos = subOwner.filter(
            s => s.pcode != fleet._1.pcode && !otherPCode.contains(s.pcode)
          )
          val totalQuantityOwner = pos.map(_.qtty).sum + neg.qtty
          val calculatedPos: Vector[Owner] =
            pos.map(
              c =>
                c.copy(
                  qtty = c.qtty + fleet._1.qtty * c.qtty / (totalQuantityOwner - neg.qtty)
              )
            )
          // if pos or neg compensation produce negative quantity, skip the computation
          val res = (calculatedPos :+ calculatedNeg)
          if (res.exists(_.qtty < 0)) {
            subOwner
          } else {
            res
          }
        } else {
          subOwner
        }

        modifiedOwner
    }
  x.show()
}

Answer 1:


The first solution is based on Spark Datasets and SparkSQL and provides the expected results.

There are many ways to tune this approach, including with respect to performance, which is discussed later on.

import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}

object basic {

  val spark = SparkSession
    .builder()
    .appName("Sample app")
    .master("local")
    .config("spark.sql.shuffle.partitions","200") //Change to a more reasonable default number of partitions for our data
    .getOrCreate()

  val sc = spark.sparkContext

  case class Owner(car: String, pcode: String, qtty: Double)
  case class Invoice(car: String, pcode: String, qtty: Double)

  def main(args: Array[String]): Unit = {

    val data = Seq(
      Owner("A", "666", 80),
      Owner("B", "555", 20),
      Owner("A", "444", 50),
      Owner("A", "222", 20),
      Owner("C", "444", 20),
      Owner("C", "666", 80),
      Owner("C", "555", 120),
      Owner("A", "888", 100)
    )

    val fleet = Seq(
      Invoice("A", "666", 15),
      Invoice("C", "666", 10),
      Invoice("A", "888", 12),
      Invoice("B", "555", 200)
    )

    val expected = Seq(
      Owner("A", "666", 65),
      Owner("B", "555", 20), // not redistributed because produce a negative value
      Owner("A", "444", 69.29),
      Owner("A", "222", 27.71),
      Owner("C", "444", 21.43),
      Owner("C", "666", 70),
      Owner("C", "555", 128.57),
      Owner("A", "888", 88)
    )

    Logger.getRootLogger.setLevel(Level.ERROR)

    try {
      import spark.implicits._

      val owners = spark.createDataset(data).as[Owner].cache()
      val invoices = spark.createDataset(fleet).as[Invoice].cache()

      owners.createOrReplaceTempView("owners")
      invoices.createOrReplaceTempView("invoices")

      /**
        * this part fetches car and pcode from owners with the quantity subtracted from the invoice
        */
      val p1 = spark.sql(
        """SELECT i.car,i.pcode,
          |CASE WHEN (o.qtty - i.qtty) < 0 THEN o.qtty ELSE (o.qtty - i.qtty) END AS qtty,
          |CASE WHEN (o.qtty - i.qtty) < 0 THEN 0 ELSE i.qtty END AS to_distribute
          |FROM owners o
          |INNER JOIN invoices i  ON(i.car = o.car AND i.pcode = o.pcode)
          |""".stripMargin)
        .cache()
      p1.createOrReplaceTempView("p1")

      /**
        * this part fetches all the car and pcode rows that will receive the redistributed quantity
        */
      val p2 = spark.sql(
        """SELECT o.car, o.pcode, o.qtty
          |FROM owners o
          |LEFT OUTER JOIN invoices i  ON(i.car = o.car AND i.pcode = o.pcode)
          |WHERE i.car IS NULL
          |""".stripMargin)
        .cache()
      p2.createOrReplaceTempView("p2")

      /**
        * this part fetches the total quantity to distribute per car
        */
      val distribute = spark.sql(
        """
          |SELECT car, SUM(to_distribute) AS to_distribute
          |FROM p1
          |GROUP BY car
          |""".stripMargin)
        .cache()
      distribute.createOrReplaceTempView("distribute")

      /**
        * this part fetches the proportional base (total owner quantity per car) used for the redistribution
        */
      val proportion = spark.sql(
        """
          |SELECT car, SUM(qtty) AS proportion
          |FROM p2
          |GROUP BY car
          |""".stripMargin)
          .cache()
      proportion.createOrReplaceTempView("proportion")


      /**
        * this part joins p2 with the calculated distribution and unions the result with p1
        */
      val result = spark.sql(
        """
          |SELECT p2.car, p2.pcode, ROUND(((to_distribute / proportion) * qtty) + qtty, 2) AS qtty
          |FROM p2
          |JOIN distribute d ON(p2.car = d.car)
          |JOIN proportion p ON(d.car = p.car)
          |UNION ALL
          |SELECT car, pcode, qtty
          |FROM p1
          |""".stripMargin)

      result.show(truncate = false)
/*
+---+-----+------+
|car|pcode|qtty  |
+---+-----+------+
|A  |444  |69.29 |
|A  |222  |27.71 |
|C  |444  |21.43 |
|C  |555  |128.57|
|A  |666  |65.0  |
|B  |555  |20.0  |
|C  |666  |70.0  |
|A  |888  |88.0  |
+---+-----+------+
*/

      expected
        .toDF("car","pcode","qtty")
        .show(truncate = false)
/*
+---+-----+------+
|car|pcode|qtty  |
+---+-----+------+
|A  |666  |65.0  |
|B  |555  |20.0  |
|A  |444  |69.29 |
|A  |222  |27.71 |
|C  |444  |21.43 |
|C  |666  |70.0  |
|C  |555  |128.57|
|A  |888  |88.0  |
+---+-----+------+
*/

    } finally {
      sc.stop()
      println("SparkContext stopped")
      spark.stop()
      println("SparkSession stopped")
    }
  }
}

USING THE DATASET API

Another approach to this problem, producing the same results, is to use Datasets and their API. For example:

import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.functions._
import org.apache.spark.storage.StorageLevel

object basic2 {

  val spark = SparkSession
    .builder()
    .appName("Sample app")
    .master("local")
    .config("spark.sql.shuffle.partitions","200") //Change to a more reasonable default number of partitions for our data
    .getOrCreate()

  val sc = spark.sparkContext

  final case class Owner(car: String, pcode: String, o_qtty: Double)
  final case class Invoice(car: String, pcode: String, i_qtty: Double)

  def main(args: Array[String]): Unit = {

    val data = Seq(
      Owner("A", "666", 80),
      Owner("B", "555", 20),
      Owner("A", "444", 50),
      Owner("A", "222", 20),
      Owner("C", "444", 20),
      Owner("C", "666", 80),
      Owner("C", "555", 120),
      Owner("A", "888", 100)
    )

    val fleet = Seq(
      Invoice("A", "666", 15),
      Invoice("C", "666", 10),
      Invoice("A", "888", 12),
      Invoice("B", "555", 200)
    )

    val expected = Seq(
      Owner("A", "666", 65),
      Owner("B", "555", 20), // not redistributed because produce a negative value
      Owner("A", "444", 69.29),
      Owner("A", "222", 27.71),
      Owner("C", "444", 21.43),
      Owner("C", "666", 70),
      Owner("C", "555", 128.57),
      Owner("A", "888", 88)
    )

    Logger.getRootLogger.setLevel(Level.ERROR)

    try {
      import spark.implicits._

      val owners = spark.createDataset(data)
        .as[Owner]
        .cache()

      val invoices = spark.createDataset(fleet)
        .as[Invoice]
        .cache()

      val p1 = owners
        .join(invoices,Seq("car","pcode"),"inner")
        .selectExpr("car","pcode","IF(o_qtty-i_qtty < 0,o_qtty,o_qtty - i_qtty) AS qtty","IF(o_qtty-i_qtty < 0,0,i_qtty) AS to_distribute")
        .persist(StorageLevel.MEMORY_ONLY)

      val p2 = owners
        .join(invoices,Seq("car","pcode"),"left_outer")
        .filter(row => row.anyNull)
        .drop(col("i_qtty"))
        .withColumnRenamed("o_qtty","qtty")
        .persist(StorageLevel.MEMORY_ONLY)

      val distribute = p1
        .groupBy(col("car"))
        .agg(sum(col("to_distribute")).as("to_distribute"))
        .persist(StorageLevel.MEMORY_ONLY)

      val proportion = p2
          .groupBy(col("car"))
          .agg(sum(col("qtty")).as("proportion"))
          .persist(StorageLevel.MEMORY_ONLY)

      val result = p2
        .join(distribute, "car")
        .join(proportion, "car")
        .withColumn("qtty",round( ((col("to_distribute") / col("proportion")) * col("qtty")) + col("qtty"), 2 ))
        .drop("to_distribute","proportion")
        .union(p1.drop("to_distribute"))

      result.show()
/*
+---+-----+------+
|car|pcode|  qtty|
+---+-----+------+
|  A|  444| 69.29|
|  A|  222| 27.71|
|  C|  444| 21.43|
|  C|  555|128.57|
|  A|  666|  65.0|
|  B|  555|  20.0|
|  C|  666|  70.0|
|  A|  888|  88.0|
+---+-----+------+
*/

      expected
        .toDF("car","pcode","qtty")
        .show(truncate = false)
/*
+---+-----+------+
|car|pcode|qtty  |
+---+-----+------+
|A  |666  |65.0  |
|B  |555  |20.0  |
|A  |444  |69.29 |
|A  |222  |27.71 |
|C  |444  |21.43 |
|C  |666  |70.0  |
|C  |555  |128.57|
|A  |888  |88.0  |
+---+-----+------+
*/

    } finally {
      sc.stop()
      println("SparkContext stopped")
      spark.stop()
      println("SparkSession stopped")
    }
  }
}

Some general considerations about performance and tuning.

It always depends on your particular use case, but in general, if you can filter and clean the data first, you may see some improvement.

The whole point of using a high-level declarative API is to isolate yourself from low-level implementation details. Optimization is the job of the Catalyst optimizer. It is a sophisticated engine, and I really doubt anyone can easily improve on it without diving much deeper into its internals.

Default number of partitions property: spark.sql.shuffle.partitions. Set it up properly.

By default, Spark SQL uses spark.sql.shuffle.partitions partitions for aggregations and joins, i.e. 200. That often leads to an explosion of partitions for nothing, which does impact query performance, since all 200 tasks (one per partition) have to start and finish before you get the result.

Think how many partitions your query really requires.
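
For example, a minimal sketch of adjusting it at runtime (the value 8 is only an illustrative figure for a small local run, not a recommendation):

  // Set the shuffle partition count for the current session (illustrative value).
  spark.conf.set("spark.sql.shuffle.partitions", "8")

  // Or fix it when the session is created, as in the examples above:
  // .config("spark.sql.shuffle.partitions", "8")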

Spark can only run one concurrent task for every partition of an RDD, up to the number of cores in your cluster. So if you have a cluster with 50 cores, you want your RDDs to have at least 50 partitions. As for choosing a "good" number of partitions, you generally want at least as many as the number of executor cores, for parallelism. You can get this computed value by calling

sc.defaultParallelism

or inspect RDD Partitions number by

df.rdd.partitions.size

Repartition: increases the number of partitions; rebalancing partitions after a filter increases parallelism. repartition(numPartitions: Int)

Coalesce: decreases the number of partitions WITHOUT a shuffle; useful to consolidate before writing to HDFS/external storage. coalesce(numPartitions: Int, shuffle: Boolean = false)
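
As a rough sketch of both calls (a hypothetical illustration reusing the result DataFrame from the example above; the numbers are arbitrary):

  // Increase the number of partitions (full shuffle), e.g. to regain parallelism after a selective filter.
  val rebalanced = result.repartition(8)

  // Decrease the number of partitions without a shuffle, e.g. to write fewer output files.
  val consolidated = result.coalesce(1)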

You can follow this link: Managing Spark Partitions with Coalesce and Repartition

Cache the data to avoid recomputation: dataFrame.cache()
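
The examples above already use both cache() and persist(StorageLevel.MEMORY_ONLY); as a brief sketch of the related calls:

  // cache() is shorthand for persist() with the default MEMORY_AND_DISK level for Datasets.
  val cached = result.cache()
  cached.count()      // an action materializes the cache
  cached.unpersist()  // release the memory once the data is no longer needed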

Analyzer — Logical Query Plan Analyzer

Analyzer is the logical query plan analyzer in Spark SQL that semantically validates and transforms an unresolved logical plan to an analyzed logical plan.

You can access the analyzed logical plan of a Dataset using explain (with extended flag enabled)

dataframe.explain(extended = true)

For further performance options see the documentation: Performance Tuning

There are a lot of possibilities for tuning Spark processes, but it always depends on your use case.

Batch or streaming process? DataFrames or plain RDDs? Hive or no Hive? Shuffled data or not? And so on.

I strongly recommend you The Internals of Spark SQL by Jacek Laskowski.

Finally, you will have to run some trials with different values and benchmark them to see how long the process takes with a data sample.

  val start = System.nanoTime()

  // my process

  val end = System.nanoTime()

  val time = end - start
  println(s"My App takes: $time")

Hope this helps.



Source: https://stackoverflow.com/questions/62708493/quantity-redistribution-logic-mapgroups-with-external-dataset
