Question
I need to compute the avg and count for a given DataFrame, and fetch the previously stored avg and count for each company from a Cassandra table.
Then I need to calculate the new avg and count and persist them back into the Cassandra table.
How can I do this for each company?
I have two DataFrame schemas, as below:
ingested_df
|-- company_id: string (nullable = true)
|-- max_dd: date (nullable = true)
|-- min_dd: date (nullable = true)
|-- mean: double (nullable = true)
|-- count: long (nullable = false)
cassandra_df
|-- company_id: string (nullable = true)
|-- max_dd: date (nullable = true)
|-- mean: double (nullable = true)
|-- count: long (nullable = false)
For each company_id I need to fetch the stored "mean" and "count", calculate "new_mean" and "new_count", and store them back into Cassandra, i.e.:
new_mean = (ingested_df.mean + cassandra_df.mean) / (ingested_df.count + cassandra_df.count)
new_count = ingested_df.count + cassandra_df.count
How can this be done for each company?
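Expressed in Spark, that update is a join on company_id followed by two column expressions. A minimal sketch applying the formulas exactly as stated above (keeping the ingested max_dd is my assumption, not something the question specifies):

import org.apache.spark.sql.functions.col

// Sketch only: join the ingested batch with the stored state per company
// and apply the update formulas from the question verbatim.
val updatedDf = ingested_df.alias("i")
  .join(cassandra_df.alias("c"), Seq("company_id"), "inner")
  .select(
    col("company_id"),
    col("i.max_dd").as("max_dd"),   // assumption: keep the newer max_dd
    ((col("i.mean") + col("c.mean")) / (col("i.count") + col("c.count"))).as("mean"),
    (col("i.count") + col("c.count")).as("count")
  )

The result could then be written back with the spark-cassandra-connector, e.g. via updatedDf.write.format("org.apache.spark.sql.cassandra") with the appropriate keyspace and table options.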
Second attempt:
When I tried the join below for the same logic mentioned above:
val resultDf = cassandra_df.join(ingested_df,
  (cassandra_df("company_id") === ingested_df("company_id"))
    && (ingested_df("min_dd") > cassandra_df("max_dd")),
  "left")
it threw the following error:
org.apache.spark.sql.AnalysisException: Reference 'company_id' is ambiguous, could be: company_id, company_id.;
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213)
What is wrong here?
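A side note (not part of the original post): this AnalysisException usually means both join inputs expose a column with the same name, so a bare reference can resolve to either side. One common fix is to alias the DataFrames and qualify every column; a sketch:

import org.apache.spark.sql.functions.col

// Alias both sides and qualify each column so every reference
// resolves to exactly one input of the join.
val c = cassandra_df.alias("c")
val i = ingested_df.alias("i")

val resultDf = c.join(i,
  col("c.company_id") === col("i.company_id")
    && col("i.min_dd") > col("c.max_dd"),
  "left")

The answer below avoids the problem differently, by passing the join key as Seq("company_id") so Spark deduplicates it in the output.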
Answer 1:
Please try the following approach:
import spark.implicits._
import org.apache.spark.sql.functions.col

// Sample data typed to match the schemas above (mean: double, count: long)
val ingested_df = Seq(("1", 10.0, 3L)).toDF("company_id", "mean", "count")
val cassandra_df = Seq(("1", "123123", 20.0, 10L)).toDF("company_id", "max_dd", "mean", "count")

// Keep only the columns needed from the ingested side
val preparedIngestedDf = ingested_df.select("company_id", "mean", "count")

// Passing the join key as Seq("company_id") deduplicates it in the result,
// which avoids the ambiguous-reference error
val resultDf = cassandra_df.join(preparedIngestedDf, Seq("company_id"), "left")
  .withColumn("new_mean", (ingested_df("mean") + cassandra_df("mean")) / (ingested_df("count") + cassandra_df("count")))
  .withColumn("new_count", ingested_df("count") + cassandra_df("count"))
  .select(
    col("company_id"),
    col("max_dd"),
    col("new_mean").as("mean"),
    col("new_count").as("count")   // renamed to "count" to match the Cassandra schema
  )
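One caveat with the left join (my addition, not from the original answer): if a company exists in Cassandra but not in the ingested batch, the ingested-side columns are null, so new_mean and new_count come out null. A hedged variant that falls back to the stored values in that case:

import org.apache.spark.sql.functions.{coalesce, col}

// Fall back to the stored Cassandra values when the ingested side is null,
// so companies absent from the current batch keep their previous state.
val safeResultDf = cassandra_df.alias("c")
  .join(preparedIngestedDf.alias("i"), Seq("company_id"), "left")
  .select(
    col("company_id"),
    col("c.max_dd").as("max_dd"),
    coalesce(
      (col("i.mean") + col("c.mean")) / (col("i.count") + col("c.count")),
      col("c.mean")
    ).as("mean"),
    coalesce(col("i.count") + col("c.count"), col("c.count")).as("count")
  )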
Source: https://stackoverflow.com/questions/55473276/how-to-perform-accumulated-avg-for-multiple-companies-using-spark-based-on-the-r