Question
I need to compute the avg and count for a given DataFrame, and fetch the previously stored avg and count for each company from a Cassandra table.
Then I need to calculate the new avg and count and persist them back into the Cassandra table.
How can I do this for each company?
I have two DataFrame schemas, as below:
ingested_df
|-- company_id: string (nullable = true)
|-- max_dd: date (nullable = true)
|-- min_dd: date (nullable = true)
|-- mean: double (nullable = true)
|-- count: long (nullable = false)
cassandra_df
|-- company_id: string (nullable = true)
|-- max_dd: date (nullable = true)
|-- mean: double (nullable = true)
|-- count: long (nullable = false)
For each company_id I need to fetch the stored "mean" and "count", calculate "new_mean" and "new_count", and store them back into Cassandra, i.e.:
new_mean = (ingested_df.mean + cassandra_df.mean) / (ingested_df.count + cassandra_df.count)
new_count = ingested_df.count + cassandra_df.count
How can this be done for each company?
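Expressed in Spark, that update is a join on company_id followed by two column expressions. A minimal sketch applying the formulas exactly as stated above (keeping the ingested max_dd is my assumption, not something the question specifies):

import org.apache.spark.sql.functions.col

// Sketch only: join the ingested batch with the stored state per company
// and apply the update formulas from the question verbatim.
val updatedDf = ingested_df.alias("i")
  .join(cassandra_df.alias("c"), Seq("company_id"), "inner")
  .select(
    col("company_id"),
    col("i.max_dd").as("max_dd"),   // assumption: keep the newer max_dd
    ((col("i.mean") + col("c.mean")) / (col("i.count") + col("c.count"))).as("mean"),
    (col("i.count") + col("c.count")).as("count")
  )

The result could then be written back with the spark-cassandra-connector, e.g. via updatedDf.write.format("org.apache.spark.sql.cassandra") with the appropriate keyspace and table options.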
Second attempt:
When I tried the join below for the same logic mentioned above:
val resultDf = cassandra_df.join(ingested_df,
  (cassandra_df("company_id") === ingested_df("company_id"))
    && (ingested_df("min_dd") > cassandra_df("max_dd")),
  "left")
it threw the following error:
org.apache.spark.sql.AnalysisException: Reference 'company_id' is ambiguous, could be: company_id, company_id.;
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213)
What is wrong here?
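A side note (not part of the original post): this AnalysisException usually means both join inputs expose a column with the same name, so a bare reference can resolve to either side. One common fix is to alias the DataFrames and qualify every column; a sketch:

import org.apache.spark.sql.functions.col

// Alias both sides and qualify each column so every reference
// resolves to exactly one input of the join.
val c = cassandra_df.alias("c")
val i = ingested_df.alias("i")

val resultDf = c.join(i,
  col("c.company_id") === col("i.company_id")
    && col("i.min_dd") > col("c.max_dd"),
  "left")

The answer below avoids the problem differently, by passing the join key as Seq("company_id") so Spark deduplicates it in the output.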
Answer 1:
Please try the following approach:
import spark.implicits._
import org.apache.spark.sql.functions.col

// Sample data typed to match the schemas above (mean: double, count: long)
val ingested_df = Seq(("1", 10.0, 3L)).toDF("company_id", "mean", "count")
val cassandra_df = Seq(("1", "123123", 20.0, 10L)).toDF("company_id", "max_dd", "mean", "count")

// Keep only the columns needed from the ingested side
val preparedIngestedDf = ingested_df.select("company_id", "mean", "count")

// Passing the join key as Seq("company_id") deduplicates it in the result,
// which avoids the ambiguous-reference error
val resultDf = cassandra_df.join(preparedIngestedDf, Seq("company_id"), "left")
  .withColumn("new_mean", (ingested_df("mean") + cassandra_df("mean")) / (ingested_df("count") + cassandra_df("count")))
  .withColumn("new_count", ingested_df("count") + cassandra_df("count"))
  .select(
    col("company_id"),
    col("max_dd"),
    col("new_mean").as("mean"),
    col("new_count").as("count")   // renamed to "count" to match the Cassandra schema
  )
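One caveat with the left join (my addition, not from the original answer): if a company exists in Cassandra but not in the ingested batch, the ingested-side columns are null, so new_mean and new_count come out null. A hedged variant that falls back to the stored values in that case:

import org.apache.spark.sql.functions.{coalesce, col}

// Fall back to the stored Cassandra values when the ingested side is null,
// so companies absent from the current batch keep their previous state.
val safeResultDf = cassandra_df.alias("c")
  .join(preparedIngestedDf.alias("i"), Seq("company_id"), "left")
  .select(
    col("company_id"),
    col("c.max_dd").as("max_dd"),
    coalesce(
      (col("i.mean") + col("c.mean")) / (col("i.count") + col("c.count")),
      col("c.mean")
    ).as("mean"),
    coalesce(col("i.count") + col("c.count"), col("c.count")).as("count")
  )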
Source: https://stackoverflow.com/questions/55473276/how-to-perform-accumulated-avg-for-multiple-companies-using-spark-based-on-the-r