Question
Suppose I want to create two types of metric, metricA and metricB, after transforming another dataset. If a certain condition is met, both metricA and metricB are generated; if the condition is not met, only metricA is generated. The idea is to write the two metrics to two different paths (pathA and pathB).
The approach I took was to create a Dataset[GeneralMetric] and then, based on what's inside, write to the different paths, but obviously it didn't work, since pattern matching inside a Dataset wouldn't work:
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

val s: SparkSession = SparkSession
  .builder()
  .appName("Metric")
  .getOrCreate()
import s.implicits._

case class original(id: Int, units: List[Double])
case class MetricA(a: Int, b: Int, filtered_unit: List[Double])
case class MetricB(a: Int, filtered_unit: List[Double])
case class GeneralMetric(metricA: MetricA, metricB: Option[MetricB])

def createA(): MetricA = {
  MetricA(1, 1, List(1.0, 2.0))
}

def createB(): MetricB = {
  MetricB(1, List(10.0, 20.0))
}
def create(isBoth: Boolean): GeneralMetric = {
  if (isBoth) {
    val a: MetricA = createA()
    val b: MetricB = createB()
    GeneralMetric(a, Some(b))
  } else {
    val a: MetricA = createA()
    GeneralMetric(a, None)
  }
}
val originalDF: DataFrame = ??? // the input DataFrame (how it is built is elided here)

val result: Dataset[GeneralMetric] =
  originalDF.as[original]
    .map { r =>
      if (r.id == 21) create(true)
      else create(false)
    }
val pathA: String = "s3://pathA"
val pathB: String = "s3://pathB"
// the code below obviously wouldn't work: GeneralMetric is a case class,
// not a tuple, and the case classes have no .write method
result.map(x => {
  case (metricA, Some(metricB)) =>
    metricA.write.parquet(pathA)
    metricB.write.parquet(pathB)
  case (metricA, None) => metricA.write.parquet(pathA)
})
The next approach I was thinking of was putting the results in a List[GeneralMetric], where GeneralMetric is a sealed trait extended by both MetricA and MetricB, but how can I make a Dataset transformation return a list of GeneralMetric?
Any ideas would be helpful.
Answer 1:
Why wouldn't

result.map {
  case GeneralMetric(metricA, Some(metricB)) =>
    metricA.write.parquet(pathA)
    metricB.write.parquet(pathB)
  case GeneralMetric(metricA, None) =>
    metricA.write.parquet(pathA)
}

work in your case? Is this just a syntax problem?
Also: it seems that you write the metrics independently (or at least in this example). You could model it as:
sealed trait Metric {
  def write: Unit
}

case class MetricA(a: Int, b: Int, filtered_unit: List[Double]) extends Metric {
  override def write: Unit = ???
}

case class MetricB(a: Int, filtered_unit: List[Double]) extends Metric {
  override def write: Unit = ???
}
and call:
import org.apache.spark.sql.{Encoder, Encoders}

implicit val enc: Encoder[Metric] = Encoders.kryo[Metric]

val result: Dataset[Metric] =
  originalDF.as[original]
    .flatMap { r =>
      if (r.id == 21) createA() :: createB() :: Nil
      else createA() :: Nil
    }
result.foreach(_.write)
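One caveat with the foreach approach: Spark's DataFrameWriter writes a whole Dataset from the driver, so calling parquet writes per element inside foreach on the executors is not how the two outputs would actually be produced. An alternative is to split the mixed Dataset[Metric] back into one typed Dataset per case class and write each once. The following is a minimal sketch under that assumption; the flatMap-based split is my illustration, not code from the original answer:

// Split the mixed Dataset[Metric] into one strongly typed Dataset per
// metric; the case-class encoders come from `import s.implicits._`.
val metricsA: Dataset[MetricA] = result.flatMap {
  case a: MetricA => Seq(a)
  case _          => Seq.empty[MetricA]
}
val metricsB: Dataset[MetricB] = result.flatMap {
  case b: MetricB => Seq(b)
  case _          => Seq.empty[MetricB]
}

// Each write happens once, on the driver, over the whole Dataset.
metricsA.write.parquet(pathA)
metricsB.write.parquet(pathB)

This keeps the sealed-trait model for the transformation while leaving the actual I/O to Spark's own writer, which is also what the question's pathA/pathB layout calls for.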
Source: https://stackoverflow.com/questions/61022293/scala-spark-create-list-of-dataset-from-a-dataset-map-operation