how to efficiently parse dataframe object into a map of key-value pairs

问题

i'm working with a dataframe with the columns basketID and itemID. is there a way to efficiently parse through the dataset and generate a map where the keys are basketID and the value is a set of all the itemID contained within each basket?

my current implementation uses a for loop over the data frame which isn't very scalable. is it possible to do this more efficiently? any help would be appreciated thanks!

screen shot of sample data

the goal is to obtain basket = Map("b1" -> Set("i1", "i2", "i3"), "b2" -> Set("i2", "i4"), "b3" -> Set("i3", "i5"), "b4" -> Set("i6")). heres the implementation I have using a for loop

// create empty container
val basket = scala.collection.mutable.Map[String, Set[String]]()
// loop over all numerical indexes for baskets (b<i>)
for (i <- 1 to 4) {
  basket("b" + i.toString) = Set();
}
// loop over every row in df and store the items to the set
df.collect().foreach(row => 
  basket(row(0).toString) += row(1).toString
)

回答1:

You can simply do aggregateByKey operation then collectItAsMap will directly give you the desired result. It is much more efficient than simple groupBy.

import scala.collection.mutable
case class Items(basketID: String,itemID: String)
 
 import spark.implicits._
 val result = output.as[Items].rdd.map(x => (x.basketID,x.itemID))
.aggregateByKey[mutable.Buffer[String]](new mutable.ArrayBuffer[String]())
 ((l: mutable.Buffer[String], p: String) => l += p , 
 (l1: mutable.Buffer[String], l2: mutable.Buffer[String]) => (l1 ++ l2).distinct)
.collectAsMap();

you can check other aggregation api's like reduceBy and groupBy over here. please also check aggregateByKey vs groupByKey vs ReduceByKey differences.

回答2:

This is efficient assuming your dataset is small enough to fit into the driver's memory. .collect will give you an array of rows on which you are iterating which is fine. If you want scalability then instead of Map[String, Set[String]] (this will reside in driver memory) you can use PairRDD[String, Set[String]] (this will be distributed).

//NOT TESTED

//Assuming df is dataframe with 2 columns, first is your basketId and second is itemId

df.rdd.map(row => (row.getAs[String](0), row.getAs[String](1)).groupByKey().mapValues(x => x.toSet)

来源：https://stackoverflow.com/questions/63677703/how-to-efficiently-parse-dataframe-object-into-a-map-of-key-value-pairs

标签

apache-spark

apache-spark-sql