Question
I'm trying to compute pointwise mutual information (PMI).

I have two RDDs as defined here for p(x, y) and p(x) respectively:
pii: RDD[((String, String), Double)]
pi: RDD[(String, Double)]
Any code I write to compute PMI from the RDDs pii and pi is not pretty. My approach is first to flatten pii and join it with pi twice while massaging the tuple elements.
val pmi = pii
  .map(x => (x._1._1, (x._1._2, x._1, x._2)))    // key by x, keep ((x, y), p(x, y))
  .join(pi).values                               // attach p(x)
  .map(x => (x._1._1, (x._1._2, x._1._3, x._2))) // re-key by y
  .join(pi).values                               // attach p(y)
  .map(x => (x._1._1, computePMI(x._1._2, x._1._3, x._2)))
// pmi: org.apache.spark.rdd.RDD[((String, String), Double)]
...
import scala.math.log

def computePMI(pab: Double, pa: Double, pb: Double) = {
  // handle boundary conditions, etc.
  log(pab) - log(pa) - log(pb)
}
Clearly, this sucks. Is there a better (idiomatic) way to do this?
Note: I could avoid recomputing logs by storing log-probabilities in pi and pii, but I chose to write it this way to keep the question clear.
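As an aside on the "handle boundary conditions" comment: a guarded variant of computePMI might look like the sketch below. The name computePmiSafe and the choice of returning negative infinity for zero probabilities are assumptions for illustration, not part of the question.

```scala
import scala.math.log

// Hypothetical guarded variant: returns -Infinity when any probability
// is zero or negative, instead of letting log produce NaN for negative inputs.
def computePmiSafe(pab: Double, pa: Double, pb: Double): Double =
  if (pab <= 0.0 || pa <= 0.0 || pb <= 0.0) Double.NegativeInfinity
  else log(pab) - log(pa) - log(pb)
```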
Answer 1:
Using a broadcast variable would be a solution.
val bcPi = pi.context.broadcast(pi.collectAsMap())
val pmi = pii.map {
  case ((x, y), pxy) =>
    (x, y) -> computePMI(pxy, bcPi.value(x), bcPi.value(y))
}
This assumes pi contains every x and y that appears in pii.
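The per-record lookup logic of the broadcast approach can be checked without a cluster. The sketch below mirrors it with plain Scala maps standing in for the broadcast variable and the RDD; the toy probabilities are made-up values for illustration.

```scala
import scala.math.log

def computePMI(pab: Double, pa: Double, pb: Double): Double =
  log(pab) - log(pa) - log(pb)

// Toy distribution (assumed values): "a" and "b" always co-occur,
// each with marginal probability 0.5.
val pi  = Map("a" -> 0.5, "b" -> 0.5)  // stands in for bcPi.value
val pii = Map(("a", "b") -> 0.5)       // stands in for the RDD pii

// Same shape as the broadcast answer: one pass over pii,
// looking both marginals up in the in-memory map.
val pmi = pii.map { case ((x, y), pxy) =>
  (x, y) -> computePMI(pxy, pi(x), pi(y))
}
// pmi(("a", "b")) == log(0.5) - log(0.5) - log(0.5) == log(2)
```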
Source: https://stackoverflow.com/questions/29620297/computing-pointwise-mutual-information-in-spark