Question
I'm trying to compute pointwise mutual information (PMI).

I have two RDDs as defined here for p(x, y) and p(x) respectively:
pii: RDD[((String, String), Double)]
pi: RDD[(String, Double)]
Any code I write to compute PMI from the RDDs pii and pi is not pretty. My approach is first to flatten pii and join it with pi twice while massaging the tuple elements.
val pmi = pii
  .map(x => (x._1._1, (x._1._2, x._1, x._2)))    // key by x, keep ((x, y), p(x, y))
  .join(pi).values                               // attach p(x)
  .map(x => (x._1._1, (x._1._2, x._1._3, x._2))) // re-key by y
  .join(pi).values                               // attach p(y)
  .map(x => (x._1._1, computePMI(x._1._2, x._1._3, x._2)))
// pmi: org.apache.spark.rdd.RDD[((String, String), Double)]
...
import scala.math.log

def computePMI(pab: Double, pa: Double, pb: Double) = {
  // handle boundary conditions, etc.
  log(pab) - log(pa) - log(pb)
}
Clearly, this sucks. Is there a better (idiomatic) way to do this?
Note: I could avoid recomputing logs by storing log-probabilities in pi and pii, but I chose to write it this way to keep the question clear.
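As an aside on the "handle boundary conditions" comment: a guarded variant of computePMI might look like the sketch below. The name computePmiSafe and the choice of returning negative infinity for zero probabilities are assumptions for illustration, not part of the question.

```scala
import scala.math.log

// Hypothetical guarded variant: returns -Infinity when any probability
// is zero or negative, instead of letting log produce NaN for negative inputs.
def computePmiSafe(pab: Double, pa: Double, pb: Double): Double =
  if (pab <= 0.0 || pa <= 0.0 || pb <= 0.0) Double.NegativeInfinity
  else log(pab) - log(pa) - log(pb)
```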
Answer 1:
Using a broadcast variable would be a solution.
val bcPi = pi.context.broadcast(pi.collectAsMap())
val pmi = pii.map {
  case ((x, y), pxy) =>
    (x, y) -> computePMI(pxy, bcPi.value(x), bcPi.value(y))
}
This assumes pi contains every x and y that appears in pii.
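The per-record lookup logic of the broadcast approach can be checked without a cluster. The sketch below mirrors it with plain Scala maps standing in for the broadcast variable and the RDD; the toy probabilities are made-up values for illustration.

```scala
import scala.math.log

def computePMI(pab: Double, pa: Double, pb: Double): Double =
  log(pab) - log(pa) - log(pb)

// Toy distribution (assumed values): "a" and "b" always co-occur,
// each with marginal probability 0.5.
val pi  = Map("a" -> 0.5, "b" -> 0.5)  // stands in for bcPi.value
val pii = Map(("a", "b") -> 0.5)       // stands in for the RDD pii

// Same shape as the broadcast answer: one pass over pii,
// looking both marginals up in the in-memory map.
val pmi = pii.map { case ((x, y), pxy) =>
  (x, y) -> computePMI(pxy, pi(x), pi(y))
}
// pmi(("a", "b")) == log(0.5) - log(0.5) - log(0.5) == log(2)
```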
Source: https://stackoverflow.com/questions/29620297/computing-pointwise-mutual-information-in-spark