Convert Scala FP-growth RDD output to DataFrame


Question


https://spark.apache.org/docs/2.1.0/mllib-frequent-pattern-mining.html#fp-growth

sample_fpgrowth.txt can be found here, https://github.com/apache/spark/blob/master/data/mllib/sample_fpgrowth.txt

I ran the FP-growth example from the link above in Scala and it works fine, but what I need is how to convert the results, which are RDDs, into DataFrames. Both of these RDDs:

 model.freqItemsets and 
 model.generateAssociationRules(minConfidence)

Please explain this in detail using the example given in my question.


Answer 1:


There are many ways to create a DataFrame once you have an RDD. One of them is to use the .toDF function, which requires the sqlContext.implicits to be imported:

import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .appName("udf testings")
  .master("local")
  .getOrCreate()
val sc = sparkSession.sparkContext
val sqlContext = sparkSession.sqlContext
import sqlContext.implicits._
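
In Spark 2.x the same implicits can also be imported straight from the SparkSession instance; either form brings .toDF into scope:

// Equivalent alternative: import the toDF/toDS implicits directly from the session
import sparkSession.implicits._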

After that, read the fpgrowth text file and convert it into an RDD:

val data = sc.textFile("path to sample_fpgrowth.txt that you have used")
val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' '))

I have used the code from the Frequent Pattern Mining - RDD-based API page linked in the question:

val fpg = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(10)
val model = fpg.run(transactions)

The next step is to call the .toDF function on each RDD.

For the first dataframe:

model.freqItemsets
  .map(itemset => (itemset.items.mkString("[", ",", "]"), itemset.freq))
  .toDF("items", "freq")
  .show(false)

This results in:

+---------+----+
|items    |freq|
+---------+----+
|[z]      |5   |
|[x]      |4   |
|[x,z]    |3   |
|[y]      |3   |
|[y,x]    |3   |
|[y,x,z]  |3   |
|[y,z]    |3   |
|[r]      |3   |
|[r,x]    |2   |
|[r,z]    |2   |
|[s]      |3   |
|[s,y]    |2   |
|[s,y,x]  |2   |
|[s,y,x,z]|2   |
|[s,y,z]  |2   |
|[s,x]    |3   |
|[s,x,z]  |2   |
|[s,z]    |2   |
|[t]      |3   |
|[t,y]    |3   |
+---------+----+
only showing top 20 rows
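
As a side note, if you want to keep querying the itemsets afterwards (e.g. with explode or array functions), a minimal variation of the same call, assuming the model built above, keeps the items as an array column instead of flattening them into a string:

// Sketch (same `model` as above): keep items as array<string> instead of a
// pre-formatted string, so the resulting DataFrame stays easy to query.
val freqItemsetsDF = model.freqItemsets
  .map(itemset => (itemset.items, itemset.freq))
  .toDF("items", "freq")

freqItemsetsDF.printSchema()   // items: array<string>, freq: bigint
freqItemsetsDF.show(false)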

For the second dataframe:

val minConfidence = 0.8
model.generateAssociationRules(minConfidence)
  .map(rule =>(rule.antecedent.mkString("[", ",", "]"), rule.consequent.mkString("[", ",", "]"), rule.confidence))
  .toDF("antecedent", "consequent", "confidence").show(false)

which results in:

+----------+----------+----------+
|antecedent|consequent|confidence|
+----------+----------+----------+
|[t,s,y]   |[x]       |1.0       |
|[t,s,y]   |[z]       |1.0       |
|[y,x,z]   |[t]       |1.0       |
|[y]       |[x]       |1.0       |
|[y]       |[z]       |1.0       |
|[y]       |[t]       |1.0       |
|[p]       |[r]       |1.0       |
|[p]       |[z]       |1.0       |
|[q,t,z]   |[y]       |1.0       |
|[q,t,z]   |[x]       |1.0       |
|[q,y]     |[x]       |1.0       |
|[q,y]     |[z]       |1.0       |
|[q,y]     |[t]       |1.0       |
|[t,s,x]   |[y]       |1.0       |
|[t,s,x]   |[z]       |1.0       |
|[q,t,y,z] |[x]       |1.0       |
|[q,t,x,z] |[y]       |1.0       |
|[q,x]     |[y]       |1.0       |
|[q,x]     |[t]       |1.0       |
|[q,x]     |[z]       |1.0       |
+----------+----------+----------+
only showing top 20 rows
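
Also worth mentioning: if Spark 2.2+ is an option, the DataFrame-based API in org.apache.spark.ml.fpm returns DataFrames directly, so no RDD-to-DataFrame conversion is needed at all. A rough sketch, assuming the same transactions RDD as above (the ml FPGrowth is imported under an alias so it does not clash with the mllib FPGrowth used earlier):

// Sketch using the DataFrame-based API (available since Spark 2.2).
import org.apache.spark.ml.fpm.{FPGrowth => MLFPGrowth}

// Single array<string> column named "items", as the ml API expects.
val transactionsDF = transactions.toDF("items")

val mlModel = new MLFPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.2)
  .setMinConfidence(0.8)
  .fit(transactionsDF)

mlModel.freqItemsets.show(false)       // columns: items, freq
mlModel.associationRules.show(false)   // columns: antecedent, consequent, confidence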

I hope this is what you require.



Source: https://stackoverflow.com/questions/44262627/convert-scala-fp-growth-rdd-output-to-data-frame
