问题
https://spark.apache.org/docs/2.1.0/mllib-frequent-pattern-mining.html#fp-growth
sample_fpgrowth.txt can be found here, https://github.com/apache/spark/blob/master/data/mllib/sample_fpgrowth.txt
I ran the FP-growth example in the link above in scala its working fine, but what i need is, how to convert the result which is in RDD to data frame. Both these RDD
model.freqItemsets and
model.generateAssociationRules(minConfidence)
explain that in detail with the example given in my question.
回答1:
There many ways to create a dataframe
once you have a rdd
. One of them is to use .toDF
function which requires sqlContext.implicits
library to be imported
as
val sparkSession = SparkSession.builder().appName("udf testings")
.master("local")
.config("", "")
.getOrCreate()
val sc = sparkSession.sparkContext
val sqlContext = sparkSession.sqlContext
import sqlContext.implicits._
After that you read the fpgrowth
text file and covert into an rdd
val data = sc.textFile("path to sample_fpgrowth.txt that you have used")
val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' '))
I have used the code from Frequent Pattern Mining - RDD-based API that is provided in the question
val fpg = new FPGrowth()
.setMinSupport(0.2)
.setNumPartitions(10)
val model = fpg.run(transactions)
Next step would be to call .toDF
functions
For the first dataframe
model.freqItemsets.map(itemset =>(itemset.items.mkString("[", ",", "]") , itemset.freq)).toDF("items", "freq").show(false)
this will result to
+---------+----+
|items |freq|
+---------+----+
|[z] |5 |
|[x] |4 |
|[x,z] |3 |
|[y] |3 |
|[y,x] |3 |
|[y,x,z] |3 |
|[y,z] |3 |
|[r] |3 |
|[r,x] |2 |
|[r,z] |2 |
|[s] |3 |
|[s,y] |2 |
|[s,y,x] |2 |
|[s,y,x,z]|2 |
|[s,y,z] |2 |
|[s,x] |3 |
|[s,x,z] |2 |
|[s,z] |2 |
|[t] |3 |
|[t,y] |3 |
+---------+----+
only showing top 20 rows
for the second dataframe
val minConfidence = 0.8
model.generateAssociationRules(minConfidence)
.map(rule =>(rule.antecedent.mkString("[", ",", "]"), rule.consequent.mkString("[", ",", "]"), rule.confidence))
.toDF("antecedent", "consequent", "confidence").show(false)
which will result to
+----------+----------+----------+
|antecedent|consequent|confidence|
+----------+----------+----------+
|[t,s,y] |[x] |1.0 |
|[t,s,y] |[z] |1.0 |
|[y,x,z] |[t] |1.0 |
|[y] |[x] |1.0 |
|[y] |[z] |1.0 |
|[y] |[t] |1.0 |
|[p] |[r] |1.0 |
|[p] |[z] |1.0 |
|[q,t,z] |[y] |1.0 |
|[q,t,z] |[x] |1.0 |
|[q,y] |[x] |1.0 |
|[q,y] |[z] |1.0 |
|[q,y] |[t] |1.0 |
|[t,s,x] |[y] |1.0 |
|[t,s,x] |[z] |1.0 |
|[q,t,y,z] |[x] |1.0 |
|[q,t,x,z] |[y] |1.0 |
|[q,x] |[y] |1.0 |
|[q,x] |[t] |1.0 |
|[q,x] |[z] |1.0 |
+----------+----------+----------+
only showing top 20 rows
I hope this is what you require
来源:https://stackoverflow.com/questions/44262627/convert-scala-fp-growth-rdd-output-to-data-frame