Convert Scala FP-growth RDD output to a DataFrame

Question

https://spark.apache.org/docs/2.1.0/mllib-frequent-pattern-mining.html#fp-growth

sample_fpgrowth.txt can be found here, https://github.com/apache/spark/blob/master/data/mllib/sample_fpgrowth.txt

I ran the FP-growth example from the link above in Scala and it works fine, but what I need is a way to convert the results, which are RDDs, to DataFrames. This applies to both of these RDDs:

 model.freqItemsets and
 model.generateAssociationRules(minConfidence)

Please explain this in detail using the example given in my question.

Answer 1:

There are many ways to create a DataFrame once you have an RDD. One of them is the .toDF function, which requires the implicits from sqlContext.implicits to be imported, as follows:

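import org.apache.spark.sql.SparkSession

// Local session for experimenting; adjust appName and master for a real cluster
val sparkSession = SparkSession.builder().appName("udf testings")
  .master("local")
  .getOrCreate()
val sc = sparkSession.sparkContext
val sqlContext = sparkSession.sqlContext
// Brings .toDF into scope for RDDs of tuples and case classes
import sqlContext.implicits._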
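
To see what the implicits import enables in isolation, here is a minimal sketch that turns a small RDD of tuples into a DataFrame. It assumes Spark 2.x or later on the classpath (for example, pasted into spark-shell). The names `spark`, `rdd` and `df` and the two sample rows are made up for illustration and are not taken from sample_fpgrowth.txt. It imports from the session's own `implicits` object, which provides the same conversions as `sqlContext.implicits`.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("toDF sketch").master("local[*]").getOrCreate()
// Without this import, .toDF is not available on RDDs
import spark.implicits._

// Hypothetical rows shaped like (items, freq); not real FP-growth output
val rdd = spark.sparkContext.parallelize(Seq(("[a]", 3L), ("[a,b]", 2L)))

// Column names are given positionally, one per tuple field
val df = rdd.toDF("items", "freq")
df.printSchema()
df.show(false)
```

The column types are inferred from the tuple's field types, so a String field becomes a string column and a Long field becomes a bigint column.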

After that, read the FP-growth text file and convert it into an RDD:

    import org.apache.spark.rdd.RDD

    val data = sc.textFile("path to sample_fpgrowth.txt that you have used")
    val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' '))

I have used the code from the "Frequent Pattern Mining - RDD-based API" guide that is linked in the question:

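import org.apache.spark.mllib.fpm.FPGrowth

val fpg = new FPGrowth()
  .setMinSupport(0.2)   // keep itemsets that appear in at least 20% of transactions
  .setNumPartitions(10)
val model = fpg.run(transactions)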

The next step is to map each RDD to tuples and call .toDF on the result.

For the first DataFrame:

model.freqItemsets.map(itemset =>(itemset.items.mkString("[", ",", "]") , itemset.freq)).toDF("items", "freq").show(false)

This results in:

+---------+----+
|items    |freq|
+---------+----+
|[z]      |5   |
|[x]      |4   |
|[x,z]    |3   |
|[y]      |3   |
|[y,x]    |3   |
|[y,x,z]  |3   |
|[y,z]    |3   |
|[r]      |3   |
|[r,x]    |2   |
|[r,z]    |2   |
|[s]      |3   |
|[s,y]    |2   |
|[s,y,x]  |2   |
|[s,y,x,z]|2   |
|[s,y,z]  |2   |
|[s,x]    |3   |
|[s,x,z]  |2   |
|[s,z]    |2   |
|[t]      |3   |
|[t,y]    |3   |
+---------+----+
only showing top 20 rows
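
The mkString call above flattens each itemset into a single string. If the items are needed as a real array column instead (for example, to query individual items later), the array can be passed to .toDF unchanged. The following is a sketch under assumptions: it uses a tiny hypothetical set of transactions rather than sample_fpgrowth.txt, and the names `spark` and `itemsetsDF` are illustrative only.

```scala
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("fp-growth arrays").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical toy transactions, not the contents of sample_fpgrowth.txt
val transactions = spark.sparkContext.parallelize(Seq(
  Array("a", "b", "c"),
  Array("a", "b"),
  Array("a", "c"),
  Array("a")
))

val model = new FPGrowth().setMinSupport(0.5).run(transactions)

// Keep items as Array[String]; Spark stores it as an array<string> column
val itemsetsDF = model.freqItemsets
  .map(itemset => (itemset.items, itemset.freq))
  .toDF("items", "freq")

itemsetsDF.printSchema()
itemsetsDF.show(false)
```

Whether a string or an array column is preferable depends on what happens next: strings are convenient for display or CSV export, while arrays work with Spark SQL's array functions.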

For the second DataFrame:

val minConfidence = 0.8
model.generateAssociationRules(minConfidence)
  .map(rule =>(rule.antecedent.mkString("[", ",", "]"), rule.consequent.mkString("[", ",", "]"), rule.confidence))
  .toDF("antecedent", "consequent", "confidence").show(false)

This results in:

+----------+----------+----------+
|antecedent|consequent|confidence|
+----------+----------+----------+
|[t,s,y]   |[x]       |1.0       |
|[t,s,y]   |[z]       |1.0       |
|[y,x,z]   |[t]       |1.0       |
|[y]       |[x]       |1.0       |
|[y]       |[z]       |1.0       |
|[y]       |[t]       |1.0       |
|[p]       |[r]       |1.0       |
|[p]       |[z]       |1.0       |
|[q,t,z]   |[y]       |1.0       |
|[q,t,z]   |[x]       |1.0       |
|[q,y]     |[x]       |1.0       |
|[q,y]     |[z]       |1.0       |
|[q,y]     |[t]       |1.0       |
|[t,s,x]   |[y]       |1.0       |
|[t,s,x]   |[z]       |1.0       |
|[q,t,y,z] |[x]       |1.0       |
|[q,t,x,z] |[y]       |1.0       |
|[q,x]     |[y]       |1.0       |
|[q,x]     |[t]       |1.0       |
|[q,x]     |[z]       |1.0       |
+----------+----------+----------+
only showing top 20 rows
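
The association rules can also keep their antecedent and consequent as array columns, which makes it possible to filter rules by item with the built-in `array_contains` function. This is only a sketch: the toy transactions and the names `spark`, `rulesDF` and `rulesPredictingA` are assumptions made for illustration, not part of the original example.

```scala
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.array_contains

val spark = SparkSession.builder().appName("fp-growth rules").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical toy transactions, not the contents of sample_fpgrowth.txt
val transactions = spark.sparkContext.parallelize(Seq(
  Array("a", "b", "c"),
  Array("a", "b"),
  Array("a", "c"),
  Array("a")
))

val model = new FPGrowth().setMinSupport(0.5).run(transactions)

// antecedent and consequent stay as Array[String] columns
val rulesDF = model.generateAssociationRules(0.8)
  .map(rule => (rule.antecedent, rule.consequent, rule.confidence))
  .toDF("antecedent", "consequent", "confidence")

// Keep only the rules whose consequent contains the item "a"
val rulesPredictingA = rulesDF.filter(array_contains($"consequent", "a"))
rulesPredictingA.show(false)
```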

I hope this is what you require

Source: https://stackoverflow.com/questions/44262627/convert-scala-fp-growth-rdd-output-to-data-frame

Tags

scala

apache-spark

apache-spark-mllib