Spark, Scala, DataFrame: create feature vectors

無奈伤痛 2020-12-13 01:12

I have a DataFrame that looks like this:

userID, category, frequency
1,cat1,1
1,cat2,3
1,cat9,5
2,c         


        
3 answers
  •  自闭症患者
    2020-12-13 01:48

    Given your input:

    val df = Seq((1, "cat1", 1), (1, "cat2", 3), (1, "cat9", 5), 
                 (2, "cat4", 6), (2, "cat9", 2), (2, "cat10", 1), 
                 (3, "cat1", 5), (3, "cat7", 16), (3, "cat8", 2))
               .toDF("userID", "category", "frequency")
    df.show
    +------+--------+---------+
    |userID|category|frequency|
    +------+--------+---------+
    |     1|    cat1|        1|
    |     1|    cat2|        3|
    |     1|    cat9|        5|
    |     2|    cat4|        6|
    |     2|    cat9|        2|
    |     2|   cat10|        1|
    |     3|    cat1|        5|
    |     3|    cat7|       16|
    |     3|    cat8|        2|
    +------+--------+---------+
    

    Just run:

    val pivoted = df.groupBy("userID").pivot("category").avg("frequency")
    val dfZeros = pivoted.na.fill(0)
    dfZeros.show
    +------+----+-----+----+----+----+----+----+                                    
    |userID|cat1|cat10|cat2|cat4|cat7|cat8|cat9|
    +------+----+-----+----+----+----+----+----+
    |     1| 1.0|  0.0| 3.0| 0.0| 0.0| 0.0| 5.0|
    |     3| 5.0|  0.0| 0.0| 0.0|16.0| 2.0| 0.0|
    |     2| 0.0|  1.0| 0.0| 6.0| 0.0| 0.0| 2.0|
    +------+----+-----+----+----+----+----+----+
    

    Finally, use VectorAssembler to combine the category columns into a single org.apache.spark.ml.linalg.Vector column.
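A minimal sketch of that last step, assuming the same sample data as above and a local Spark session (in spark-shell, `getOrCreate` simply reuses the existing session). The names `featureCols`, `assembler` and `assembled` are illustrative, not part of the original answer.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("pivot-features").getOrCreate()
import spark.implicits._

val df = Seq((1, "cat1", 1), (1, "cat2", 3), (1, "cat9", 5),
             (2, "cat4", 6), (2, "cat9", 2), (2, "cat10", 1),
             (3, "cat1", 5), (3, "cat7", 16), (3, "cat8", 2)).toDF("userID", "category", "frequency")

// One column per category, with missing (user, category) pairs filled with 0.
val dfZeros = df.groupBy("userID").pivot("category").avg("frequency").na.fill(0)

// Every column except the key becomes one slot of the feature vector.
val featureCols = dfZeros.columns.filter(_ != "userID")

val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")

val assembled = assembler.transform(dfZeros).select("userID", "features")
assembled.show(false)
```

The vector slots follow the order of `featureCols`, so keep that array around if you need to map a slot back to its category later.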

    NOTE: I have not checked the performance of this yet...

    EDIT: Possibly more complex, but likely more efficient!

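    import org.apache.spark.ml.feature.StringIndexer
    import org.apache.spark.ml.linalg.{Vector, Vectors}
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.{collect_list, sort_array, struct, udf}

    // Builds a sparse vector of length `size` from (idx, val) structs sorted by idx.
    def toSparseVectorUdf(size: Int) = udf[Vector, Seq[Row]] {
      (data: Seq[Row]) => {
        val indices = data.map(_.getDouble(0).toInt).toArray  // StringIndexer emits Double indices
        val values = data.map(_.getInt(1).toDouble).toArray   // frequency is an Int column
        Vectors.sparse(size, indices, values)
      }
    }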
    
    val indexer = new StringIndexer().setInputCol("category").setOutputCol("idx")
    val indexerModel = indexer.fit(df)
    val totalCategories = indexerModel.labels.size
    val dataWithIndices = indexerModel.transform(df)
    val data = dataWithIndices.groupBy("userId").agg(sort_array(collect_list(struct($"idx", $"frequency".as("val")))).as("data"))
    val dataWithFeatures = data.withColumn("features", toSparseVectorUdf(totalCategories)($"data")).drop("data")
    dataWithFeatures.show(false)
    +------+--------------------------+
    |userId|features                  |
    +------+--------------------------+
    |1     |(7,[0,1,3],[1.0,5.0,3.0]) |
    |3     |(7,[0,2,4],[5.0,16.0,2.0])|
    |2     |(7,[1,5,6],[2.0,6.0,1.0]) |
    +------+--------------------------+
    

    NOTE: By default, StringIndexer orders the labels by how often each category appears in the data (number of rows, not the frequency column), so the most frequent category gets index 0 in indexerModel.labels. Feel free to build your own category-to-index mapping instead and feed the resulting indices to toSparseVectorUdf.
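One way to use your own mapping, sketched under a few assumptions: the categories fit in driver memory, each (userID, category) pair appears at most once, and alphabetical order is an acceptable index. `categoryIndex` and `toSparseFromNames` are hypothetical names, and this variant looks up the index inside the UDF rather than reusing toSparseVectorUdf unchanged.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{collect_list, struct, udf}

val spark = SparkSession.builder().master("local[*]").appName("custom-mapping").getOrCreate()
import spark.implicits._

val df = Seq((1, "cat1", 1), (1, "cat2", 3), (1, "cat9", 5),
             (2, "cat4", 6), (2, "cat9", 2), (2, "cat10", 1),
             (3, "cat1", 5), (3, "cat7", 16), (3, "cat8", 2)).toDF("userID", "category", "frequency")

// Hypothetical mapping: categories sorted alphabetically, so indices are stable across runs.
val categoryIndex: Map[String, Int] =
  df.select("category").distinct().as[String].collect().sorted.zipWithIndex.toMap

// Turns a list of (category, frequency) structs into a sparse vector.
// The Seq[(Int, Double)] overload of Vectors.sparse sorts the entries by index itself.
val toSparseFromNames = udf { (data: Seq[Row]) =>
  val elements = data.map(r => (categoryIndex(r.getString(0)), r.getInt(1).toDouble))
  Vectors.sparse(categoryIndex.size, elements)
}

val grouped = df.groupBy("userID").agg(collect_list(struct($"category", $"frequency")).as("data"))
val features = grouped.withColumn("features", toSparseFromNames($"data")).drop("data")
features.show(false)
```

Because the map is captured in the UDF closure, it is shipped to the executors with each task; for a very large category set a broadcast variable would likely be the better choice.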
