RDD to LabeledPoint conversion

粉色の甜心 2020-12-19 12:51

If I have an RDD with about 500 columns and 200 million rows, and RDD.columns.indexOf("target", 0) shows Int = 77, which tells me my targeted depen

1 answer
  • 2020-12-19 13:14

    I assume your data looks more or less like this:

    import scala.util.Random.{setSeed, nextDouble}
    setSeed(1)
    
    case class Record(
        foo: Double, target: Double, x1: Double, x2: Double, x3: Double)
    
    val rows = sc.parallelize(
        (1 to 10).map(_ => Record(
            nextDouble, nextDouble, nextDouble, nextDouble, nextDouble
        ))
    )
    val df = sqlContext.createDataFrame(rows)
    df.registerTempTable("df")
    
    sqlContext.sql("""
      SELECT ROUND(foo, 2) foo,
             ROUND(target, 2) target,
             ROUND(x1, 2) x1,
             ROUND(x2, 2) x2,
             ROUND(x2, 2) x3 
      FROM df""").show
    

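    So we have data as below (the query above selects ROUND(x2, 2) as x3, so the displayed x3 column repeats x2; use ROUND(x3, 2) to display the actual x3 values):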

    +----+------+----+----+----+
    | foo|target|  x1|  x2|  x3|
    +----+------+----+----+----+
    |0.73|  0.41|0.21|0.33|0.33|
    |0.01|  0.96|0.94|0.95|0.95|
    | 0.4|  0.35|0.29|0.51|0.51|
    |0.77|  0.66|0.16|0.38|0.38|
    |0.69|  0.81|0.01|0.52|0.52|
    |0.14|  0.48|0.54|0.58|0.58|
    |0.62|  0.18|0.01|0.16|0.16|
    |0.54|  0.97|0.25|0.39|0.39|
    |0.43|  0.23|0.89|0.04|0.04|
    |0.66|  0.12|0.65|0.98|0.98|
    +----+------+----+----+----+
    

    and we want to ignore foo and x2 and extract LabeledPoint(target, Array(x1, x3)):

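    // RDD-based MLlib types used below
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    
    // Map feature names to indices
    val featInd = List("x1", "x3").map(df.columns.indexOf(_))
    
    // Or, to exclude columns instead (replaces the definition above):
    // val ignored = List("foo", "target", "x2")
    // val featInd = df.columns.diff(ignored).map(df.columns.indexOf(_))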
    
    // Get index of target
    val targetInd = df.columns.indexOf("target") 
    
    df.rdd.map(r => LabeledPoint(
       r.getDouble(targetInd), // Get target value
       // Map feature indices to values
       Vectors.dense(featInd.map(r.getDouble(_)).toArray) 
    ))
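
One caveat with the index lookup above: indexOf returns -1 when a column name is missing or misspelled, and the mistake would then only surface once rows are being read. Below is a minimal plain-Scala sketch of the same name-to-index mapping with an early check. It needs no Spark. ColumnSelector, resolve and toLabelAndFeatures are hypothetical names introduced here for illustration, and the row values are made up rather than taken from the table.

```scala
// Plain-Scala sketch of the column-index lookup; no Spark required.
// ColumnSelector, resolve and toLabelAndFeatures are hypothetical helper names.
object ColumnSelector {

  /** Look up a column's position, failing loudly instead of returning -1. */
  def resolve(columns: Array[String], name: String): Int = {
    val i = columns.indexOf(name)
    require(i >= 0, s"column '$name' not found")
    i
  }

  /** Split one row into (label, features) using precomputed indices. */
  def toLabelAndFeatures(
      row: Array[Double],
      targetInd: Int,
      featInd: Seq[Int]): (Double, Array[Double]) =
    (row(targetInd), featInd.map(row(_)).toArray)

  def main(args: Array[String]): Unit = {
    // Same column layout as the example DataFrame
    val columns = Array("foo", "target", "x1", "x2", "x3")
    val ignored = Set("foo", "target", "x2")

    val targetInd = resolve(columns, "target")
    // Keep every column that is not ignored, then map names to indices
    val featInd = columns.filterNot(ignored.contains).map(resolve(columns, _)).toSeq

    // Illustrative row values (assumed, not from the table above)
    val row = Array(1.0, 2.0, 3.0, 4.0, 5.0)
    val (label, features) = toLabelAndFeatures(row, targetInd, featInd)
    println(s"label=$label features=${features.mkString(",")}")
    // prints: label=2.0 features=3.0,5.0
  }
}
```

In the Spark version, the same check could run on the driver right after computing featInd and targetInd, before calling df.rdd.map.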
    