Create labeledPoints from Spark DataFrame in Python

天涯浪人 2020-12-08 12:03

What .map() function in Python do I use to create a set of LabeledPoints from a Spark DataFrame? What is the notation if the label/outcome is not t

1 Answer
  • 2020-12-08 12:29

    If you already have numerical features which require no additional transformations, you can use VectorAssembler to combine the columns containing the independent variables:

    from pyspark.ml.feature import VectorAssembler
    
    assembler = VectorAssembler(
        inputCols=["your", "independent", "variables"],
        outputCol="features")
    
    transformed = assembler.transform(parsedData)
    

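    Conceptually, VectorAssembler just concatenates the selected input columns into a single feature vector per row. A minimal plain-Python sketch of that row-wise behaviour (no Spark required; the rows and column names below are hypothetical illustration data, mirroring the snippet above):

    ```python
    # Plain-Python sketch of what VectorAssembler does per row: it
    # concatenates the chosen input columns into one feature vector.
    # Rows and column names are made up for illustration only.
    rows = [
        {"your": 1.0, "independent": 2.0, "variables": 3.0, "outcome_column": 0.0},
        {"your": 4.0, "independent": 5.0, "variables": 6.0, "outcome_column": 1.0},
    ]

    input_cols = ["your", "independent", "variables"]

    def assemble(row, cols):
        """Concatenate the values of `cols` into a single feature list."""
        return [row[c] for c in cols]

    # Like assembler.transform(), keep the original columns and add "features".
    transformed = [dict(row, features=assemble(row, input_cols)) for row in rows]
    print(transformed[0]["features"])  # → [1.0, 2.0, 3.0]
    ```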
    Next you can simply map:

    from pyspark.mllib.regression import LabeledPoint
    from pyspark.sql.functions import col
    
    (transformed.select(col("outcome_column").alias("label"), col("features"))
      .rdd
      .map(lambda row: LabeledPoint(row.label, row.features)))
    
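    A LabeledPoint is essentially just a (label, features) pair, and the map above builds one per row. The same lambda works on a local list exactly as it does on an RDD; here is a sketch with namedtuple stand-ins (no Spark required, data is hypothetical):

    ```python
    from collections import namedtuple

    # Stand-in for pyspark.mllib.regression.LabeledPoint: a (label, features) pair.
    LabeledPoint = namedtuple("LabeledPoint", ["label", "features"])

    # Hypothetical rows, shaped like the output of the select/alias step above.
    Row = namedtuple("Row", ["label", "features"])
    rows = [Row(0.0, [1.0, 2.0, 3.0]), Row(1.0, [4.0, 5.0, 6.0])]

    # The same lambda you would pass to rdd.map():
    points = list(map(lambda row: LabeledPoint(row.label, row.features), rows))
    print(points[0].label)  # → 0.0
    ```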

    As of Spark 2.0 the ml and mllib APIs are no longer compatible, and the latter is heading towards deprecation and removal. If you still need LabeledPoint you'll have to convert ml.Vectors to mllib.Vectors:

    from pyspark.mllib import linalg as mllib_linalg
    from pyspark.ml import linalg as ml_linalg
    
    def as_old(v):
        if isinstance(v, ml_linalg.SparseVector):
            return mllib_linalg.SparseVector(v.size, v.indices, v.values)
        if isinstance(v, ml_linalg.DenseVector):
            return mllib_linalg.DenseVector(v.values)
        raise ValueError("Unsupported type {0}".format(type(v)))
    
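    The conversion is just a type dispatch: a sparse vector keeps its (size, indices, values) triple, a dense vector keeps its values. A self-contained sketch of the same logic using stand-in classes (the real pyspark.ml.linalg / pyspark.mllib.linalg vectors carry the same fields):

    ```python
    # Stand-in classes illustrating the ml -> mllib dispatch; the real
    # pyspark vector classes expose the same attributes used here.
    class MLSparseVector:
        def __init__(self, size, indices, values):
            self.size, self.indices, self.values = size, indices, values

    class MLDenseVector:
        def __init__(self, values):
            self.values = values

    class MLlibSparseVector:
        def __init__(self, size, indices, values):
            self.size, self.indices, self.values = size, indices, values

    class MLlibDenseVector:
        def __init__(self, values):
            self.values = values

    def as_old(v):
        # Dispatch on the concrete vector type, copying its payload.
        if isinstance(v, MLSparseVector):
            return MLlibSparseVector(v.size, v.indices, v.values)
        if isinstance(v, MLDenseVector):
            return MLlibDenseVector(v.values)
        raise ValueError("Unsupported type {0}".format(type(v)))
    ```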

    and map:

    lambda row: LabeledPoint(row.label, as_old(row.features))
    