How to map features from the output of a VectorAssembler back to the column names in Spark ML?

Asked 2020-12-01 02:02

I'm trying to run a linear regression in PySpark and I want to create a table containing summary statistics such as coefficients, p-values and t-values for each column in my model.

3 Answers
  • 2020-12-01 02:54

    As of today Spark doesn't provide any method that can do this for you, so you have to create your own mapping. Let's say your data looks like this:

    import random
    random.seed(1)
    
    df = sc.parallelize([(
        random.choice([0.0, 1.0]), 
        random.choice(["a", "b", "c"]),
        random.choice(["foo", "bar"]),
        random.randint(0, 100),
        random.random(),
    ) for _ in range(100)]).toDF(["label", "x1", "x2", "x3", "x4"])
    

    and is processed using the following pipeline:

    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
    from pyspark.ml import Pipeline
    from pyspark.ml.regression import LinearRegression
    
    indexers = [
        StringIndexer(inputCol=c, outputCol="{}_idx".format(c))
        for c in ["x1", "x2"]]
    encoders = [
        OneHotEncoder(
            inputCol=idx.getOutputCol(),
            outputCol="{0}_enc".format(idx.getOutputCol())) for idx in indexers]
    assembler = VectorAssembler(
        inputCols=[enc.getOutputCol() for enc in encoders] + ["x3", "x4"],
        outputCol="features")
    
    pipeline = Pipeline(
        stages=indexers + encoders + [assembler, LinearRegression()])
    model = pipeline.fit(df)
    

    Get the LinearRegressionModel:

    lrm = model.stages[-1]
    

    Transform the data:

    transformed = model.transform(df)
    

    Extract and flatten ML attributes:

    from itertools import chain
    
    attrs = sorted(
        (attr["idx"], attr["name"]) for attr in (chain(*transformed
            .schema[lrm.summary.featuresCol]
            .metadata["ml_attr"]["attrs"].values())))
    
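    Under the hood, `transformed.schema["features"].metadata["ml_attr"]["attrs"]` is just a nested Python dict, so the flattening step can be understood (and tested) without a running Spark session. A minimal sketch, with hand-written attribute entries mirroring the pipeline above:

```python
from itertools import chain

# Illustrative shape of transformed.schema["features"].metadata["ml_attr"]["attrs"];
# the feature names mirror the pipeline above, but this dict is written by hand here.
attrs_meta = {
    "binary": [
        {"idx": 0, "name": "x1_idx_enc_a"},
        {"idx": 1, "name": "x1_idx_enc_c"},
        {"idx": 2, "name": "x2_idx_enc_foo"},
    ],
    "numeric": [
        {"idx": 3, "name": "x3"},
        {"idx": 4, "name": "x4"},
    ],
}

# Same flattening as above: merge every attribute group and sort by vector index
attrs = sorted((a["idx"], a["name"]) for a in chain(*attrs_meta.values()))
```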

    and map to the output:

    [(name, lrm.summary.pValues[idx]) for idx, name in attrs]
    
    [('x1_idx_enc_a', 0.26400012641279824),
     ('x1_idx_enc_c', 0.06320192217171572),
     ('x2_idx_enc_foo', 0.40447778902400433),
     ('x3', 0.1081883594783335),
     ('x4', 0.4545851609776568)]
    
    [(name, lrm.coefficients[idx]) for idx, name in attrs]
    
    [('x1_idx_enc_a', 0.13874401585637453),
     ('x1_idx_enc_c', 0.23498565469334595),
     ('x2_idx_enc_foo', -0.083558932128022873),
     ('x3', 0.0030186112903237442),
     ('x4', -0.12951394186593695)]
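    To get the single summary table the question asks for, you can zip the mapped names with the statistics. A sketch using plain Python lists standing in for `lrm.coefficients` and `lrm.summary.pValues` (the numbers below are placeholders, not real model output):

```python
# Placeholder values standing in for lrm.coefficients and lrm.summary.pValues
attrs = [(0, "x1_idx_enc_a"), (1, "x1_idx_enc_c"), (2, "x2_idx_enc_foo"),
         (3, "x3"), (4, "x4")]
coefficients = [0.139, 0.235, -0.084, 0.003, -0.130]
p_values = [0.264, 0.063, 0.404, 0.108, 0.455]

# One row per feature: name, coefficient, p-value
summary = [
    {"feature": name, "coefficient": coefficients[idx], "p_value": p_values[idx]}
    for idx, name in attrs
]
```

    The same row structure drops straight into `pd.DataFrame(summary)` if you prefer a pandas table.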
    
  • 2020-12-01 02:58

    Here's a one-line answer (`train_downsampled` and `all_features` are the answerer's own DataFrame and feature column; substitute your own names):

    [x["name"] for x in sorted(train_downsampled.schema["all_features"].metadata["ml_attr"]["attrs"]["binary"]+
       train_downsampled.schema["all_features"].metadata["ml_attr"]["attrs"]["numeric"], 
       key=lambda x: x["idx"])]
    

    Thanks to @pratiklodha for the core of this.
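    Note the one-liner hard-codes the "binary" and "numeric" groups, which raises a `KeyError` if one of them is absent (e.g. a pipeline with no one-hot-encoded features). A hypothetical helper that iterates over whatever groups exist instead:

```python
def feature_names_in_order(ml_attrs):
    """Flatten all attribute groups (e.g. 'binary', 'numeric') and sort by idx.

    ml_attrs is the dict found at df.schema[col].metadata["ml_attr"]["attrs"].
    """
    merged = [a for group in ml_attrs.values() for a in group]
    return [a["name"] for a in sorted(merged, key=lambda a: a["idx"])]
```

    Called as `feature_names_in_order(train_downsampled.schema["all_features"].metadata["ml_attr"]["attrs"])`, it returns the feature names in vector order.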

  • 2020-12-01 03:03

    You can see the actual order of the columns here:

    df.schema["features"].metadata["ml_attr"]["attrs"]
    

    There will usually be two groups, "binary" and "numeric":

    pd.DataFrame(df.schema["features"].metadata["ml_attr"]["attrs"]["binary"]+df.schema["features"].metadata["ml_attr"]["attrs"]["numeric"]).sort_values("idx")
    

    This should give the exact order of all the columns.
