How to map features from the output of a VectorAssembler back to the column names in Spark ML?

Asked 2020-12-01 02:02

I'm trying to run a linear regression in PySpark and I want to create a table containing summary statistics such as coefficients, p-values and t-values for each column in my model.

3 Answers
  • 2020-12-01 02:54

    As of today Spark doesn't provide any method that can do this for you, so you have to create your own mapping. Let's say your data looks like this:

    import random
    random.seed(1)
    
    df = sc.parallelize([(
        random.choice([0.0, 1.0]), 
        random.choice(["a", "b", "c"]),
        random.choice(["foo", "bar"]),
        random.randint(0, 100),
        random.random(),
    ) for _ in range(100)]).toDF(["label", "x1", "x2", "x3", "x4"])
    

    and is processed using the following pipeline:

    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
    from pyspark.ml import Pipeline
    from pyspark.ml.regression import LinearRegression
    
    indexers = [
        StringIndexer(inputCol=c, outputCol="{}_idx".format(c))
        for c in ["x1", "x2"]]
    encoders = [
        OneHotEncoder(
            inputCol=idx.getOutputCol(),
            outputCol="{0}_enc".format(idx.getOutputCol())) for idx in indexers]
    assembler = VectorAssembler(
        inputCols=[enc.getOutputCol() for enc in encoders] + ["x3", "x4"],
        outputCol="features")
    
    pipeline = Pipeline(
        stages=indexers + encoders + [assembler, LinearRegression()])
    model = pipeline.fit(df)
    

    Get the LinearRegressionModel:

    lrm = model.stages[-1]
    

    Transform the data:

    transformed = model.transform(df)
    

    Extract and flatten ML attributes:

    from itertools import chain
    
    attrs = sorted(
        (attr["idx"], attr["name"]) for attr in (chain(*transformed
            .schema[lrm.summary.featuresCol]
            .metadata["ml_attr"]["attrs"].values())))
    
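    Under the hood, `transformed.schema["features"].metadata["ml_attr"]["attrs"]` is just a nested Python dict, so the flattening step can be understood (and tested) without a running Spark session. A minimal sketch, with hand-written attribute entries mirroring the pipeline above:

```python
from itertools import chain

# Illustrative shape of transformed.schema["features"].metadata["ml_attr"]["attrs"];
# the feature names mirror the pipeline above, but this dict is written by hand here.
attrs_meta = {
    "binary": [
        {"idx": 0, "name": "x1_idx_enc_a"},
        {"idx": 1, "name": "x1_idx_enc_c"},
        {"idx": 2, "name": "x2_idx_enc_foo"},
    ],
    "numeric": [
        {"idx": 3, "name": "x3"},
        {"idx": 4, "name": "x4"},
    ],
}

# Same flattening as above: merge every attribute group and sort by vector index
attrs = sorted((a["idx"], a["name"]) for a in chain(*attrs_meta.values()))
```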

    and map to the output:

    [(name, lrm.summary.pValues[idx]) for idx, name in attrs]
    
    [('x1_idx_enc_a', 0.26400012641279824),
     ('x1_idx_enc_c', 0.06320192217171572),
     ('x2_idx_enc_foo', 0.40447778902400433),
     ('x3', 0.1081883594783335),
     ('x4', 0.4545851609776568)]
    
    [(name, lrm.coefficients[idx]) for idx, name in attrs]
    
    [('x1_idx_enc_a', 0.13874401585637453),
     ('x1_idx_enc_c', 0.23498565469334595),
     ('x2_idx_enc_foo', -0.083558932128022873),
     ('x3', 0.0030186112903237442),
     ('x4', -0.12951394186593695)]
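    To get the single summary table the question asks for, you can zip the mapped names with the statistics. A sketch using plain Python lists standing in for `lrm.coefficients` and `lrm.summary.pValues` (the numbers below are placeholders, not real model output):

```python
# Placeholder values standing in for lrm.coefficients and lrm.summary.pValues
attrs = [(0, "x1_idx_enc_a"), (1, "x1_idx_enc_c"), (2, "x2_idx_enc_foo"),
         (3, "x3"), (4, "x4")]
coefficients = [0.139, 0.235, -0.084, 0.003, -0.130]
p_values = [0.264, 0.063, 0.404, 0.108, 0.455]

# One row per feature: name, coefficient, p-value
summary = [
    {"feature": name, "coefficient": coefficients[idx], "p_value": p_values[idx]}
    for idx, name in attrs
]
```

    The same row structure drops straight into `pd.DataFrame(summary)` if you prefer a pandas table.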
    
  • 2020-12-01 02:58

    Here's a one-line answer (`train_downsampled` and `all_features` are the answerer's own DataFrame and feature column; substitute your own names):

    [x["name"] for x in sorted(train_downsampled.schema["all_features"].metadata["ml_attr"]["attrs"]["binary"]+
       train_downsampled.schema["all_features"].metadata["ml_attr"]["attrs"]["numeric"], 
       key=lambda x: x["idx"])]
    

    Thanks to @pratiklodha for the core of this.
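    Note the one-liner hard-codes the "binary" and "numeric" groups, which raises a `KeyError` if one of them is absent (e.g. a pipeline with no one-hot-encoded features). A hypothetical helper that iterates over whatever groups exist instead:

```python
def feature_names_in_order(ml_attrs):
    """Flatten all attribute groups (e.g. 'binary', 'numeric') and sort by idx.

    ml_attrs is the dict found at df.schema[col].metadata["ml_attr"]["attrs"].
    """
    merged = [a for group in ml_attrs.values() for a in group]
    return [a["name"] for a in sorted(merged, key=lambda a: a["idx"])]
```

    Called as `feature_names_in_order(train_downsampled.schema["all_features"].metadata["ml_attr"]["attrs"])`, it returns the feature names in vector order.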

  • 2020-12-01 03:03

    You can see the actual order of the columns here:

    df.schema["features"].metadata["ml_attr"]["attrs"]
    

    There will usually be two groups, "binary" and "numeric":

    pd.DataFrame(df.schema["features"].metadata["ml_attr"]["attrs"]["binary"]+df.schema["features"].metadata["ml_attr"]["attrs"]["numeric"]).sort_values("idx")
    

    This should give the exact order of all the columns.
