Question
I am trying to plot the feature importances of certain tree-based models with column names. I am using PySpark.
Since I had both textual categorical variables and numeric ones, I had to use a pipeline, which looks something like this:
- use a string indexer to index the string columns
- use a one hot encoder on the indexed categorical columns
- use a VectorAssembler to create the features column containing the feature vector
Some sample code from the docs for steps 1, 2, and 3:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

categoricalColumns = ["workclass", "education", "marital_status", "occupation", "relationship", "race", "sex", "native_country"]
stages = []  # stages in our Pipeline
for categoricalCol in categoricalColumns:
    # Category indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    # Use OneHotEncoder to convert categorical variables into binary SparseVectors
    # encoder = OneHotEncoderEstimator(inputCol=categoricalCol + "Index", outputCol=categoricalCol + "classVec")
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    # Add stages. These are not run here, but will run all at once later on.
    stages += [stringIndexer, encoder]

numericCols = ["age", "fnlwgt", "education_num", "capital_gain", "capital_loss", "hours_per_week"]
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

# Create a Pipeline.
pipeline = Pipeline(stages=stages)
# Run the feature transformations.
#  - fit() computes feature statistics as needed.
#  - transform() actually transforms the features.
pipelineModel = pipeline.fit(dataset)
dataset = pipelineModel.transform(dataset)
Finally, train the model.
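A minimal sketch of that training step, assuming the transformed dataset has a numeric "label" column; the train/test split and the choice of DecisionTreeClassifier are just for illustration (any tree-based estimator exposes featureImportances the same way):

from pyspark.ml.classification import DecisionTreeClassifier

# Split the already-transformed dataset and fit a tree-based model on the "features" vector.
train, test = dataset.randomSplit([0.7, 0.3], seed=42)
dt = DecisionTreeClassifier(featuresCol="features", labelCol="label")
dtModel_1 = dt.fit(train)
predictions = dtModel_1.transform(test)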
After training and evaluation, I can use model.featureImportances to get the feature rankings; however, I don't get the feature/column names, just the feature indices, something like this:
print dtModel_1.featureImportances

(38895,[38708,38714,38719,38720,38737,38870,38894],[0.0742343395738,0.169404823667,0.100485791055,0.0105823115814,0.0134236162982,0.194124862158,0.437744255667])
How do I map these back to the initial column names and their values, so that I can plot them?
Answer 1:
Extract the metadata as shown here by user6910411:
from itertools import chain

# The VectorAssembler output column carries ML attribute metadata with the
# original column names and their positions in the feature vector.
attrs = sorted(
    (attr["idx"], attr["name"])
    for attr in chain(*dataset
        .schema["features"]
        .metadata["ml_attr"]["attrs"]
        .values()))
and combine it with the feature importances:
[(name, dtModel_1.featureImportances[idx])
for idx, name in attrs
if dtModel_1.featureImportances[idx]]
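Since the question asks about plotting, the (name, importance) pairs above can be fed straight into matplotlib; a minimal sketch (the sorting and horizontal bar layout are just one way to present it):

import matplotlib.pyplot as plt

# Keep only non-zero importances and sort them so the bars read from smallest to largest.
pairs = sorted(
    ((name, dtModel_1.featureImportances[idx]) for idx, name in attrs
     if dtModel_1.featureImportances[idx]),
    key=lambda x: x[1])
names, scores = zip(*pairs)

plt.barh(range(len(names)), scores)
plt.yticks(range(len(names)), names)
plt.xlabel("feature importance")
plt.tight_layout()
plt.show()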
Answer 2:
The transformed dataset's metadata has the required attributes. Here is an easy way to do it:
Create a pandas DataFrame (the feature list is generally not huge, so there are no memory issues in storing it as a pandas DF):
import pandas as pd

pandasDF = pd.DataFrame(
    dataset.schema["features"].metadata["ml_attr"]["attrs"]["binary"] +
    dataset.schema["features"].metadata["ml_attr"]["attrs"]["numeric"]
).sort_values("idx")
Then create a broadcast dictionary for the mapping. Broadcasting is necessary in a distributed environment:
feature_dict = dict(zip(pandasDF["idx"], pandasDF["name"]))
feature_dict_broad = sc.broadcast(feature_dict)
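The broadcast dictionary can then translate the indices of the SparseVector returned by featureImportances into column names; a sketch assuming a fitted tree model dtModel_1 as in the question:

# featureImportances is a SparseVector: (size, [indices], [values])
importances = dtModel_1.featureImportances
named_importances = sorted(
    ((feature_dict_broad.value[int(idx)], float(score))
     for idx, score in zip(importances.indices, importances.values)),
    key=lambda x: x[1], reverse=True)
print(named_importances)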
Source: https://stackoverflow.com/questions/50937591/pyspark-random-forest-feature-importance-mapping-after-column-transformations