I am trying to plot the feature importances of certain tree-based models with column names. I am using PySpark.
Since I had both textual categorical variables and numeric ones, the features were assembled into a single vector, and I need to map its indices back to the original column names.
Extract the metadata as shown here by user6910411 (note that chain comes from itertools, so it must be imported):

from itertools import chain

attrs = sorted(
    (attr["idx"], attr["name"])
    for attr in chain(*dataset
        .schema["features"]
        .metadata["ml_attr"]["attrs"]
        .values())
)
and combine with feature importance:
[(name, dtModel_1.featureImportances[idx])
for idx, name in attrs
if dtModel_1.featureImportances[idx]]
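The two snippets above can be sketched end to end without a Spark session by mocking the "ml_attr" metadata dict that a VectorAssembler writes into the schema. The column names and importance values below are made up for illustration, and a plain list stands in for the model's SparseVector of importances:

```python
from itertools import chain

# Mocked metadata, mimicking the assumed structure of
# dataset.schema["features"].metadata["ml_attr"].
ml_attr = {
    "attrs": {
        "numeric": [{"idx": 0, "name": "age"}, {"idx": 2, "name": "income"}],
        "binary": [{"idx": 1, "name": "gender_indexed"}],
    }
}

# Flatten every attribute group and sort by position in the feature vector.
attrs = sorted(
    (attr["idx"], attr["name"])
    for attr in chain(*ml_attr["attrs"].values())
)

# Stand-in for dtModel_1.featureImportances (hypothetical values).
feature_importances = [0.5, 0.0, 0.5]

# Keep only the features with non-zero importance.
named = [(name, feature_importances[idx])
         for idx, name in attrs
         if feature_importances[idx]]
print(named)  # [('age', 0.5), ('income', 0.5)]
```

Filtering on `feature_importances[idx]` drops zero-importance features, which is usually what you want before plotting.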
When creating your assembler you used a list of input columns (assemblerInputs). That order is preserved in the 'features' vector, so you can build a pandas DataFrame directly:
features_imp_pd = (
pd.DataFrame(
dtModel_1.featureImportances.toArray(),
index=assemblerInputs,
columns=['importance'])
)
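Here is the same idea as a self-contained sketch, with a made-up importance array standing in for dtModel_1.featureImportances.toArray(), plus a descending sort for plotting:

```python
import numpy as np
import pandas as pd

# Hypothetical values; in a real pipeline this array comes from
# dtModel_1.featureImportances.toArray().
importances = np.array([0.1, 0.6, 0.3])
# Order of the columns as they were passed to the VectorAssembler.
assemblerInputs = ["age", "gender_indexed", "income"]

features_imp_pd = (
    pd.DataFrame(importances, index=assemblerInputs, columns=["importance"])
      .sort_values("importance", ascending=False)
)
print(features_imp_pd)
```

One caveat: this index alignment only holds when every assembler input is a single numeric column. A vector input (e.g. a one-hot-encoded column) occupies several slots in 'features', in which case the metadata approach above is the safer route.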
The transformed dataset's metadata has the required attributes. Here is an easy way to do it:
Create a pandas DataFrame (the feature list is generally not huge, so there are no memory issues in storing it as a pandas DF):
pandasDF = pd.DataFrame(
    dataset.schema["features"].metadata["ml_attr"]["attrs"]["binary"]
    + dataset.schema["features"].metadata["ml_attr"]["attrs"]["numeric"]
).sort_values("idx")
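This step can be sketched without Spark by mocking the metadata dict. Note that which keys ("binary", "numeric", "nominal") exist under "attrs" depends on your feature types, so it is safer to check before indexing; the names below are made up:

```python
import pandas as pd

# Mocked metadata, mimicking the assumed structure of
# dataset.schema["features"].metadata["ml_attr"].
ml_attr = {
    "attrs": {
        "binary": [{"idx": 1, "name": "gender_indexed"}],
        "numeric": [{"idx": 0, "name": "age"}, {"idx": 2, "name": "income"}],
    }
}

# Concatenate the attribute groups and order by vector position.
pandasDF = pd.DataFrame(
    ml_attr["attrs"].get("binary", []) + ml_attr["attrs"].get("numeric", [])
).sort_values("idx")
print(pandasDF)
```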
Then create a broadcast dictionary to do the mapping; broadcasting is necessary in a distributed environment so each executor gets a read-only copy.
feature_dict = dict(zip(pandasDF["idx"], pandasDF["name"]))
feature_dict_broad = sc.broadcast(feature_dict)
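The dictionary construction and lookup can be sketched in plain Python (sc.broadcast itself needs a live SparkContext, so it is only indicated in comments; the rows are hypothetical):

```python
# Rows like those in pandasDF after sorting by "idx" (made-up values).
rows = [
    {"idx": 0, "name": "age"},
    {"idx": 1, "name": "gender_indexed"},
    {"idx": 2, "name": "income"},
]

# idx -> name lookup table. In Spark you would then wrap it with
# feature_dict_broad = sc.broadcast(feature_dict) so that executors
# share one copy instead of serializing it into every task.
feature_dict = {row["idx"]: row["name"] for row in rows}

# Executor-side code would read it via feature_dict_broad.value[idx]:
print(feature_dict[2])  # income
```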