PySpark random forest feature importance mapping after column transformations

遥遥无期 2020-12-10 08:43

I am trying to plot the feature importances of certain tree-based models with column names. I am using PySpark.

Since I had textual categorical variables and numeric ones, the columns go through several transformations (string indexing, one-hot encoding, vector assembly) before the model is fit, so the resulting feature importances no longer line up with the original column names. How can I map them back?
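For context, a pipeline of the kind described might look like the sketch below; the column names are purely illustrative assumptions, and the OneHotEncoder call assumes Spark 3.x:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    # Hypothetical columns: "city" is categorical, "age" and "income" numeric.
    indexer = StringIndexer(inputCol="city", outputCol="city_idx")
    encoder = OneHotEncoder(inputCol="city_idx", outputCol="city_vec")
    assemblerInputs = ["city_vec", "age", "income"]
    assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")

    pipeline = Pipeline(stages=[indexer, encoder, assembler])
    dataset = pipeline.fit(df).transform(df)  # df is the raw input DataFrame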

3 Answers
  • 2020-12-10 09:08

    Extract the metadata as shown by user6910411

    from itertools import chain

    # Collect an (index, name) pair for every feature from the ML attribute
    # metadata attached to the assembled "features" column.
    attrs = sorted(
        (attr["idx"], attr["name"])
        for attr in chain(*dataset
            .schema["features"]
            .metadata["ml_attr"]["attrs"].values()))

    and combine it with the feature importances:

    # Pair each feature name with its importance, keeping only
    # features whose importance is non-zero.
    [(name, dtModel_1.featureImportances[idx])
     for idx, name in attrs
     if dtModel_1.featureImportances[idx]]
    
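    Since the goal is a plot, the resulting pairs can be sorted and passed straight to matplotlib; a minimal sketch building on attrs and dtModel_1 above (the plotting choices are assumptions, not part of the answer):

    import matplotlib.pyplot as plt

    # Sort the (name, importance) pairs so the largest bar ends up on top.
    pairs = sorted(
        ((name, dtModel_1.featureImportances[idx])
         for idx, name in attrs
         if dtModel_1.featureImportances[idx]),
        key=lambda p: p[1])

    names, scores = zip(*pairs)
    plt.barh(names, scores)
    plt.xlabel("importance")
    plt.tight_layout()
    plt.show()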
  • When creating your assembler, you used a list of input columns (assemblerInputs). Their order is preserved in the 'features' vector, so you can simply build a pandas DataFrame:

    import pandas as pd

    # toArray() returns one value per slot of the feature vector; this lines
    # up with assemblerInputs only when each input occupies a single slot
    # (see the note below).
    features_imp_pd = pd.DataFrame(
        dtModel_1.featureImportances.toArray(),
        index=assemblerInputs,
        columns=['importance'])
    
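    Note that this mapping holds only when no input column was expanded by a OneHotEncoder into multiple vector slots; otherwise the lengths differ and pandas raises an error. A small sanity check, reusing the names above (the check itself is an addition, not part of the answer):

    # Valid only if the number of vector slots equals the number of inputs,
    # i.e. no one-hot expansion took place.
    n_slots = len(dtModel_1.featureImportances.toArray())
    assert n_slots == len(assemblerInputs), \
        "some inputs were expanded; use a metadata-based approach instead"

    # Sort for inspection or plotting.
    features_imp_pd = features_imp_pd.sort_values('importance', ascending=False)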
  • 2020-12-10 09:21

    The transformed dataset's metadata has the required attributes. Here is an easy way to do it:

    1. Create a pandas DataFrame (the feature list is generally not huge, so there are no memory issues in storing it as a pandas DF):

      import pandas as pd

      # "binary" attrs come from one-hot encoded columns and "numeric" attrs
      # from numeric ones; each entry carries an idx and a name.
      ml_attrs = dataset.schema["features"].metadata["ml_attr"]["attrs"]
      pandasDF = pd.DataFrame(ml_attrs["binary"] + ml_attrs["numeric"]).sort_values("idx")
    2. Then create a broadcast dictionary for the mapping; broadcasting is necessary in a distributed environment (a usage sketch follows after this step).

      # Map feature index -> feature name, and ship it to the executors.
      feature_dict = dict(zip(pandasDF["idx"], pandasDF["name"]))

      feature_dict_broad = sc.broadcast(feature_dict)
      
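    As an illustration of why the broadcast matters, here is a hypothetical executor-side use of the dictionary; the UDF, the feature_idx column, and the "unknown" fallback are assumptions, not part of the answer:

      from pyspark.sql.functions import udf
      from pyspark.sql.types import StringType

      # Resolve a feature index to its name on the executors via the
      # broadcast dictionary (feature_idx is an assumed integer column).
      @udf(StringType())
      def feature_name(idx):
          return feature_dict_broad.value.get(idx, "unknown")

      # dataset.withColumn("top_feature", feature_name("feature_idx"))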