PySpark random forest feature importance mapping after column transformations

遥遥无期 2020-12-10 08:43

I am trying to plot the feature importances of certain tree-based models together with their column names. I am using PySpark.

Since I had textual categorical variables and numeri

3 Answers
  •  再見小時候
    2020-12-10 09:21

    The metadata of the transformed dataset has the required attributes. Here is an easy way to do it:

    1. Create a pandas DataFrame (the feature list is generally not huge, so storing it in a pandas DataFrame causes no memory issues).

      import pandas as pd
      attrs = dataset.schema["features"].metadata["ml_attr"]["attrs"]
      # combine the binary and numeric attributes, ordered by their index in the feature vector
      pandasDF = pd.DataFrame(attrs["binary"] + attrs["numeric"]).sort_values("idx")
      
    2. Then create a broadcast dictionary for the mapping. Broadcasting is necessary in a distributed environment.

      feature_dict = dict(zip(pandasDF["idx"],pandasDF["name"])) 
      
      feature_dict_broad = sc.broadcast(feature_dict)
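
    The two steps above stop at the index-to-name dictionary. A minimal sketch of the remaining lookup might look like the following. It uses plain Python only, so it runs without Spark. The helper `map_importances` and the sample values are hypothetical; in a real job `attrs` would come from `dataset.schema["features"].metadata["ml_attr"]["attrs"]` and the importances from something like `model.featureImportances.toArray()`. It also assumes every attribute entry carries `idx` and `name`, and it loops over all attribute groups rather than only `binary` and `numeric`.

```python
def map_importances(attrs, importances):
    """Pair each feature name with its importance, highest first."""
    # Flatten every attribute group (e.g. "numeric", "binary") into one idx -> name dict.
    idx_to_name = {a["idx"]: a["name"] for group in attrs.values() for a in group}
    # Fall back to a positional placeholder name if an index has no metadata entry.
    pairs = [
        (idx_to_name.get(i, f"feature_{i}"), float(score))
        for i, score in enumerate(importances)
    ]
    return sorted(pairs, key=lambda p: p[1], reverse=True)


# Made-up metadata shaped like the "attrs" entry of the features column (assumption).
sample_attrs = {
    "numeric": [{"idx": 0, "name": "age"}, {"idx": 3, "name": "income"}],
    "binary": [{"idx": 1, "name": "city_vec_NY"}, {"idx": 2, "name": "city_vec_LA"}],
}
# Made-up values standing in for the model's feature importances (assumption).
sample_importances = [0.5, 0.125, 0.075, 0.3]

ranked = map_importances(sample_attrs, sample_importances)
for name, score in ranked:
    print(f"{name}: {score}")
```

    The resulting list of (name, importance) pairs can be passed to any plotting library. When the lookup happens only on the driver, as it does for plotting, a plain dictionary is probably enough; the broadcast variable matters once the mapping is read inside code that runs on the executors.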
      
