PySpark random forest feature importance mapping after column transformations
I am trying to plot the feature importances of certain tree-based models together with the column names. I am using PySpark. Since I have both textual categorical variables and numeric ones, I had to use a pipeline, which is something like this:

1. Use StringIndexer to index the string columns.
2. Use OneHotEncoder on the indexed columns.
3. Use a VectorAssembler to create the "features" column containing the feature vector.

Some sample code from the docs for steps 1-3:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

categoricalColumns = ["workclass",