I am using Spark and pyspark, and I have a pipeline set up with a bunch of StringIndexer objects that I use to encode the string columns to columns of indices:
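
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer

    # ignore_columns is a set of column names that should not be indexed
    indexers = [
        StringIndexer(inputCol=column, outputCol=column + '_index').setHandleInvalid('skip')
        for column in list(set(data_frame.columns) - ignore_columns)
    ]
    pipeline = Pipeline(stages=indexers)
    new_data_frame = pipeline.fit(data_frame).transform(data_frame)
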
The problem is that I need to get the list of labels for each StringIndexer object after it gets fitted. For a single column and a single StringIndexer without a pipeline, it's an easy task. I can just access the labels attribute after fitting the indexer on the DataFrame:

    indexer = StringIndexer(inputCol="name", outputCol="name_index")
    indexer_fitted = indexer.fit(data_frame)
    labels = indexer_fitted.labels
    new_data_frame = indexer_fitted.transform(data_frame)

However, when I use the pipeline, this doesn't seem possible, or at least I don't know how to do it.
So I guess my question comes down to: Is there a way to access the labels that were used during the indexing process for each individual column?
Or will I have to ditch the pipeline in this use case and, for example, loop through the list of StringIndexer objects and do it manually? (I'm sure that would be possible. However, using the pipeline would just be a lot nicer.)
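
For reference, here is a rough sketch of the manual fallback I have in mind. The helper names fit_indexers and labels_by_column are made up for illustration, and I am assuming that fitting and transforming one column at a time is an acceptable substitute for the pipeline. It only uses the same calls as above (fit, transform, and the labels attribute of the fitted indexer):

```python
def fit_indexers(data_frame, columns):
    """Fit one StringIndexer per column, keeping every fitted model.

    Hypothetical helper: pyspark is imported inside the function so that
    labels_by_column below can be used without Spark being installed.
    """
    from pyspark.ml.feature import StringIndexer

    fitted = {}
    for column in columns:
        indexer = StringIndexer(
            inputCol=column, outputCol=column + "_index"
        ).setHandleInvalid("skip")
        model = indexer.fit(data_frame)
        fitted[column] = model
        # Apply the fitted model right away so the next indexer sees
        # the frame with the new index column already added.
        data_frame = model.transform(data_frame)
    return data_frame, fitted


def labels_by_column(fitted):
    """Map each input column name to the labels of its fitted indexer."""
    return {column: list(model.labels) for column, model in fitted.items()}
```

Usage would be something like new_data_frame, fitted = fit_indexers(data_frame, set(data_frame.columns) - ignore_columns), followed by labels_by_column(fitted). It works, but it duplicates what the pipeline already does for me, which is why I would prefer to read the labels from the fitted pipeline instead.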