I am trying to apply a new pre-processing algorithm to my dataset, following this answer: Encoding text in ML classifier
In the question/answer the following function