How can I vectorize Tweets using Spark's MLlib?
Question: I'd like to turn tweets into vectors for machine learning, so that I can categorize them based on content using Spark's K-Means clustering. For example, all tweets relating to Amazon would be put into one category. I have tried splitting the tweet into words and creating a vector using HashingTF, which wasn't very successful. Are there any other ways to vectorize tweets?

Answer 1: You can try this pipeline: First, tokenize the input tweet (located in the column text). Basically, it creates a new column
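As a rough illustration of the kind of pipeline being described, here is a minimal Scala sketch that chains a Tokenizer, HashingTF, an optional IDF re-weighting stage, and KMeans. The sample tweets, column names other than text, the feature dimension (1000), the number of clusters (5), and the IDF stage are illustrative assumptions on my part, not values from the original answer:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("TweetClustering").getOrCreate()

// Hypothetical input: a DataFrame with the raw tweets in a "text" column.
val tweets = spark.createDataFrame(Seq(
  (0L, "Just ordered a new book from Amazon"),
  (1L, "Amazon Prime delivery arrived today"),
  (2L, "Watching the game tonight")
)).toDF("id", "text")

// Split each tweet into words (adds a "words" column).
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

// Hash the words into a fixed-size term-frequency vector.
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .setNumFeatures(1000)

// Re-weight term frequencies by inverse document frequency,
// so very common words contribute less to the distance metric.
val idf = new IDF()
  .setInputCol("rawFeatures")
  .setOutputCol("features")

// Cluster the resulting feature vectors.
val kmeans = new KMeans()
  .setK(5)
  .setFeaturesCol("features")
  .setPredictionCol("cluster")

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf, kmeans))
val model = pipeline.fit(tweets)

model.transform(tweets).select("text", "cluster").show(false)
```

If raw HashingTF vectors were not clustering well, adding the IDF stage (TF-IDF) often helps, since it down-weights words that appear in almost every tweet and lets more distinctive terms drive the clustering.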