How can I vectorize Tweets using Spark's MLLib?

Submitted by 流过昼夜 on 2019-12-14 04:10:50

Question


I'd like to turn tweets into vectors for machine learning, so that I can categorize them by content using Spark's K-Means clustering. For example, all tweets relating to Amazon would be put into one category.

I have tried splitting the tweet into words and creating a vector using HashingTF, which wasn't very successful.

Are there any other ways to vectorize tweets?


Answer 1:


You can try this pipeline:

First, tokenize the input tweet (located in the text column). This creates a new column, rawWords, containing the list of words taken from the original text. The tokenizer extracts those words by matching runs of alphanumeric characters (.setPattern("\\w+") together with .setGaps(false)).

import org.apache.spark.ml.feature.RegexTokenizer

val tokenizer = new RegexTokenizer()
 .setInputCol("text")
 .setOutputCol("rawWords")
 .setPattern("\\w+")
 .setGaps(false)

Second, you may consider removing stop words, i.e. the less significant words in the text, such as a, the, of, etc.

import org.apache.spark.ml.feature.StopWordsRemover

val stopWordsRemover = new StopWordsRemover()
 .setInputCol("rawWords")
 .setOutputCol("words")

Now it's time to vectorize the words column. In this example I'm using CountVectorizer, which is quite basic. There are others, such as the TF-IDF vectorizer. You can find more information in the Spark ML documentation.

I've configured the CountVectorizer so that it builds a vocabulary of at most 10,000 words, keeping only terms that appear in at least 5 documents (minDF) and counting a term toward a document's vector only if it occurs there at least once (minTF).

import org.apache.spark.ml.feature.CountVectorizer

val countVectorizer = new CountVectorizer()
 .setInputCol("words")
 .setOutputCol("features")
 .setVocabSize(10000)
 .setMinDF(5.0)
 .setMinTF(1.0)

Finally, just create the pipeline, fit it on the training dataset, and use the resulting model to transform your data.

import org.apache.spark.ml.Pipeline

val transformPipeline = new Pipeline()
 .setStages(Array(
   tokenizer,
   stopWordsRemover,
   countVectorizer))

transformPipeline.fit(training).transform(test)
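Since the original question is about K-Means clustering, the features column produced above can be fed straight into spark.ml's KMeans by appending it as a fourth pipeline stage. A minimal sketch (the choice of k = 10 and the column name "cluster" are assumptions for illustration, not part of the answer above):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans

// Cluster the vectorized tweets. KMeans reads the "features" vector
// produced by the CountVectorizer stage.
val kmeans = new KMeans()
  .setFeaturesCol("features")
  .setPredictionCol("cluster")
  .setK(10) // assumed number of topics; tune this for your data

val clusteringPipeline = new Pipeline()
  .setStages(Array(tokenizer, stopWordsRemover, countVectorizer, kmeans))

// After transform, each tweet row gains a "cluster" column with its group id.
val model = clusteringPipeline.fit(training)
val clustered = model.transform(test)
```

Tweets about the same subject (e.g. Amazon) should then tend to land in the same cluster, though with raw counts rather than TF-IDF weights the grouping can be dominated by common words.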

Hope it helps.



Source: https://stackoverflow.com/questions/51581688/how-can-i-vectorize-tweets-using-sparks-mllib
