Apache Spark MLlib: how to build labeled points for string features?


Question


I am trying to build a NaiveBayes classifier with Spark's MLlib which takes a set of documents as input.

I'd like to use some things as features (i.e. authors, explicit tags, implicit keywords, category), but looking at the documentation it seems that a LabeledPoint contains only doubles, i.e. it looks like LabeledPoint[Double, List[Pair[Double, Double]]].

Instead, what I have as output from the rest of my code would be something like LabeledPoint[Double, List[Pair[String, Double]]].

I could make up my own conversion, but it seems odd. How am I supposed to handle this using MLlib?

I believe the answer is in the HashingTF class (i.e. hashing features), but I don't understand how it works: it appears to take some sort of capacity value, but my list of keywords and topics is effectively unbounded (or rather, unknown at the start).


Answer 1:


HashingTF uses the hashing trick to map a potentially unbounded number of features to a vector of bounded size. Feature collisions are possible, but their probability can be reduced by choosing a larger number of features in the constructor.

In order to create features based not only on the content of a feature but also on some metadata (e.g. having a tag of 'cats' as opposed to having the word 'cats' in the document), you could feed the HashingTF class something like 'tag:cats', so that a tag containing a word hashes to a different slot than the word by itself.
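A minimal sketch of that prefixing idea (the numFeatures value and the tokens below are made-up illustrations, not from the original answer):

import org.apache.spark.mllib.feature.HashingTF

// Fix the vector size up front; more slots means fewer collisions.
val hashingTF = new HashingTF(numFeatures = 1 << 20)

// A hypothetical document: body words plus prefixed metadata tokens.
val tokens = Seq("cats", "are", "great", "tag:cats", "author:jane")

// Each distinct string (word or prefixed tag) hashes to its own slot.
val tf = hashingTF.transform(tokens)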

If you've created feature-count vectors using HashingTF, you can turn them into bag-of-words features by setting any count above zero to 1. You can also create TF-IDF vectors using the IDF class like so:

val tfIdf = new IDF().fit(featureCounts).transform(featureCounts)
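Putting the pieces together, here is a hedged end-to-end sketch of hashing documents and training the NaiveBayes classifier from the question; the docs RDD, its (label, tokens) shape, and the lambda value are assumptions for illustration:

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Assumed input: (label, tokens), where tokens already include any
// prefixed metadata such as "tag:cats".
def train(docs: RDD[(Double, Seq[String])]) = {
  val hashingTF = new HashingTF(1 << 18)
  val featureCounts = hashingTF.transform(docs.map(_._2))

  // TF-IDF weighting, as in the snippet above.
  val tfIdf = new IDF().fit(featureCounts).transform(featureCounts)

  // zip is safe here: both RDDs derive from docs via map, so their
  // partitioning and per-partition element counts line up.
  val training = docs.map(_._1).zip(tfIdf).map { case (label, vector) =>
    LabeledPoint(label, vector)
  }
  NaiveBayes.train(training, lambda = 1.0)
}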

In your case, it looks like you've already computed the counts of words per document. That won't work directly with the HashingTF class, since it is designed to do the counting for you.
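That said, as a workaround not covered in the original answer, HashingTF exposes its hash function through indexOf, so one could hash precomputed (term, count) pairs and assemble the sparse vector by hand; this sketch merges any terms that collide into the same slot:

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vectors

val hashingTF = new HashingTF(1 << 18)

def toVector(termCounts: Seq[(String, Double)]) = {
  val indexed = termCounts
    .map { case (term, count) => (hashingTF.indexOf(term), count) }
    .groupBy(_._1)              // colliding terms share a slot,
    .mapValues(_.map(_._2).sum) // so sum their counts
    .toSeq
  Vectors.sparse(hashingTF.numFeatures, indexed)
}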

This paper has some arguments about why feature collisions aren't much of a problem in language applications. The essential reasons are that most words are uncommon (a property of natural language) and that collisions are independent of word frequency (a property of hashing), so it is unlikely that two words common enough to help a model will hash to the same slot.



Source: https://stackoverflow.com/questions/27334694/apache-spark-mllib-how-to-build-labeled-points-for-string-features
