How to encode categorical features in Apache Spark

huitseeker

You can use spark.ml's OneHotEncoder.

You first use:

OneHotEncoder.categories(rdd, categoricalFields)

where categoricalFields is the sequence of indexes at which your RDD contains categorical data. Given a dataset and the indexes of the columns that hold categorical variables, categories returns a structure that, for each of those fields, describes the values present in the dataset. That map is meant to be used as input to the encode method:

OneHotEncoder.encode(rdd, categories)

This returns your vectorized RDD[Array[T]].
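For reference, if a DataFrame-based workflow is an option, one-hot encoding in spark.ml is usually done with a StringIndexer followed by a OneHotEncoder. The sketch below assumes the Spark 2.x API (in 2.3+ this role moved to OneHotEncoderEstimator, and in 3.x OneHotEncoder itself became a multi-column estimator), and the example data is made up:

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val df = spark.createDataFrame(Seq(
    (0, "class1"), (1, "class1"), (2, "class2")
)).toDF("id", "category")

// Map each string category to a numeric index first
val categoryIndexer = new StringIndexer()
    .setInputCol("category")
    .setOutputCol("categoryIndex")
    .fit(df)

// Then expand the index column into a sparse 0/1 vector
val encoder = new OneHotEncoder()
    .setInputCol("categoryIndex")
    .setOutputCol("categoryVec")

val encoded = encoder.transform(categoryIndexer.transform(df))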

If using the built-in OneHotEncoder is not an option and you have only a single variable, implementing a poor man's one-hot encoding is more or less straightforward. First, let's create some example data:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

val rdd = sc.parallelize(List(
    Array("user1", "class1", "product1"),
    Array("user1", "class1", "product2"),
    Array("user1", "class1", "product5"),
    Array("user2", "class1", "product2"),
    Array("user2", "class1", "product5"),
    Array("user3", "class2", "product1")))

Next, we have to create a mapping from each value to an index:

val prodMap = sc.broadcast(rdd.map(_(2)).distinct.zipWithIndex.collectAsMap)

and a simple encoding function:

def encodeProducts(products: Iterable[String]): Vector = {
    // Dedupe before encoding: Vectors.sparse rejects repeated indices,
    // so a group containing the same product twice would otherwise fail
    Vectors.sparse(
        prodMap.value.size,
        products.toSeq.distinct.map(product => (prodMap.value(product).toInt, 1.0))
    )
}

Finally, we can apply it to the dataset:

rdd.map(x => ((x(0), x(1)), x(2))).groupByKey.mapValues(encodeProducts)
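For the example data above this yields one sparse vector per (user, class) pair, along these lines (the exact index of each product depends on how zipWithIndex happened to number them):

// ((user1,class1),(3,[0,1,2],[1.0,1.0,1.0]))
// ((user2,class1),(3,[1,2],[1.0,1.0]))
// ((user3,class2),(3,[0],[1.0]))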

It is relatively easy to extend the above to handle multiple variables; one possible approach is sketched below.
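The following sketch (my own illustration, not part of the original answer) one-hot encodes several categorical columns into a single sparse vector by giving each column its own block of indexes; it reuses rdd, Vector and Vectors from above:

val catCols = Seq(1, 2)  // positions of the categorical columns (class, product)

// One value -> index map per categorical column
val maps = catCols.map(i => rdd.map(_(i)).distinct.zipWithIndex.collectAsMap)
val sizes = maps.map(_.size)
val offsets = sizes.scanLeft(0)(_ + _).init  // start of each column's index block
val total = sizes.sum

val bMaps = sc.broadcast(maps)

def encodeRow(row: Array[String]): Vector = Vectors.sparse(
    total,
    catCols.indices.map { j =>
        (offsets(j) + bMaps.value(j)(row(catCols(j))).toInt, 1.0)
    }
)

val encoded = rdd.map(row => (row(0), encodeRow(row)))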

Edit:

If the number of products is too large for broadcasting to be useful, it should be possible to use a join instead. First, we can create a similar mapping from product to index, but keep it as an RDD:

import org.apache.spark.HashPartitioner

val nPartitions = ???

val prodMapRDD = rdd
     .map(_(2))
     .distinct
     .zipWithIndex
     .partitionBy(new HashPartitioner(nPartitions))
     .cache

val nProducts = prodMapRDD.count // Should be < Int.MaxValue

Next, we reshape the input RDD into a PairRDD keyed by product:

val pairs = rdd
    .map(rec => (rec(2), (rec(0), rec(1))))
    .partitionBy(new HashPartitioner(nPartitions))

Finally, we can join the two:

def indicesToVec(n: Int)(indices: Iterable[Long]): Vector = {
    // Dedupe for the same reason as above: Vectors.sparse rejects repeated indices
    Vectors.sparse(n, indices.toSeq.distinct.map(x => (x.toInt, 1.0)))
}

pairs.join(prodMapRDD)
   .values
   .groupByKey
   .mapValues(indicesToVec(nProducts.toInt))
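Compared with the broadcast variant, this trades a cheap map-side lookup for an extra shuffle, but the product dictionary stays distributed, so it keeps working when that dictionary no longer fits comfortably in executor memory.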

The original question asks for the easiest way to tell categorical features apart from non-categorical ones.

In Spark ML, you can use VectorIndexer's setMaxCategories method, with which you do not have to specify the fields yourself: any feature whose cardinality is lower than or equal to the given number (here, 10) is treated as categorical.

import org.apache.spark.ml.feature.VectorIndexer

val indexer = new VectorIndexer()
    .setInputCol("features")
    .setOutputCol("indexed")
    .setMaxCategories(10)
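A quick hedged usage sketch, assuming a DataFrame data with a Vector column named "features":

val indexerModel = indexer.fit(data)
val indexedData = indexerModel.transform(data)

// categoryMaps reports which feature positions were treated as categorical
println(indexerModel.categoryMaps.keys.toSeq.sorted.mkString(", "))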

Please see this reply for details.
