Question
The Spark documentation contains a PySpark example for its OneHotEncoder:
from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = spark.createDataFrame([
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, "a"),
    (4, "a"),
    (5, "c")
], ["id", "category"])

stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)

encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.transform(indexed)
encoded.show()
I was expecting the column categoryVec to look like this:
[0.0, 0.0]
[1.0, 0.0]
[0.0, 1.0]
[0.0, 0.0]
[0.0, 0.0]
[0.0, 1.0]
But categoryVec actually looks like this:
(2, [0], [1.0])
(2, [], [])
(2, [1], [1.0])
(2, [0], [1.0])
(2, [0], [1.0])
(2, [1], [1.0])
What does this mean? How should I read this output, and what is the reasoning behind this somewhat odd format?
Answer 1:
Nothing odd here. These are just SparseVectors, where:
- The first element is the size of the vector.
- The first array ([...]) is the list of indices that hold non-zero values.
- The second array is the list of values at those indices.

Any index not explicitly listed is 0.0, as the sketch below illustrates.
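For illustration, here is a minimal sketch, independent of the question's DataFrame, that builds the same sparse values directly with pyspark.ml.linalg.Vectors and converts them to dense arrays so the correspondence is visible:

from pyspark.ml.linalg import Vectors

# (2, [1], [1.0]) reads as: a vector of size 2 with value 1.0 at index 1
# and 0.0 at every index not listed -- i.e. the dense vector [0.0, 1.0].
sv = Vectors.sparse(2, [1], [1.0])
print(sv)            # (2,[1],[1.0])
print(sv.toArray())  # [0. 1.]

# (2, [], []) is a size-2 vector with no non-zero entries at all,
# which is just the dense vector [0.0, 0.0].
empty = Vectors.sparse(2, [], [])
print(empty.toArray())  # [0. 0.]

So the rows in categoryVec are exactly the dense vectors you expected, stored in a compressed form that only records the non-zero positions.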
Source: https://stackoverflow.com/questions/49632830/pyspark-output-of-onehotencoder-looks-odd