PySpark: Output of OneHotEncoder looks odd [duplicate]

こ雲淡風輕ζ submitted on 2020-03-25 18:23:16

Question


The Spark documentation contains a PySpark example for its OneHotEncoder:

from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = spark.createDataFrame([
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, "a"),
    (4, "a"),
    (5, "c")
], ["id", "category"])

stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)

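encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
# Spark 2.x: OneHotEncoder is a Transformer, so transform() can be called directly.
# Spark 3.0+: OneHotEncoder is an Estimator; use encoder.fit(indexed).transform(indexed).
encoded = encoder.transform(indexed)
encoded.show()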

I was expecting the column categoryVec to look like this:

[0.0, 0.0]
[1.0, 0.0]
[0.0, 1.0]
[0.0, 0.0]
[0.0, 0.0]
[0.0, 1.0]

But categoryVec actually looks like this:

(2, [0], [1.0])
(2, [], [])
(2, [1], [1.0])
(2, [0], [1.0])
(2, [0], [1.0])
(2, [1], [1.0])

What does this mean? How should I read this output, and what is the reasoning behind this somewhat odd format?


Answer 1:


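Nothing odd here. These are just SparseVectors, printed as (size, indices, values), where:

  • The first element is the size of the vector.
  • The first array [...] is the list of indices that hold a non-zero value.
  • The second array is the list of values at those indices.

Entries at indices not explicitly listed are 0.0. For example, (2, [0], [1.0]) is the dense vector [1.0, 0.0], and (2, [], []) is [0.0, 0.0].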
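
To make the reading rule concrete, here is a minimal plain-Python sketch that expands each sparse triple from the question into its dense form. The helper name to_dense is hypothetical and only for illustration; it is not part of PySpark. With PySpark available, calling toArray() on a pyspark.ml.linalg.SparseVector should give the equivalent dense result.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size              # every slot defaults to 0.0
    for i, v in zip(indices, values):
        dense[i] = v                  # only the listed indices carry a value
    return dense


# The six rows of categoryVec from the question, in order.
rows = [
    (2, [0], [1.0]),
    (2, [], []),
    (2, [1], [1.0]),
    (2, [0], [1.0]),
    (2, [0], [1.0]),
    (2, [1], [1.0]),
]

dense_rows = [to_dense(*row) for row in rows]
for row, dense in zip(rows, dense_rows):
    print(row, "->", dense)
```

The all-zero row belongs to category "b": StringIndexer assigns indices by descending frequency (a=0, c=1, b=2), and OneHotEncoder drops the last index by default (dropLast=True), which is also why the vectors have size 2 for three categories.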



Source: https://stackoverflow.com/questions/49632830/pyspark-output-of-onehotencoder-looks-odd
