Question
Before I used VectorAssembler() to consolidate some OneHotEncoded categorical features, my data frame looked like this:
| Numerical| HotEncoded1| HotEncoded2|
| 14460.0| (44,[5],[1.0])| (3,[0],[1.0])|
| 14460.0| (44,[9],[1.0])| (3,[0],[1.0])|
| 15181.0| (44,[1],[1.0])| (3,[0],[1.0])|
The first column is a numerical column and the other two columns represent the transformed data set for OneHotEncoded categorical features. After applying VectorAssembler(), my output becomes:
[(48,[0,1,9],[14460.0,1.0,1.0])]
[(48,[0,3,25],[12827.0,1.0,1.0])]
[(48,[0,1,18],[12828.0,1.0,1.0])]
I am unsure of what these numbers mean and cannot make sense of this transformed data set. Some clarification on what this output means would be great!
Answer 1:
This output is not specific to VectorAssembler. It is just a string representation of o.a.s.ml.linalg.SparseVector (o.a.s.mllib.linalg.SparseVector in Spark < 2.0) with:
- leading number representing the length of a vector
- the first set of numbers in brackets is a list of non-zero indices
- the second set of numbers in brackets is a list of values corresponding to the indices
So (48,[0,1,9],[14460.0,1.0,1.0]) represents a vector of length 48, with three non-zero entries:
- 14460.0 at the 0th position
- 1.0 at the 1st position
- 1.0 at the 9th position
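To make the notation concrete, here is a minimal sketch in plain Python (no Spark required) that expands the (size, indices, values) triple into an ordinary dense list. The helper name sparse_to_dense is made up for illustration and is not part of Spark's API.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a plain list of floats."""
    dense = [0.0] * size              # every position starts out as zero
    for i, v in zip(indices, values):
        dense[i] = v                  # fill in only the listed non-zero entries
    return dense


row = sparse_to_dense(48, [0, 1, 9], [14460.0, 1.0, 1.0])
print(len(row))    # 48
print(row[:10])    # [14460.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```

The remaining 38 positions are all 0.0, which is why the sparse form only lists three entries.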
Pretty much the same description applies to HotEncoded1 and HotEncoded2, while Numerical is just a scalar. Without seeing the metadata and constructors it is not possible to say much more, but the encoded variables should have either 44 and 3 levels or 45 and 4 levels, depending on the dropLast parameter (with dropLast enabled, the last level is dropped, so the vectors are one shorter than the number of levels).
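The length of 48 follows from concatenating the inputs in column order: 1 slot for Numerical, 44 for HotEncoded1 and 3 for HotEncoded2. A rough sketch of that index shifting in plain Python, assuming each vector column is given as a (size, indices, values) tuple; the assemble function is a hypothetical stand-in for illustration, not Spark's actual implementation:

```python
def assemble(*columns):
    """Concatenate scalars and (size, indices, values) triples into one sparse triple."""
    offset, indices, values = 0, [], []
    for col in columns:
        if isinstance(col, tuple):
            # Sparse vector column: shift its indices by the current offset.
            size, idx, vals = col
            indices.extend(offset + i for i in idx)
            values.extend(vals)
            offset += size
        else:
            # Plain numeric column: occupies exactly one slot.
            if col != 0:
                indices.append(offset)
                values.append(float(col))
            offset += 1
    return offset, indices, values


# First row of the input table from the question.
assembled = assemble(14460.0, (44, [5], [1.0]), (3, [0], [1.0]))
print(assembled)   # (48, [0, 6, 45], [14460.0, 1.0, 1.0])
```

Under this layout, index 0 is the Numerical column, indices 1 to 44 belong to HotEncoded1, and indices 45 to 47 belong to HotEncoded2.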
Source: https://stackoverflow.com/questions/38236389/understanding-representation-of-vector-column-in-spark-sql