Spark ML VectorAssembler returns strange output

て烟熏妆下的殇ゞ submitted on 2019-11-26 11:34:34

Question


I am experiencing a very strange behaviour from VectorAssembler and I was wondering if anyone else has seen this.

My scenario is pretty straightforward. I parse data from a CSV file where I have some standard Int and Double fields and I also calculate some extra columns. My parsing function returns this:

val joinedCounts = countPerChannel ++ countPerSource // two arrays of Doubles joined
(label, orderNo, pageNo, Vectors.dense(joinedCounts))

My main function uses the parsing function like this:

val parsedData = rawData.filter(row => row != header).map(parseLine)
val data = sqlContext.createDataFrame(parsedData).toDF("label", "orderNo", "pageNo", "joinedCounts")

I then use a VectorAssembler like this:

val assembler = new VectorAssembler()
                           .setInputCols(Array("orderNo", "pageNo", "joinedCounts"))
                           .setOutputCol(\"features\")

val assemblerData = assembler.transform(data)

So when I print a row of my data before it goes into the VectorAssembler it looks like this:

[3.2,17.0,15.0,[0.0,0.0,0.0,0.0,3.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,2.0]]

After the transform function of VectorAssembler I print the same row of data and get this:

[3.2,(18,[0,1,6,9,14,17],[17.0,15.0,3.0,1.0,4.0,2.0])]

What on earth is going on? What has the VectorAssembler done? I've double-checked all the calculations and even followed the simple Spark examples, and I cannot see what is wrong with my code. Can you?


Answer 1:


There is nothing strange about the output. Your vector seems to have lots of zero elements, so Spark used its sparse representation.

To explain further:

It seems your vector is composed of 18 elements (its dimension).

The indices [0,1,6,9,14,17] of the vector contain the non-zero elements, which in order are [17.0,15.0,3.0,1.0,4.0,2.0].
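To make this concrete, here is a minimal plain-Scala sketch (no Spark required) that decodes the printed sparse triple back into the dense row it represents; the variable names are mine, chosen for illustration:

```scala
// The three parts of the printed sparse vector (size, indices, values).
val size = 18
val indices = Array(0, 1, 6, 9, 14, 17)
val values  = Array(17.0, 15.0, 3.0, 1.0, 4.0, 2.0)

// Start from a fully zeroed dense array and fill in the non-zero slots.
val dense = Array.fill(size)(0.0)
for ((i, v) <- indices.zip(values)) dense(i) = v

println(dense.mkString("[", ",", "]"))
// → [17.0,15.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,2.0]
```

Note that this matches the assembled row: orderNo (17.0) and pageNo (15.0) land at indices 0 and 1, and the 16 joinedCounts follow at indices 2 through 17.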

A sparse vector representation saves space, which makes computation easier and faster. More on the sparse representation here.

Of course, you can convert that sparse representation to a dense one, but it comes at a cost.
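The encoding direction can be sketched the same way: keep only the (index, value) pairs for non-zero entries. This is plain Scala mimicking what a sparse vector stores, not Spark's actual implementation:

```scala
// The dense row assembled by VectorAssembler.
val denseRow = Array(17.0, 15.0, 0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0,
                     1.0, 0.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 2.0)

// Collect (index, value) pairs only where the value is non-zero.
val nonZero = denseRow.zipWithIndex.collect { case (v, i) if v != 0.0 => (i, v) }
val (indices, values) = nonZero.unzip

println(s"(${denseRow.length},${indices.mkString("[", ",", "]")},${values.mkString("[", ",", "]")})")
// → (18,[0,1,6,9,14,17],[17.0,15.0,3.0,1.0,4.0,2.0])
```

Here 18 slots are stored as 6 index/value pairs; in Spark you would build the same thing with Vectors.sparse(size, indices, values), and calling .toDense on it materialises all 18 slots again.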

In case you are interested in getting feature importances, I advise you to take a look at this.



Source: https://stackoverflow.com/questions/40505805/spark-ml-vectorassembler-returns-strange-output
