Question
I am writing a project for Spark 1.4 in Scala and am currently deciding between converting my initial input data into spark.mllib.linalg.Vector and scala.immutable.Vector, which I later want to work with in my algorithm. Could someone briefly explain the difference between the two and in what situation one would be more useful than the other?
Thank you.
Answer 1:
spark.mllib.linalg.Vector is designed for linear algebra applications. mllib provides two different implementations: DenseVector and SparseVector. While you have access to useful methods like norm or sqdist, it is rather limited otherwise.
Like all data structures from org.apache.spark.mllib.linalg, it can store only 64-bit floating point numbers (scala.Double).
If you plan to use mllib, then spark.mllib.linalg.Vector is pretty much your only option. All the remaining data structures from mllib, both local and distributed, are built on top of org.apache.spark.mllib.linalg.Vector.
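For illustration, here is a minimal sketch of the two local implementations and the helper methods mentioned above, assuming the spark-mllib 1.4 artifact is on the classpath (no SparkContext is needed for local vectors):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object MllibVectorSketch extends App {
  // Dense vector: stores every entry explicitly.
  val dense: Vector = Vectors.dense(1.0, 0.0, 3.0)

  // Sparse vector: size 3, non-zero entries at indices 0 and 2.
  // Represents the same values as `dense`.
  val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

  // The handful of linear-algebra helpers available on the companion object:
  val l2 = Vectors.norm(dense, 2.0)      // Euclidean norm = sqrt(1 + 0 + 9)
  val d  = Vectors.sqdist(dense, sparse) // squared distance; 0.0 here

  // Entries are always scala.Double; there is no Vector[Int] or Vector[String].
  println(s"norm = $l2, sqdist = $d")
}
```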
Otherwise, scala.immutable.Vector is probably a much better choice. It is a general-purpose, dense data structure.
It can store objects of any type, so you can have a Vector[String], for example.
Since it is Traversable, you have access to all the expected methods like map, flatMap, reduce, fold, filter, etc.
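A quick sketch of how that plays out in practice (the element values are arbitrary examples):

```scala
// scala.collection.immutable.Vector is available without an import.
val words: Vector[String] = Vector("spark", "scala", "mllib")

// Arbitrary element types and the full collection API:
val upper = words.map(_.toUpperCase)            // Vector("SPARK", "SCALA", "MLLIB")
val short = words.filter(_.startsWith("s"))     // Vector("spark", "scala")
val total = words.foldLeft(0)(_ + _.length)     // 15
val chars = words.flatMap(_.toSeq)              // all individual characters

println(s"$upper / $short / $total / ${chars.size}")
```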
Edit: If you need algebraic operations and don't use any of the data structures from org.apache.spark.mllib.linalg.distributed, you may prefer breeze.linalg.Vector over spark.mllib.linalg.Vector. It supports a larger set of algebraic methods, including the dot product, and provides a typical collection API.
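As a sketch of that richer API, assuming Breeze is on the classpath (spark-mllib 1.4 already pulls it in as a dependency):

```scala
import breeze.linalg.DenseVector

object BreezeVectorSketch extends App {
  val v1 = DenseVector(1.0, 2.0, 3.0)
  val v2 = DenseVector(4.0, 5.0, 6.0)

  // Algebraic operators that spark.mllib.linalg.Vector does not expose:
  val sum    = v1 + v2     // DenseVector(5.0, 7.0, 9.0)
  val scaled = v1 * 2.0    // DenseVector(2.0, 4.0, 6.0)
  val dot    = v1 dot v2   // 4 + 10 + 18 = 32.0

  // Plus a collection-like API:
  val doubled = v1.map(_ * 2.0)

  println(s"sum = $sum, dot = $dot, doubled = $doubled")
}
```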
Source: https://stackoverflow.com/questions/31255756/difference-between-spark-vectors-and-scala-immutable-vector