How to cast DataFrame with Vector columns into RDD

Submitted by 被刻印的时光 ゝ on 2019-11-29 11:55:13

Your stack trace says:

File "/usr/lib/spark/python/pyspark/mllib/__init__.py", line 25, in <module>
import numpy
ImportError: ('No module named numpy', ...

numpy is a dependency of mllib. Make sure numpy is installed on the driver, and since your code is shipped to the executors, make sure every worker node can import numpy too.
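
As a quick sanity check (a sketch, not from the original answer, assuming you're in a pyspark shell where sc already exists), you can run a trivial job that imports numpy on the executors; if any worker is missing it, the job fails with the same ImportError as above:

def numpy_version(_):
    # Runs on an executor; raises ImportError there if numpy is absent.
    import numpy
    return numpy.__version__

# Spread a few tasks across the cluster and collect the versions seen.
print(sorted(set(sc.parallelize(range(100), 10).map(numpy_version).collect())))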

The easiest way to get numpy is to install Anaconda.

I'm going to add another answer here which isn't related to the error, simply because I've lost track of how many times I've Googled it.

Simply put, the easiest way is to create a udf that extracts the first (or nth) element of a DenseVector.

Simple example (using pyspark shell):

from pyspark import SparkContext
from pyspark.sql import Row
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors

# Build a small DataFrame with an id column and a DenseVector column.
FeatureRow = Row('id', 'features')
data = sc.parallelize([(0, Vectors.dense([9.7, 1.0, -3.2])),
                       (1, Vectors.dense([2.25, -11.1, 123.2])),
                       (2, Vectors.dense([-7.2, 1.0, -3.2]))])
df = data.map(lambda r: FeatureRow(*r)).toDF()

# The udf indexes into the DenseVector; here it pulls out the element
# at index 1. Swap the index for whichever component you need.
vector_udf = udf(lambda vector: float(vector[1]), DoubleType())

df.withColumn('feature_1', vector_udf(df.features)).first()
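
A side note that goes beyond the original answer: on Spark 3.0 and later you can skip the Python udf entirely with the built-in vector_to_array, which turns the Vector column into an ordinary array column you can index:

from pyspark.ml.functions import vector_to_array

# Convert the Vector column to an array column, then index it.
# getItem(1) pulls out the element at index 1, same as the udf above.
df.withColumn('feature_1', vector_to_array('features').getItem(1)).first()

Because this stays inside Spark's built-in functions, it avoids the Python serialization overhead that a udf incurs.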