I have an RDD with a tuple of values (String, SparseVector) and I want to create a DataFrame using the RDD. To get a (labe
@zero323's answer https://stackoverflow.com/a/32745924/1333621 makes sense, and I wish it had worked for me, but the RDD underlying the DataFrame returned by sqlContext.createDataFrame(temp_rdd, schema) still contained SparseVector values. I had to do the following to convert them to DenseVector. If someone has a shorter or better way, I'd like to know.
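from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, DoubleType
from pyspark.mllib.linalg import SparseVector, DenseVector, VectorUDT

# sc and sqlContext are assumed to be provided by the PySpark shell
temp_rdd = sc.parallelize([
    (0.0, SparseVector(4, {1: 1.0, 3: 5.5})),
    (1.0, SparseVector(4, {0: -1.0, 2: 0.5}))])
schema = StructType([
    StructField("label", DoubleType(), True),
    StructField("features", VectorUDT(), True)
])
temp_rdd.toDF(schema).printSchema()
# Applying the schema directly keeps the vectors sparse
df_w_ftr = temp_rdd.toDF(schema)
print('original convertion method:', df_w_ftr.take(5))
print('\n')
# Rebuild each row with a DenseVector made from the sparse vector's array
temp_rdd_dense = temp_rdd.map(lambda x: Row(label=x[0], features=DenseVector(x[1].toArray())))
print(type(temp_rdd_dense), type(temp_rdd))
print('using map and toArray:', temp_rdd_dense.take(5))
temp_rdd_dense.toDF().show()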
root
|-- label: double (nullable = true)
|-- features: vector (nullable = true)
original convertion method: [Row(label=0.0, features=SparseVector(4, {1: 1.0, 3: 5.5})), Row(label=1.0, features=SparseVector(4, {0: -1.0, 2: 0.5}))]
using map and toArray: [Row(features=DenseVector([0.0, 1.0, 0.0, 5.5]), label=0.0), Row(features=DenseVector([-1.0, 0.0, 0.5, 0.0]), label=1.0)]
+------------------+-----+
|          features|label|
+------------------+-----+
| [0.0,1.0,0.0,5.5]|  0.0|
|[-1.0,0.0,0.5,0.0]|  1.0|
+------------------+-----+
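
For anyone unsure what the map step does to each row, here is a minimal pure-Python sketch of the sparse-to-dense expansion that DenseVector(x[1].toArray()) performs. It does not use Spark at all, and sparse_to_dense is a hypothetical helper written for illustration, not part of the PySpark API. The (size, {index: value}) layout just mirrors the SparseVector constructor arguments used above.

```python
def sparse_to_dense(size, entries):
    """Expand {index: value} entries into a dense list of length `size`.

    Hypothetical helper for illustration only; in PySpark the same
    expansion is done by SparseVector.toArray().
    """
    dense = [0.0] * size  # every position not listed in `entries` is zero
    for index, value in entries.items():
        if not 0 <= index < size:
            raise IndexError("index %d out of range for size %d" % (index, size))
        dense[index] = float(value)
    return dense


# Same data as temp_rdd above, written as (label, size, entries) tuples
rows = [
    (0.0, 4, {1: 1.0, 3: 5.5}),
    (1.0, 4, {0: -1.0, 2: 0.5}),
]

# Equivalent of the map step: keep the label, densify the features
dense_rows = [(label, sparse_to_dense(size, entries))
              for label, size, entries in rows]

for label, features in dense_rows:
    print(label, features)
```

The dense lists hold the same values as the features column in the table above, which is a quick way to check that only the representation changed and not the data.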