PySpark - Create DataFrame from Numpy Matrix

你说的曾经没有我的故事 submitted on 2019-12-22 08:34:50

Question


I have a numpy matrix:

arr = np.array([[2,3], [2,8], [2,3],[4,5]])

I need to create a PySpark DataFrame from arr. I cannot enter the values manually because the length and values of arr change dynamically, so I need to convert arr itself into a DataFrame.

I tried the following code without success.

df= sqlContext.createDataFrame(arr,["A", "B"])

However, I get the following error.

TypeError: Can not infer schema for type: <type 'numpy.ndarray'>

Answer 1:


Hope this helps!

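import numpy as np

# sample data
arr = np.array([[2,3], [2,8], [2,3],[4,5]])

# sc is the existing SparkContext; each RDD element is one row of arr
rdd1 = sc.parallelize(arr)
# cast numpy integers to native Python ints so Spark can infer the schema
rdd2 = rdd1.map(lambda x: [int(i) for i in x])
df = rdd2.toDF(["A", "B"])
df.show()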

Output is:

+---+---+
|  A|  B|
+---+---+
|  2|  3|
|  2|  8|
|  2|  3|
|  4|  5|
+---+---+
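
A possible alternative that skips the RDD step, offered as a minimal sketch rather than the answerer's method: the error in the question arises because Spark cannot infer a schema from numpy types, and `ndarray.tolist()` turns the array into nested lists of native Python scalars that `createDataFrame` can handle. The helper names `ndarray_to_rows` and `numpy_to_spark_df` are hypothetical, and `spark` is assumed to be an already-created SparkSession.

```python
import numpy as np


def ndarray_to_rows(arr):
    """Return the rows of a 2-D ndarray as lists of native Python scalars."""
    # ndarray.tolist() converts numpy scalars (e.g. numpy.int64) to Python ints/floats
    return arr.tolist()


def numpy_to_spark_df(spark, arr, columns):
    """Build a Spark DataFrame from a 2-D ndarray.

    `spark` is assumed to be an existing SparkSession (hypothetical parameter name).
    """
    return spark.createDataFrame(ndarray_to_rows(arr), columns)


arr = np.array([[2, 3], [2, 8], [2, 3], [4, 5]])
rows = ndarray_to_rows(arr)
print(rows)
```

With a running session, `numpy_to_spark_df(spark, arr, ["A", "B"]).show()` should give the same table as above. Because the conversion does not depend on the shape of `arr`, it keeps working when the number of rows changes.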



Answer 2:


import numpy as np
from pyspark.ml.linalg import Vectors

arr = np.array([[2,3], [2,8], [2,3],[4,5]])
# spark is the existing SparkSession
# first column becomes the label; the remaining columns become a dense feature vector
rows = [(int(x[0]), Vectors.dense(x[1:])) for x in arr]
mydf = spark.createDataFrame(rows, schema=["label", "features"])
mydf.show(5)

Try this; it should work.



Source: https://stackoverflow.com/questions/48206622/pyspark-create-dataframe-from-numpy-matrix
