How to convert a PySpark DataFrame column to a NumPy array

Submitted by 一曲冷凌霜 on 2020-02-07 05:15:09

Question


I am trying to convert a PySpark DataFrame column with approximately 90 million rows into a NumPy array.

I need the array as input to the scipy.optimize.minimize function.

I have tried both converting to pandas and using collect(), but both methods are very time-consuming.

I am new to PySpark. If there is a faster or better approach, please help.

Thanks

This is what my DataFrame looks like:

+----------+
|Adolescent|
+----------+
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
+----------+

Answer 1:


#1

One way or another, you will have to call .collect() so that the data reaches the driver. To create a NumPy array from the PySpark DataFrame, you can use:

import numpy as np
adoles = np.array(df.select("Adolescent").collect())  # add .reshape(-1) for a 1-D array
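To see why the .reshape(-1) matters, here is a small sketch that needs only NumPy. collect() returns a list of Row objects, and a Row is a tuple subclass, so the plain 1-tuples below are used as a stand-in for the collected rows (the values are made up for illustration, not taken from the real data):

```python
import numpy as np

# Stand-in for df.select("Adolescent").collect(): PySpark Row objects are
# tuple subclasses, so 1-tuples behave the same way under np.array.
collected = [(0.0,), (0.0,), (1.5,)]

two_d = np.array(collected)  # one row per record, one column
one_d = two_d.reshape(-1)    # flattened into a plain vector

print(two_d.shape, one_d.shape)  # (3, 1) (3,)
```

scipy.optimize.minimize normally works with flat vectors, so the 1-D form is probably the one you want.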

#2

You can convert it to a pandas DataFrame using toPandas(), and then get a NumPy array from the column using .values.

pdf = df.toPandas()
adoles = pdf["Adolescent"].values  # index the pandas DataFrame (pdf), not the Spark one (df)

Or simply:

adoles = df.select("Adolescent").toPandas().values #.reshape(-1) for 1-D array
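Since toPandas() was reported as slow, one thing that may help (not part of the original answer) is enabling Apache Arrow for the Spark-to-pandas transfer. The sketch below assumes Spark 3.x, where the setting is named spark.sql.execution.arrow.pyspark.enabled (Spark 2.3/2.4 used spark.sql.execution.arrow.enabled), and that pyarrow is installed on the driver. column_to_numpy is a hypothetical helper name, and spark is assumed to be an existing SparkSession:

```python
def column_to_numpy(spark, df, column):
    """Return one column of a PySpark DataFrame as a 1-D NumPy array.

    Hypothetical helper: `spark` is an existing SparkSession and `df` a
    PySpark DataFrame that contains `column`.
    """
    # Assumption: Spark 3.x configuration key; requires pyarrow on the driver.
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    # Select only the needed column so nothing else is shipped to the driver.
    pdf = df.select(column).toPandas()
    # Indexing by column name yields a pandas Series, hence a 1-D array.
    return pdf[column].to_numpy()
```

Usage would be adoles = column_to_numpy(spark, df, "Adolescent"). The whole column still has to fit in driver memory; 90 million float64 values take about 720 MB as a NumPy array.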

#3

For distributed arrays, you can try Dask arrays.

I haven't tested this, but I assume it works the same way as NumPy (there may be inconsistencies):

import dask.array as da
adoles = da.array(df.select("Adolescent").collect()) #.reshape(-1) for 1-D array


Source: https://stackoverflow.com/questions/58162761/how-to-convert-a-pyspark-dataframe-column-to-numpy-array
