Spark dataframe to numpy array via udf or without collecting to driver

旧街凉风 submitted on 2020-04-30 09:48:46


The real-life df is a massive DataFrame that cannot be loaded into driver memory. Can this be done using a regular UDF or a pandas UDF?

# Code to generate a sample dataframe

from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType  # used by the UDF attempt further down
from pyspark.sql.types import *
import pandas as pd
import numpy as np

sample = [['123',[[0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1], [0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1]]],
      ['345',[[1,0,0,0,0,1,1,1,0,1,1,0,1,0,0,0,1,1,1,1,1,1], [0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1]]]]

df = spark.createDataFrame(sample,["id", "data"])

Here's the logic that needs to be parallelized without relying on driver memory.

Input: a Spark DataFrame. Output: a numpy array to be fed into Horovod, something like this:

pandas_df = df.toPandas() # Not possible in real life
data_array = np.asarray(list(
data_array = data_array.reshape(data_array.shape[0], data_array.shape[1], -1, 1, order='F')
data_array = data_array.reshape(data_array.shape[0],data_array.shape[1],-1,1,1,order="F").transpose(0,1,3,2,-1)
# Some more numpy specific transformations ..

Here's an approach that didn't work:

@pandas_udf(ArrayType(IntegerType()), PandasUDFType.SCALAR)
def generate_feature(x):
    data_array = np.asarray(x)
    data_array = data_array.reshape(data_array.shape[0], ..
    return pd.Series(data_array)

df = df.withColumn("data_array", generate_feature(
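
Assuming the numpy steps above act on each row independently (the first axis is never mixed with the others), one way to avoid collecting everything is to apply them batch by batch, for example on each partition or on each Horovod worker after it has read its share of the data. A minimal sketch in plain numpy; `batch_to_array` and `sample_rows` are hypothetical names, and only the two reshape/transpose steps shown above are reproduced, not the elided "more transformations":

```python
import numpy as np

def batch_to_array(rows):
    """Turn a batch of nested lists (one entry per DataFrame row) into a 5-D array.

    Assumption: every row has the same inner shape, as in the sample data.
    """
    arr = np.asarray(list(rows))  # (batch, 2, 22) for the sample data
    arr = arr.reshape(arr.shape[0], arr.shape[1], -1, 1, order="F")
    arr = arr.reshape(arr.shape[0], arr.shape[1], -1, 1, 1, order="F")
    # Axis 4 is the last axis, so this matches transpose(0, 1, 3, 2, -1).
    return arr.transpose(0, 1, 3, 2, 4)

# The "data" column of the two sample rows from the question.
sample_rows = [
    [[0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1],
     [0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1]],
    [[1,0,0,0,0,1,1,1,0,1,1,0,1,0,0,0,1,1,1,1,1,1],
     [0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1]],
]

# Transforming everything at once ...
whole = batch_to_array(sample_rows)

# ... gives the same values as transforming one-row batches and stitching them.
stitched = np.concatenate([batch_to_array([row]) for row in sample_rows])

print(whole.shape)  # (2, 2, 1, 22, 1)
print(np.array_equal(whole, stitched))
```

If the elided transformations do mix rows, this batch-wise equivalence no longer holds and the approach would need rethinking.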


I am working on a similar case, though with images, and I am looking at Petastorm for this. You can save your data from the RDD to Parquet format and then read it from Horovod.
- I have yet to test this.
- How to fetch the dataset in parts using Horovod ranks also needs to be tested.
Just a tip that could help.
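
The "fetch in parts using ranks" idea can be sketched without any cluster: each worker takes every `size`-th Parquet part file, starting at its own rank. This is only an illustration with a hypothetical helper `shard_for_rank` and made-up file names; in a real job the rank and size would come from `hvd.rank()` and `hvd.size()`, and Petastorm's reader exposes `cur_shard` and `shard_count` arguments for the same purpose.

```python
def shard_for_rank(paths, rank, size):
    """Return the part files that worker `rank` (out of `size` workers) should read."""
    if size <= 0 or not 0 <= rank < size:
        raise ValueError("need size > 0 and 0 <= rank < size")
    # Sort first so every worker sees the same ordering, then take a strided slice.
    return sorted(paths)[rank::size]

# Hypothetical Parquet part files written by Spark.
part_files = [f"part-{i:05d}.parquet" for i in range(10)]

# Simulate 4 workers; with Horovod this would be hvd.rank() / hvd.size().
shards = [shard_for_rank(part_files, rank, 4) for rank in range(4)]

for rank, files in enumerate(shards):
    print(rank, files)
```

The shards are disjoint and together cover every file, so no worker needs the full dataset in memory.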

