How do I convert multiple Pandas DFs into a single Spark DF?

Question

I have several Excel files that I need to load and pre-process before loading them into a Spark DF. I have a list of these files that need to be processed. I do something like this to read them in:

file_list_rdd = sc.emptyRDD()

for file_path in file_list:
    current_file_rdd = sc.binaryFiles(file_path)
    print(current_file_rdd.count())
    file_list_rdd = file_list_rdd.union(current_file_rdd)

I then have some mapper function that turns file_list_rdd from a set of (path, bytes) tuples to (path, Pandas DataFrame) tuples. This allows me to use Pandas to read the Excel file and to manipulate the files so that they're uniform before making them into a Spark DataFrame.
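For reference, a mapper of the kind described above could look roughly like the sketch below. It is not the asker's actual code: the name bytes_to_pandas is made up for illustration, and it assumes each file can be parsed by pandas.read_excel with default options (which also requires an Excel engine such as openpyxl on every executor).

```python
import io

import pandas as pd


def bytes_to_pandas(item):
    """Turn one (path, bytes) pair from sc.binaryFiles into (path, pandas DataFrame)."""
    path, content = item
    # pandas.read_excel accepts a file-like object, so wrap the raw bytes
    pdf = pd.read_excel(io.BytesIO(content))
    # Any per-file clean-up (renaming columns, fixing dtypes, ...) would go here
    return path, pdf


# Hypothetical usage on the RDD built above:
# processed_excel_rdd = file_list_rdd.map(bytes_to_pandas)
```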

How do I take an RDD of (file path, Pandas DF) tuples and turn it into a single Spark DF? I'm aware of functions that can do a single transformation, but not one that can do several.

My first attempt was something like this:

sqlCtx = SQLContext(sc)

def convert_pd_df_to_spark_df(item):
    return sqlCtx.createDataFrame(item[0][1])

processed_excel_rdd.map(convert_pd_df_to_spark_df)

I'm guessing that didn't work because sqlCtx isn't distributed with the computation (it's a guess because the stack trace doesn't make much sense to me).

Thanks in advance for taking the time to read :).

Answer 1:

This can be done by converting to Arrow RecordBatches, which Spark > 2.3 can process into a DF very efficiently.

https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5

This snippet monkey-patches Spark to add a createFromPandasDataframesRDD method. The method accepts an RDD of pandas DFs (assumed to have the same columns) and returns a single Spark DF.
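On Spark 3.0 and later, a related Arrow-backed route that needs no monkey-patching is DataFrame.mapInPandas. This is a different technique from the gist, shown only as a sketch: the helper names are hypothetical, the schema string must be supplied by you and match the pre-processed frames, pyarrow must be installed, and every path must be readable from the executors.

```python
import pandas as pd


def load_excel_batches(batches):
    """Receive batches of file paths as pandas DataFrames; yield one frame per file."""
    for batch in batches:
        for path in batch["path"]:
            # Pre-processing that makes the files uniform would go here
            yield pd.read_excel(path)


def excel_files_to_spark(spark, file_list, schema):
    """Build a single Spark DataFrame from a list of Excel file paths.

    `spark` is an existing SparkSession; `schema` is a DDL string such as
    "id long, name string" (an assumed example) describing the output columns.
    """
    paths = spark.createDataFrame([(p,) for p in file_list], ["path"])
    return paths.mapInPandas(load_excel_batches, schema=schema)
```

With only a handful of paths, Spark may keep them all in one partition; repartitioning the paths DataFrame before mapInPandas is one way to spread the reads across executors.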

Answer 2:

Why not make a list of the dataframes or filenames and then call union in a loop? Something like this:

If pandas dataframes:

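dfs = [df1, df2, df3, df4]  # pandas DataFrames to combine
sdf = None
for df in dfs:
    # Compare against None explicitly instead of relying on DataFrame truthiness
    if sdf is not None:
        sdf = sdf.union(spark.createDataFrame(df))
    else:
        sdf = spark.createDataFrame(df)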

If filenames:

import pandas as pd

names = [name1, name2, name3, name4]  # paths to the Excel files
sdf = None
for name in names:
    if sdf is not None:
        sdf = sdf.union(spark.createDataFrame(pd.read_excel(name)))
    else:
        sdf = spark.createDataFrame(pd.read_excel(name))
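
Since the pandas DataFrames in this approach all sit on the driver anyway, another option is to concatenate them in pandas and call createDataFrame once. A minimal sketch, assuming the frames share the same columns and fit in driver memory together; combine_pandas and pandas_list_to_spark are illustrative names, and spark is an existing SparkSession:

```python
import pandas as pd


def combine_pandas(dfs):
    """Stack pandas DataFrames vertically, discarding their original indexes."""
    return pd.concat(dfs, ignore_index=True)


def pandas_list_to_spark(spark, dfs):
    """Convert a list of pandas DataFrames into one Spark DataFrame."""
    return spark.createDataFrame(combine_pandas(dfs))
```

Note that DataFrame.union in the loops above matches columns by position, not by name; if the column order can differ between files, unionByName is the name-based variant.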

Answer 3:

I solved this by writing a function like this:

from pyspark.sql import Row

def pd_df_to_row(rdd_row):
    # Each RDD element is a (file path, pandas DataFrame) tuple
    key = rdd_row[0]  # file path (not used below)
    pd_df = rdd_row[1]

    rows = list()
    for index, series in pd_df.iterrows():
        # Export one row of the df as a dict with string keys,
        # then pass the unpacked dict into the Row constructor
        row_dict = {str(k): v for k, v in series.to_dict().items()}
        rows.append(Row(**row_dict))

    return rows

You can invoke it by calling something like:

processed_excel_rdd = processed_excel_rdd.flatMap(pd_df_to_row)

processed_excel_rdd is now an RDD of Spark Row objects. You can now call:

processed_excel_rdd.toDF()

There's probably something more efficient than the Series -> dict -> Row conversion, but this got me through.
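
One possibly cheaper variant, offered as a sketch and not as a benchmarked improvement: skip the per-row Series and dict, emit plain tuples with DataFrame.itertuples, and pass the column names to toDF. It assumes every pandas DF has the same columns in the same order; the function names are illustrative.

```python
import pandas as pd


def pd_df_to_tuples(rdd_row):
    """Flatten one (path, pandas DataFrame) pair into a list of plain tuples."""
    _path, pd_df = rdd_row
    # index=False drops the pandas index; name=None yields regular tuples
    return list(pd_df.itertuples(index=False, name=None))


def rdd_to_spark_df(processed_excel_rdd, columns):
    """`columns` is a list of column-name strings shared by every file."""
    # RDD.toDF is available once a SparkSession (or SQLContext) exists
    return processed_excel_rdd.flatMap(pd_df_to_tuples).toDF(columns)
```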

Source: https://stackoverflow.com/questions/43457596/how-to-i-convert-multiple-pandas-dfs-into-a-single-spark-df

Tags

pandas

apache-spark

pyspark