Question
Can anyone please explain how Spark DataFrames are better than Pandas DataFrames in terms of execution time? I'm dealing with moderately sized data and applying transformations driven by Python functions.
For example, I have a column with numbers from 1 to 100,000 in my dataset and want to perform a basic numeric operation: creating a new column that is the cube of the existing numeric column.
from datetime import datetime
import numpy as np
import pandas as pd
def cube(num):
    return num**3
array_of_nums = np.arange(0,100000)
dataset = pd.DataFrame(array_of_nums, columns = ["numbers"])
start_time = datetime.now()
# Some complex transformations...
dataset["cubed"] = [cube(x) for x in dataset.numbers]
end_time = datetime.now()
print("Time taken :", (end_time-start_time))
The output is
Time taken : 0:00:00.109349
If I use a Spark DataFrame with 10 worker nodes, can I expect the following result? (which is 1/10th of the time taken by the Pandas DataFrame)
Time taken : 0:00:00.010935
Answer 1:
1) A Pandas DataFrame is not distributed, whereas Spark's DataFrame is distributed. -> Hence you won't get the benefit of parallel processing with a Pandas DataFrame, and processing will be slower for large amounts of data.
2) A Spark DataFrame gives you fault tolerance (it's resilient), while a Pandas DataFrame does not. -> Hence if your data processing is interrupted or fails midway, Spark can regenerate the failed result set from the lineage (from the DAG). Fault tolerance is not supported in Pandas; you would need to implement your own framework to provide it.
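For illustration, a minimal PySpark sketch of the same cube transformation might look like the following. This is only a sketch under assumptions not in the original post: it assumes a working Spark environment, uses spark.range to build the distributed column of numbers, and uses a built-in column expression instead of a Python UDF.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumes a Spark environment is available (local mode works for testing).
spark = SparkSession.builder.appName("cube-example").getOrCreate()

# Distributed DataFrame with numbers 0..99999, analogous to the Pandas example.
df = spark.range(0, 100000).withColumnRenamed("id", "numbers")

# Column expression evaluated by Spark's engine, with no per-row Python call.
cubed = df.withColumn("cubed", F.pow(F.col("numbers"), 3))

cubed.show(5)
Note that for a dataset this small, Spark's job scheduling and serialization overhead typically dominates, so you should generally not expect the 1/10th runtime from the question; the parallelism pays off when the data no longer fits comfortably on a single machine.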
Source: https://stackoverflow.com/questions/55912334/how-spark-dataframe-is-better-than-pandas-dataframe-in-performance