How is a Spark DataFrame better than a Pandas DataFrame in performance? [closed]


Question


Can anyone please explain how Spark DataFrames are better than Pandas DataFrames in terms of execution time? I'm dealing with data of moderate volume and performing transformations powered by Python functions.

For example, I have a numeric column with 100,000 values in my dataset and want to perform a basic numeric operation: creating a new column that is the cube of the existing numeric column.

from datetime import datetime
import numpy as np
import pandas as pd

def cube(num):
    return num**3

# 100,000 integers: 0 .. 99,999
array_of_nums = np.arange(0, 100000)

dataset = pd.DataFrame(array_of_nums, columns=["numbers"])

start_time = datetime.now()
# Some complex transformations... here, the Python function is applied row by row
dataset["cubed"] = [cube(x) for x in dataset.numbers]
end_time = datetime.now()

print("Time taken :", (end_time - start_time))

The output is

Time taken : 0:00:00.109349
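
As an aside, the list comprehension above calls the Python-level cube function once per row, which bypasses Pandas' vectorized execution path. A minimal vectorized sketch of the same step (the cubed_vec column name is just for illustration) is typically much faster:

start_time = datetime.now()
# Vectorized: the exponentiation runs in optimized NumPy code, not a Python loop
dataset["cubed_vec"] = dataset["numbers"] ** 3
end_time = datetime.now()

print("Time taken (vectorized) :", (end_time - start_time))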

If I use a Spark DataFrame with 10 worker nodes, can I expect the following result (i.e. 1/10th of the time taken by the Pandas DataFrame)?

Time taken : 0:00:00.010935

Answer 1:


1) A Pandas DataFrame is not distributed, while Spark's DataFrame is distributed. -> Hence you won't get the benefit of parallel processing with a Pandas DataFrame, and processing speed in Pandas will suffer for large amounts of data.

2) A Spark DataFrame assures you fault tolerance (it is resilient), while a Pandas DataFrame does not. -> Hence if your data processing is interrupted or fails midway, Spark can regenerate the failed result set from the lineage recorded in its DAG. Fault tolerance is not supported in Pandas; you would need to implement your own framework to assure it.
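
For comparison, here is a minimal PySpark sketch of the same transformation. The session setup and app name are illustrative assumptions, not part of the original post:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Illustrative local session; on a cluster this would be configured
# to point at your master and worker nodes.
spark = SparkSession.builder.appName("cube-example").getOrCreate()

# The same 100,000 integers, 0 .. 99,999, as a distributed DataFrame.
sdf = spark.range(0, 100000).withColumnRenamed("id", "numbers")

# A native column expression keeps the computation in the JVM;
# wrapping the Python cube() in a UDF would add per-row
# serialization overhead between Python and the JVM.
# Note: pow() returns a double.
sdf = sdf.withColumn("cubed", F.pow(F.col("numbers"), 3))

sdf.show(5)

Note, however, that on a dataset this small, Spark's job scheduling and task serialization overhead usually outweighs the computation itself, so a 10-node cluster will generally not be 10x faster than Pandas here and may well be slower. Spark's speed advantage shows up when the data no longer fits comfortably on a single machine.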



Source: https://stackoverflow.com/questions/55912334/how-spark-dataframe-is-better-than-pandas-dataframe-in-performance
