Question
Imagine you are loading a large dataset via the SparkContext and Hive, so the dataset is distributed across your Spark cluster: for instance, observations (values + timestamps) for thousands of variables.
You would then use map/reduce methods or aggregations to organize and analyze your data, for instance grouping by variable name.
Once grouped, you could get all observations (values) for each variable as a time-series DataFrame. If you now use DataFrame.toPandas:
def myFunction(data_frame):
    data_frame.toPandas()

df = sc.load....
df.groupBy('var_name').mapValues(_.toDF).map(myFunction)
- is this converted to a Pandas DataFrame (per variable) on each worker node, or
- are Pandas DataFrames always on the driver node, so that the data is transferred from the worker nodes to the driver?
Answer 1:
There is nothing special about a Pandas DataFrame in this context.

- If a DataFrame is created by calling the toPandas method on a pyspark.sql.dataframe.DataFrame, this collects the data and creates a local Python object on the driver.
- If a pandas.core.frame.DataFrame is created inside an executor process (for example in mapPartitions), you simply get an RDD[pandas.core.frame.DataFrame]. There is no distinction between Pandas objects and, say, a tuple.
- Finally, the pseudocode in your example couldn't work, because you cannot create (in a sensible way) a Spark DataFrame (I assume that is what you mean by _.toDF) inside an executor thread.
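To illustrate the second point, here is a minimal pure-pandas sketch (it assumes no running Spark cluster) that simulates what mapPartitions does: each "partition" is just a list of rows, and the mapped function builds one pandas DataFrame per partition. The result has the same shape as RDD[pandas.core.frame.DataFrame]: many ordinary local pandas objects, which in real Spark would live on the workers without ever being collected to the driver. The partition data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for an RDD's partitions: each inner list is
# one partition of (var_name, timestamp, value) rows held by a worker.
partitions = [
    [("temp", 1, 20.5), ("temp", 2, 21.0)],
    [("pressure", 1, 101.3), ("pressure", 2, 101.1)],
]

def to_pandas_frame(rows):
    # In real Spark this function would run inside an executor process
    # (via rdd.mapPartitions); its result is a plain local pandas
    # object, not a Spark DataFrame.
    return pd.DataFrame(rows, columns=["var_name", "timestamp", "value"])

# Analogue of rdd.mapPartitions(to_pandas_frame): one pandas DataFrame
# per partition, i.e. the shape of RDD[pandas.core.frame.DataFrame].
frames = [to_pandas_frame(rows) for rows in partitions]

print(type(frames[0]))  # <class 'pandas.core.frame.DataFrame'>
print(len(frames))      # 2 (one frame per partition)
```

In current Spark (3.x), the idiomatic way to get a pandas DataFrame per group on the executors is df.groupBy(...).applyInPandas(func, schema), which hands each group to func as a pandas DataFrame without a round trip through the driver.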
Source: https://stackoverflow.com/questions/39142549/is-dataframe-topandas-always-on-driver-node-or-on-worker-nodes