Question
Imagine you are loading a large dataset via the SparkContext and Hive, so the dataset is distributed across your Spark cluster: for instance, observations (values + timestamps) for thousands of variables.
You would then use map/reduce methods or aggregations to organize and analyze your data, for instance grouping by variable name.
Once grouped, you could get all observations (values) for each variable as a time-series DataFrame. If you now use DataFrame.toPandas:
def myFunction(data_frame):
    data_frame.toPandas()

df = sc.load....
df.groupBy('var_name').mapValues(_.toDF).map(myFunction)
- is this converted to a Pandas DataFrame (per variable) on each worker node, or
- are Pandas DataFrames always on the driver node, so that the data is transferred from the worker nodes to the driver?
Answer 1:
There is nothing special about a Pandas DataFrame in this context.

- If a DataFrame is created by calling the toPandas method on a pyspark.sql.dataframe.DataFrame, this collects the data and creates a local Python object on the driver.
- If a pandas.core.frame.DataFrame is created inside an executor process (for example in mapPartitions), you simply get an RDD[pandas.core.frame.DataFrame]. There is no distinction between Pandas objects and, say, a tuple.
- Finally, the pseudocode in your example couldn't work, because you cannot create (in a sensible way) a Spark DataFrame (I assume that is what you mean by _.toDF) inside an executor thread.
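To illustrate the second point, here is a minimal pure-pandas sketch (it assumes no running Spark cluster) that simulates what mapPartitions does: each "partition" is just a list of rows, and the mapped function builds one pandas DataFrame per partition. The result has the same shape as RDD[pandas.core.frame.DataFrame]: many ordinary local pandas objects, which in real Spark would live on the workers without ever being collected to the driver. The partition data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for an RDD's partitions: each inner list is
# one partition of (var_name, timestamp, value) rows held by a worker.
partitions = [
    [("temp", 1, 20.5), ("temp", 2, 21.0)],
    [("pressure", 1, 101.3), ("pressure", 2, 101.1)],
]

def to_pandas_frame(rows):
    # In real Spark this function would run inside an executor process
    # (via rdd.mapPartitions); its result is a plain local pandas
    # object, not a Spark DataFrame.
    return pd.DataFrame(rows, columns=["var_name", "timestamp", "value"])

# Analogue of rdd.mapPartitions(to_pandas_frame): one pandas DataFrame
# per partition, i.e. the shape of RDD[pandas.core.frame.DataFrame].
frames = [to_pandas_frame(rows) for rows in partitions]

print(type(frames[0]))  # <class 'pandas.core.frame.DataFrame'>
print(len(frames))      # 2 (one frame per partition)
```

In current Spark (3.x), the idiomatic way to get a pandas DataFrame per group on the executors is df.groupBy(...).applyInPandas(func, schema), which hands each group to func as a pandas DataFrame without a round trip through the driver.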
Source: https://stackoverflow.com/questions/39142549/is-dataframe-topandas-always-on-driver-node-or-on-worker-nodes