Question
Is there a general explanation why Spark needs so much more time to calculate the maximum value of a column? I imported the Kaggle Quora training set (over 400,000 rows), and I like what Spark is doing when it comes to row-wise feature extraction. But now I want to scale a column 'manually': find the maximum value of a column and divide by that value. I tried the solutions from Best way to get the max value in a Spark dataframe column and https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html. I also tried df.toPandas() and then calculating the max in pandas (you guessed it, df.toPandas() took a long time). The only thing I did not try yet is the RDD way.
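The aggregation approach I tried looks roughly like this (a minimal sketch; the column name "feature" is just a placeholder for one of my extracted features):

```python
from pyspark.sql import functions as F

# df is the DataFrame holding the Quora training set;
# "feature" is a placeholder name for the numeric column to scale.
col_max = df.agg(F.max("feature")).collect()[0][0]
df_scaled = df.withColumn("feature_scaled", F.col("feature") / col_max)
```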
Before I provide some test code (I have to find out how to generate dummy data in spark), I'd like to know
- can you give me a pointer to an article discussing this difference?
- is spark more sensitive to memory constraints on my computer than pandas?
Answer 1:
As @MattR has already said in the comment - you should use Pandas unless there's a specific reason to use Spark.
Usually you don't need Apache Spark unless you run into a MemoryError with Pandas. But if one server's RAM is not enough, then Apache Spark is the right tool for you. Apache Spark has an overhead, because it first needs to split your data set, then process those distributed chunks, then join the processed pieces, collect the result on one node and return it to you.
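For comparison, here is a minimal sketch of the same max-and-scale done in pandas, assuming the data fits into one machine's memory (file and column names are placeholders):

```python
import pandas as pd

# Placeholder file/column names; everything runs as a single in-memory pass,
# with none of the split/distribute/collect steps described above.
pdf = pd.read_csv("train.csv")
col_max = pdf["feature"].max()
pdf["feature_scaled"] = pdf["feature"] / col_max
```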
Answer 2:
@MaxU, @MattR, I found an intermediate solution that also makes me reassess Spark's laziness and understand the problem better.
sc.accumulator helps me define a global variable, and with a separate AccumulatorParam object I can calculate the maximum of the column on the fly.
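A minimal sketch of what I mean, assuming an existing SparkContext sc, a DataFrame df, and a placeholder numeric column "feature":

```python
from pyspark import AccumulatorParam

class MaxAccumulatorParam(AccumulatorParam):
    """Accumulator that keeps a running maximum instead of a sum."""
    def zero(self, initial_value):
        return initial_value

    def addInPlace(self, v1, v2):
        return max(v1, v2)

# Start from -inf so any real value replaces it
max_acc = sc.accumulator(float("-inf"), MaxAccumulatorParam())

# foreach is an action, so the maximum is updated while the rows are processed
df.rdd.foreach(lambda row: max_acc.add(row["feature"]))

print(max_acc.value)
```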
While testing this I noticed that Spark is even lazier than expected, so the part of my original post 'I like what spark is doing when it comes to rowwise feature extraction' boils down to 'I like that Spark is doing nothing quite fast'.
On the other hand, a lot of the time spent on calculating the maximum of the column was most likely spent on calculating the intermediate values.
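A small sketch of the laziness effect I am describing (column names are placeholders for the Quora data):

```python
from pyspark.sql import functions as F

# Transformations return immediately; nothing has been computed yet
df2 = df.withColumn("q1_len", F.length("question1"))

# Only the action below triggers a job, so the "fast" feature extraction
# actually runs here, together with the max aggregation
df2.agg(F.max("q1_len")).show()
```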
Thanks for your input; this topic really got me much further in understanding Spark.
Source: https://stackoverflow.com/questions/43685509/why-is-pyspark-so-much-slower-in-finding-the-max-of-a-column