I have a DataFrame with 10K columns and 70 million rows. I want to calculate the mean and correlation of the 10K columns. I wrote the code below, but it won't work due to the 64KB generated-code-size issue (https://issues.apache.org/jira/browse/SPARK-16845).
Data:
region dept week sal  val1 val2 val3 ... val10000
US     CS   1    1    2    1    1    ... 2
US     CS   2    1.5  2    3    1    ... 2
US     CS   3    1    2    2    2.1  ... 2
US     ELE  1    1.1  2    2    2.1  ... 2
US     ELE  2    2.1  2    2    2.1  ... 2
US     ELE  3    1    2    1    2    ... 2
UE     CS   1    2    2    1    2    ... 2
Code:
from pyspark.sql import functions as func

aggList = [func.mean(col) for col in df.columns if col not in ('region', 'dept', 'week')]  # exclude the key columns
df2 = df.groupBy('region', 'dept').agg(*aggList)
Code 2:
aggList = [func.corr('sal', col).alias(col) for col in df.columns if col not in ('region', 'dept', 'week')]  # exclude the key columns
df2 = df.groupBy('region', 'dept', 'week').agg(*aggList)
This fails. Is there any alternative way to work around this bug? Has anyone tried a DataFrame with 10K columns? Are there any suggestions for improving performance?
We also ran into the 64KB issue, but in a where clause, which is filed under another bug report. The workaround we used was simply to do the operations/transformations in several steps.
In your case, this means you don't do all the aggregations in one step. Instead, loop over the relevant columns in an outer operation:
- Use select to create a temporary dataframe which contains just the columns you need for the operation.
- Use groupBy and agg like you did, except not with the full list of aggregations, but just one (or two; you could combine the mean and the corr).
- After you have references to all the temporary dataframes, use withColumn to append the aggregated columns from the temporary dataframes to a result df (a sketch of this follows below).
Due to the lazy evaluation of the Spark DAG, this is of course slower than doing it in one operation, but it should still evaluate the whole analysis in one run.
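As a rough illustration, here is a minimal sketch of that stepwise approach. It assumes your DataFrame is named df, the keys are region, dept and week, every other column is a value column, and the _mean/_corr suffixes are just made-up names. A join on the grouping keys stands in for the withColumn step (appending columns that live in a separate dataframe needs a join), and both the mean and the corr are computed per ('region', 'dept', 'week') only to keep the sketch short:

from functools import reduce
from pyspark.sql import functions as func

key_cols = ['region', 'dept', 'week']                      # assumed grouping keys
value_cols = [c for c in df.columns if c not in key_cols]  # everything else

partial_results = []
for c in value_cols:
    # Step 1: select only the columns needed for this one aggregation.
    cols_needed = key_cols + (['sal', c] if c != 'sal' else ['sal'])
    tmp = df.select(*cols_needed)
    # Step 2: only one or two aggregations per pass, so the generated code stays small.
    agg = tmp.groupBy(key_cols).agg(
        func.mean(c).alias(c + '_mean'),
        func.corr('sal', c).alias(c + '_corr'),
    )
    partial_results.append(agg)

# Step 3: combine the per-column results on the grouping keys.
result = reduce(lambda left, right: left.join(right, key_cols), partial_results)

Each pass only generates a small amount of code, which should keep every generated method well below the 64KB limit; the price is one shuffle per value column plus the joins at the end.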
Source: https://stackoverflow.com/questions/40044779/find-mean-and-corr-of-10-000-columns-in-pyspark-dataframe