I have a DataFrame with 10K columns and 70 million rows. I want to calculate the mean and correlation of the 10K columns. I wrote the code below, but it won't work due to the 64KB generated-code-size issue (https://issues.apache.org/jira/browse/SPARK-16845).
Data:
region dept week sal  val1 val2 val3 ... val10000
US     CS   1    1    2    1    1    ... 2
US     CS   2    1.5  2    3    1    ... 2
US     CS   3    1    2    2    2.1  ... 2
US     ELE  1    1.1  2    2    2.1  ... 2
US     ELE  2    2.1  2    2    2.1  ... 2
US     ELE  3    1    2    1    2    ... 2
UE     CS   1    2    2    1    2    ... 2
Code:
from pyspark.sql import functions as func

aggList = [func.mean(col) for col in df.columns if col not in ('region', 'dept', 'week')]  # exclude the key columns
df2 = df.groupBy('region', 'dept').agg(*aggList)
Code 2:
aggList = [func.corr('sal', col).alias(col) for col in df.columns if col not in ('region', 'dept', 'week')]  # exclude the key columns
df2 = df.groupBy('region', 'dept', 'week').agg(*aggList)
This fails. Is there any alternative way to work around this bug? Has anyone tried a DataFrame with 10K columns? Are there any suggestions for improving performance?
We also ran into the 64KB issue, but in a where clause, which is filed under another bug report. The workaround we used was simply to do the operations/transformations in several steps.
In your case, this means you don't do all the aggregations in one step. Instead, loop over the relevant columns in an outer operation:
- Use select to create a temporary dataframe which contains just the columns you need for the operation.
- Use groupBy and agg like you did, except not with the full list of aggregations, but just one (or two; you could combine the mean and the corr).
- After you have references to all the temporary dataframes, use withColumn to append the aggregated columns from the temporary dataframes to a result df (a sketch of this follows below).
Due to the lazy evaluation of the Spark DAG, this is of course slower than doing it in one operation, but it should still evaluate the whole analysis in one run.
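As a rough illustration, here is a minimal sketch of that stepwise approach. It assumes your DataFrame is named df, the keys are region, dept and week, every other column is a value column, and the _mean/_corr suffixes are just made-up names. A join on the grouping keys stands in for the withColumn step (appending columns that live in a separate dataframe needs a join), and both the mean and the corr are computed per ('region', 'dept', 'week') only to keep the sketch short:

from functools import reduce
from pyspark.sql import functions as func

key_cols = ['region', 'dept', 'week']                      # assumed grouping keys
value_cols = [c for c in df.columns if c not in key_cols]  # everything else

partial_results = []
for c in value_cols:
    # Step 1: select only the columns needed for this one aggregation.
    cols_needed = key_cols + (['sal', c] if c != 'sal' else ['sal'])
    tmp = df.select(*cols_needed)
    # Step 2: only one or two aggregations per pass, so the generated code stays small.
    agg = tmp.groupBy(key_cols).agg(
        func.mean(c).alias(c + '_mean'),
        func.corr('sal', c).alias(c + '_corr'),
    )
    partial_results.append(agg)

# Step 3: combine the per-column results on the grouping keys.
result = reduce(lambda left, right: left.join(right, key_cols), partial_results)

Each pass only generates a small amount of code, which should keep every generated method well below the 64KB limit; the price is one shuffle per value column plus the joins at the end.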
Source: https://stackoverflow.com/questions/40044779/find-mean-and-corr-of-10-000-columns-in-pyspark-dataframe