Find mean and corr of 10,000 columns in a PySpark DataFrame

Submitted by 冷暖自知 on 2019-12-04 12:50:30

We also ran into the 64KB issue, but in a where clause; that one is tracked under a separate bug report. The workaround we used is simply to do the operations/transformations in several steps.

In your case, this means that you don't do all the aggregations in one step. Instead, loop over the relevant columns in an outer operation:

  • Use select to create a temporary dataframe that contains just the columns you need for the operation.
  • Use groupBy and agg like you did, but not with the whole list of aggregations; just one (or two, since you could combine the mean and corr).
  • Once you have references to all the temporary dataframes, use withColumn (or a join on the grouping key) to append the aggregated columns from the temporary dataframes to a result dataframe.

Due to the lazy evaluation of the Spark DAG, this is of course slower than doing it in one operation, but it should still evaluate the whole analysis in a single run. A sketch of the loop is shown below.
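
A minimal sketch of that step-by-step approach, assuming a DataFrame with a grouping column named key and numeric value columns (the column names and toy data here are placeholders, and a join on the grouping key is used to combine the temporary dataframes):

```python
from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder data; in the question this would be a DataFrame with ~10,000 columns.
df = spark.createDataFrame(
    [("a", 1.0, 2.0), ("a", 3.0, 4.0), ("b", 5.0, 6.0)],
    ["key", "col1", "col2"],
)

numeric_cols = [c for c in df.columns if c != "key"]

# One small aggregation per column instead of a single huge agg() call,
# so each generated code block stays well below the 64KB limit.
partials = []
for c in numeric_cols:
    tmp = df.select("key", c)                          # step 1: narrow select
    agg = tmp.groupBy("key").agg(                      # step 2: small agg per column
        F.mean(c).alias(f"{c}_mean")
    )
    partials.append(agg)

# Step 3: combine the per-column results into one result DataFrame
# (a join on the grouping key instead of withColumn across DataFrames).
result = reduce(lambda left, right: left.join(right, on="key"), partials)
result.show()
```

The correlations could be handled the same way, looping over the relevant column pairs and adding an F.corr(col_a, col_b) aggregation to each small agg call.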
