How best to fit many Spark ML models

Submitted by 試著忘記壹切 on 2019-12-11 02:25:33

Question


(PySpark, either Spark 1.6 or 2.0, shared YARN cluster with dozens of nodes)

I'd like to run a bootstrap analysis, with each bootstrap sample drawn from a dataset that's too large to fit on a single executor.

The naive approach I was going to start with is:

  • create spark dataframe of training dataset
  • for i in range(1000):
    • use df.sample() to create a sample_df
    • train the model (logistic classifier) on sample_df

Although each individual model is fit across the cluster, running the fits one after another doesn't seem very 'parallel' in spirit.

Should I be doing this a different way?

Source: https://stackoverflow.com/questions/42859520/how-best-to-fit-many-spark-ml-models
