Question
(PySpark, either Spark 1.6 or 2.0, shared YARN cluster with dozens of nodes)
I'd like to run a bootstrapping analysis, with each bootstrap sample drawn from a dataset that's too large to fit on a single executor.
The naive approach I was going to start with is (sketched in code below):
- create a Spark DataFrame of the training dataset
- for i in range(1000):
- use df.sample() to create sample_df
- train the model (a logistic classifier) on sample_df
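For concreteness, a minimal sketch of that loop might look like the following, assuming the Spark 2.x ML API; the input path, column names, and sampling parameters are placeholders, not part of the original question:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("bootstrap-models").getOrCreate()

# Hypothetical training data with "features" and "label" columns already assembled.
df = spark.read.parquet("hdfs:///path/to/training_data")

models = []
for i in range(1000):
    # One bootstrap sample: sample with replacement at the full fraction.
    sample_df = df.sample(withReplacement=True, fraction=1.0, seed=i)
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    # Each fit() is itself a distributed Spark job, but the 1000 fits run sequentially.
    models.append(lr.fit(sample_df))
```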
Although each individual model is fit across the cluster, fitting the 1,000 models one after another doesn't seem very 'parallel' as an overall design.
Should I be doing this a different way?
Source: https://stackoverflow.com/questions/42859520/how-best-to-fit-many-spark-ml-models