Question
(PySpark, either Spark 1.6 or 2.0, shared YARN cluster with dozens of nodes)
I'd like to run a bootstrapping analysis, with each bootstrap sample drawn from a dataset that's too large to fit on a single executor.
The naive approach I was going to start with is (sketched in code below):
- create a Spark DataFrame of the training dataset
- for i in range(1000):
- use df.sample() to create sample_df
- train the model (a logistic classifier) on sample_df
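For concreteness, a minimal sketch of that loop might look like the following, assuming the Spark 2.x ML API; the input path, column names, and sampling parameters are placeholders, not part of the original question:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("bootstrap-models").getOrCreate()

# Hypothetical training data with "features" and "label" columns already assembled.
df = spark.read.parquet("hdfs:///path/to/training_data")

models = []
for i in range(1000):
    # One bootstrap sample: sample with replacement at the full fraction.
    sample_df = df.sample(withReplacement=True, fraction=1.0, seed=i)
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    # Each fit() is itself a distributed Spark job, but the 1000 fits run sequentially.
    models.append(lr.fit(sample_df))
```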
Although each individual model is fit across the cluster, fitting the 1,000 models one after another doesn't seem very 'parallel' as an overall design.
Should I be doing this a different way?
Source: https://stackoverflow.com/questions/42859520/how-best-to-fit-many-spark-ml-models