How are tasks distributed within a Spark cluster?


spark.parallelize(algorithms).map(...)

From the reference documentation: "The elements of the collection are copied to form a distributed dataset that can be operated on in parallel." That means your algorithms will be scattered among your nodes, and each of them will then execute on the node it landed on.

Your scheme could be valid if the algorithms and their respective parameters are distributed that way, which I think is the case for you.
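A minimal PySpark sketch of that pattern (the list contents, the train_and_evaluate helper, and the scores are purely illustrative, not your actual code):

    from pyspark import SparkContext

    sc = SparkContext(appName="distribute-algorithms")

    # One element per algorithm; the parameters travel together with the name.
    algorithms = [
        ("svm", {"C": 1.0}),
        ("random_forest", {"n_estimators": 100}),
        ("logistic_regression", {"penalty": "l2"}),
    ]

    def train_and_evaluate(item):
        name, params = item
        # Build and evaluate the model locally on whichever executor got this element.
        return (name, 0.0)  # placeholder score

    # Each (name, params) pair becomes one RDD element, is shipped to an executor,
    # and train_and_evaluate runs there.
    results = sc.parallelize(algorithms).map(train_and_evaluate).collect()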

As for using all of your resources: Spark is very good at this. However, you need to check that the workload is balanced among your tasks (every task should do roughly the same amount of work) in order to get good performance.
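Continuing the sketch above, one quick way to check the balance is to look at how many elements land in each partition (keeping in mind that equal counts only mean equal work when every algorithm costs roughly the same to train):

    # Rough balance check: elements per partition
    rdd = sc.parallelize(algorithms, numSlices=3)
    print(rdd.glom().map(len).collect())  # e.g. [1, 1, 1]; very uneven counts mean skew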


What changes if instead of the first approach with parallelize, I use a for loop?

Everything. With a plain for loop, your dataset (the algorithms, in your case) is never turned into an RDD, so the loop runs sequentially on the driver and no parallel execution occurs.
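For contrast, a sketch of the loop version, reusing the names from the earlier example: every iteration runs in the driver process, one after another, while the cluster sits idle:

    # Sequential: no RDD is ever created, so Spark never distributes anything.
    results = []
    for name, params in algorithms:
        results.append(train_and_evaluate((name, params)))  # executes on the driver only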

.. and also using databricks's spark-sklearn integration between Spark and scikit-learn?

This article describes how Random Forests are implemented there:

"The scikit-learn package for Spark provides an alternative implementation of the cross-validation algorithm that distributes the workload on a Spark cluster. Each node runs the training algorithm using a local copy of the scikit-learn library, and reports the best model back to the master."

We can generalize this to all of your algorithms, which makes your scheme reasonable.
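For cross-validation specifically, the usage sketched in the spark-sklearn README looks roughly like this: a drop-in replacement for scikit-learn's GridSearchCV that additionally takes a SparkContext:

    from sklearn import svm, datasets
    from spark_sklearn import GridSearchCV

    iris = datasets.load_iris()
    param_grid = {"kernel": ("linear", "rbf"), "C": [1, 10]}

    # Each parameter combination is trained with local scikit-learn on an executor;
    # the best model is reported back to the driver.
    clf = GridSearchCV(sc, svm.SVC(), param_grid)
    clf.fit(iris.data, iris.target)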


Spark MLlib instead of scikit-learn, would the whole parallelization/distribution be taken care of?

Yes, it would. The idea behind both of these libraries is to take care of such things for us and make our lives easier.
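For example, with Spark MLlib the training data itself is a distributed DataFrame and the estimator distributes the work internally; a sketch, assuming a DataFrame named training with the default features and label columns:

    from pyspark.ml.classification import RandomForestClassifier

    # No explicit parallelize/map: fitting is distributed across the cluster for you.
    rf = RandomForestClassifier(numTrees=100)
    model = rf.fit(training)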


I would advise you to ask one big question at a time, since the answer to this one is already quite broad, but I will try to be concise.
