Question
So my input consists of a dataset and several ML algorithms (with parameter tuning) using scikit-learn. I have made quite a few attempts at executing this as efficiently as possible, but at the moment I still don't have the proper infrastructure to assess my results. Moreover, I lack some background in this area and need help getting things cleared up.
Basically, I want to know how the tasks are distributed so that all the available resources are exploited as much as possible, and what is actually done implicitly (for instance by Spark) and what isn't.
This is my scenario:
I need to train many different Decision Tree models (as many as there are combinations of parameters), many different Random Forest models, and so on...
In one of my approaches, I have a list and each of its elements corresponds to one ML algorithm and its list of parameters.
spark.parallelize(algorithms).map(lambda algorithm: run_experiment(dataframe, algorithm))
In this function run_experiment I create a GridSearchCV for the corresponding ML algorithm with its parameter grid. I also set n_jobs=-1 in order to (try to) achieve maximum parallelism.
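A minimal sketch of what run_experiment could look like under these assumptions (the (estimator, param_grid) pairs and the (X, y) representation of the data are illustrative placeholders, not my actual code):

from sklearn.model_selection import GridSearchCV

def run_experiment(dataframe, algorithm):
    # Assumption: each element of the algorithms list is an (estimator, param_grid) pair,
    # e.g. (DecisionTreeClassifier(), {"max_depth": [3, 5, 10]}).
    estimator, param_grid = algorithm
    # Assumption: the data arrives as a (features, labels) pair that scikit-learn can consume.
    X, y = dataframe
    search = GridSearchCV(estimator, param_grid, n_jobs=-1)  # use all cores on this one node
    search.fit(X, y)
    return estimator.__class__.__name__, search.best_params_, search.best_score_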
In this context, on my Spark cluster with a few nodes, does it make sense to expect the execution to look something like this?
Or could there be one Decision Tree model and one Random Forest model running on the same node? This is my first experience with a cluster environment, so I am a bit confused about how to expect things to work.
On the other hand, what exactly changes in terms of execution if, instead of the first approach with parallelize, I use a for loop to sequentially iterate through my list of algorithms and create each GridSearchCV using Databricks' spark-sklearn integration between Spark and scikit-learn? The way it's illustrated in the documentation, it seems to be something like this:
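(The exact snippet from the spark-sklearn documentation isn't reproduced here; the following is only a rough sketch of that second approach, where sc, algorithms, X and y are placeholder names.)

from spark_sklearn import GridSearchCV  # Databricks' distributed drop-in for scikit-learn's GridSearchCV

for estimator, param_grid in algorithms:              # plain driver-side loop over the algorithms
    search = GridSearchCV(sc, estimator, param_grid)  # the SparkContext distributes the grid evaluation
    search.fit(X, y)
    print(search.best_params_, search.best_score_)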
Finally, regarding this second approach: if I used the same ML algorithms but with Spark MLlib instead of scikit-learn, would the whole parallelization/distribution be taken care of for me?
Sorry if most of this is a bit naive, but I really appreciate any answers or insights on this. I wanted to understand the basics before actually testing in the cluster and playing with task scheduling parameters.
I am not sure whether this question is more suitable here or on CS Stack Exchange.
Answer 1:
spark.parallelize(algorithms).map(...)
From the reference: "The elements of the collection are copied to form a distributed dataset that can be operated on in parallel." That means your algorithms are going to be scattered among your nodes, and each one will then execute on the node it landed on.
Your scheme could be valid if the algorithms and their respective parameters are scattered that way, which I think is the case for you.
As for using all your resources, Spark is very good at this. However, you need to check that the workload is balanced among your tasks (i.e. that every task does roughly the same amount of work) in order to get good performance.
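One related knob (my addition, not part of the original answer): parallelize accepts a numSlices argument, so you can force one algorithm per partition and therefore one Spark task per algorithm, which avoids a cheap Decision Tree grid and an expensive Random Forest grid being packed into the same task:

# Assumption: 'algorithms' is a plain Python list of (estimator, param_grid) pairs.
rdd = sc.parallelize(algorithms, numSlices=len(algorithms))   # one element per partition -> one task per algorithm
results = rdd.map(lambda algorithm: run_experiment(dataframe, algorithm)).collect()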
What changes if, instead of the first approach with parallelize, I use a for loop?
Everything. Your collection (the list of algorithms, in your case) is not an RDD, so no parallel execution occurs.
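For contrast, this is roughly what the for-loop variant amounts to (a sketch, not code from the question):

# The plain loop runs entirely on the driver, fitting one GridSearchCV after another;
# assuming run_experiment still calls plain scikit-learn, only the n_jobs=-1 parallelism
# inside that single machine is used, and the rest of the cluster stays idle.
for algorithm in algorithms:          # regular Python list, not an RDD
    run_experiment(dataframe, algorithm)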
... and also when using Databricks' spark-sklearn integration between Spark and scikit-learn?
This article describes how Random Forests are implemented there:
"The scikit-learn package for Spark provides an alternative implementation of the cross-validation algorithm that distributes the workload on a Spark cluster. Each node runs the training algorithm using a local copy of the scikit-learn library, and reports the best model back to the master."
We can generalize this to all your algorithms, which makes your scheme reasonable.
With Spark MLlib instead of scikit-learn, would the whole parallelization/distribution be taken care of?
Yes, it would. The idea of both of these libraries is to take care of these things for us, so that our lives are easier.
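As a minimal sketch of that MLlib route (my own illustration, assuming a Spark DataFrame train_df with 'features' and 'label' columns):

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestClassifier(featuresCol="features", labelCol="label")
grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [20, 50])
        .addGrid(rf.maxDepth, [5, 10])
        .build())
cv = CrossValidator(estimator=rf,
                    estimatorParamMaps=grid,
                    evaluator=MulticlassClassificationEvaluator(labelCol="label"),
                    numFolds=3)
# Spark distributes both the model training and the grid evaluation across the cluster.
cv_model = cv.fit(train_df)   # train_df: assumed Spark DataFrame with 'features' and 'label'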
I would advise you to ask one big question at a time, since the answer is too broad now, but I will try to be laconic.
Source: https://stackoverflow.com/questions/44202084/how-are-tasks-distributed-within-a-spark-cluster