Mini batch-training of a scikit-learn classifier where I provide the mini batches

Submitted by 偶尔善良 on 2019-12-07 07:33:32

Question


I have a very large dataset that cannot be loaded into memory.

I want to use this dataset as the training set for a scikit-learn classifier - for example a LogisticRegression.

Is it possible to perform mini-batch training of a scikit-learn classifier where I provide the mini-batches myself?


Answer 1:


Some of the classifiers in sklearn have a partial_fit method. This method allows you to pass minibatches of data to the classifier, so that an incremental update (a gradient descent step, for SGD-based estimators) is performed for each minibatch. You simply load a minibatch from disk, pass it to partial_fit, release the minibatch from memory, and repeat. Note that the first call to partial_fit must be given the full set of class labels via its classes argument, since any single minibatch may not contain every class.
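A minimal sketch of this loop, using GaussianNB (one of the estimators with partial_fit). The load_minibatch helper is a hypothetical stand-in for reading a chunk of your dataset from disk; here it just generates synthetic NumPy arrays:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

def load_minibatch(batch_size=100, n_features=10):
    # Hypothetical stand-in: in practice, read the next chunk of your
    # dataset from disk here (e.g. with numpy.load or pandas chunks).
    X = rng.normal(size=(batch_size, n_features))
    y = (X[:, 0] > 0).astype(int)  # synthetic, linearly separable labels
    return X, y

clf = GaussianNB()
classes = np.array([0, 1])  # partial_fit needs all classes on the first call

for _ in range(50):  # one pass over 50 minibatches
    X_batch, y_batch = load_minibatch()
    clf.partial_fit(X_batch, y_batch, classes=classes)
    # X_batch / y_batch go out of scope here and can be garbage-collected

X_test, y_test = load_minibatch()
print(clf.score(X_test, y_test))
```

Only one minibatch is ever held in memory at a time, so the full dataset size is bounded only by disk.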

If you are particularly interested in doing this for logistic regression, then you'll want to use SGDClassifier, which fits a logistic-regression model when loss = 'log' (renamed to 'log_loss' in scikit-learn 1.1).

You simply pass the features and labels for your minibatch to partial_fit in the same way that you would use fit:

clf.partial_fit(X_minibatch, y_minibatch)

Update:

I recently came across the dask-ml library, which makes this task very easy by combining dask arrays with partial_fit. There is an example in the dask-ml documentation.




Answer 2:


Have a look at the scaling strategies included in the sklearn documentation: http://scikit-learn.org/stable/modules/scaling_strategies.html

A good example is provided here: http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html



Source: https://stackoverflow.com/questions/46927095/mini-batch-training-of-a-scikit-learn-classifier-where-i-provide-the-mini-batche
