Fitting data vs. transforming data in scikit-learn

Question

In scikit-learn, all estimators have a fit() method, and depending on whether they are supervised or unsupervised, they also have a predict() or transform() method.

I am in the process of writing a transformer for an unsupervised learning task and was wondering if there is a rule of thumb for where to put which kind of learning logic. The official documentation is not very helpful in this regard:

fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.

In this context, what is meant by both fitting data and transforming data?

Answer 1:

Fitting finds the internal parameters of a model that will be used to transform data. Transforming applies those parameters to data. You may fit a model to one set of data, and then use it to transform a completely different set.

For example, you fit a linear model to data to get a slope and intercept. Then you use those parameters to transform (i.e., map) new or existing values of x to y.

fit_transform is just doing both steps to the same data.
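As a minimal sketch of fitting on one data set and transforming another, here is the same idea with StandardScaler (chosen only for illustration; the arrays X_train and X_new are made-up values, not from the question):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data (assumed values)
X_train = np.array([[1.0], [2.0], [3.0]])
X_new = np.array([[4.0], [5.0]])

scaler = StandardScaler()
scaler.fit(X_train)  # learns mean_ and scale_ from X_train only

# transform applies the parameters learned above to different data
X_new_scaled = scaler.transform(X_new)

# fit_transform gives the same result as fit followed by transform on the same data
X_train_scaled = StandardScaler().fit_transform(X_train)
```

Note that X_new is scaled with the mean and standard deviation of X_train, not its own.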

A scikit-learn example: you fit a PCA model to the data to find the principal components. Then you transform your data to see how it maps onto these components:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)

X = [[1,2],[2,4],[1,3]]

pca.fit(X)

# This is the model to map data
pca.components_

array([[ 0.47185791,  0.88167459],
       [-0.88167459,  0.47185791]], dtype=float32)

# Now we actually map the data
pca.transform(X)

array([[-1.03896057, -0.17796634],
       [ 1.19624651, -0.11592512],
       [-0.15728599,  0.29389156]])

# Or we can do both "at once"
pca.fit_transform(X)

array([[-1.03896058, -0.1779664 ],
       [ 1.19624662, -0.11592512],
       [-0.15728603,  0.29389152]], dtype=float32)

Answer 2:

As other answers explain, fit does not need to do anything (apart from returning the transformer object). It is there so that all transformers have the same interface and work nicely with things like pipelines.
Of course, some transformers need a fit method that actually does something (think tf-idf, PCA...).
The transform method needs to return the transformed data.

fit_transform is a convenience method that chains the fit and transform operations. You can get it for free (!) by deriving your custom transformer class from TransformerMixin and implementing fit and transform.
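A minimal sketch of such a custom transformer, assuming a made-up class called MeanCenterer (hypothetical, not part of scikit-learn) that learns column means in fit and subtracts them in transform:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical example transformer: subtracts per-column means learned in fit."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)  # the learned state lives here
        return self                  # fit returns the transformer itself

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return X - self.mean_        # apply the learned state

X = [[1, 2], [2, 4], [1, 3]]

# fit_transform is not defined above; it is inherited from TransformerMixin
centered = MeanCenterer().fit_transform(X)
```

Anything learned from the data goes into fit and is stored on the object; transform only applies it.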

Hope this clarifies it a bit.

Answer 3:

In this case, calling the fit method does not do anything. As you can see in this example, not all transformers need to actually do something in their fit or transform methods. My guess is that every class in scikit-learn should implement fit, transform and/or predict so that it is consistent with the rest of the package. But I guess this is indeed quite overkill.
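To illustrate the consistency point, here is a rough sketch of a stateless transformer whose fit learns nothing, yet can still be chained in a Pipeline because it exposes the same interface. Log1pTransformer is a hypothetical class written for this example, and the data is made up:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class Log1pTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical stateless transformer: fit has nothing to learn."""

    def fit(self, X, y=None):
        return self  # nothing to learn, only return the object

    def transform(self, X):
        return np.log1p(np.asarray(X, dtype=float))

# The stateless step and a stateful step share the same fit/transform interface
pipe = Pipeline([("log", Log1pTransformer()), ("scale", StandardScaler())])

X = [[0.0, 1.0], [1.0, 3.0], [3.0, 7.0]]
out = pipe.fit_transform(X)
```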

Source: https://stackoverflow.com/questions/31572487/fitting-data-vs-transforming-data-in-scikit-learn

Tags

machine-learning

scikit-learn