Fitting partial Gaussian

问题

I'm trying to fit a sum of gaussians using scikit-learn because the scikit-learn GaussianMixture seems much more robust than using curve_fit.

Problem: It doesn't do a great job in fitting a truncated part of even a single gaussian peak:

from sklearn import mixture
import matplotlib.pyplot
import matplotlib.mlab
import numpy as np

clf = mixture.GaussianMixture(n_components=1, covariance_type='full')
data = np.random.randn(10000)
data = [[x] for x in data]
clf.fit(data)
data = [item for sublist in data for item in sublist]
rangeMin = int(np.floor(np.min(data)))
rangeMax = int(np.ceil(np.max(data)))
h = matplotlib.pyplot.hist(data, range=(rangeMin, rangeMax), normed=True);
plt.plot(np.linspace(rangeMin, rangeMax),
         mlab.normpdf(np.linspace(rangeMin, rangeMax),
                      clf.means_, np.sqrt(clf.covariances_[0]))[0])

gives now changing data = [[x] for x in data] to data = [[x] for x in data if x <0] in order to truncate the distribution returns Any ideas how to get the truncation fitted properly?

Note: The distribution isn't necessarily truncated in the middle, there could be anything between 50% and 100% of the full distribution left.

I would also be happy if anyone can point me to alternative packages. I've only tried curve_fit but couldn't get it to do anything useful as soon as more than two peaks are involved.

回答1:

A bit brutish, but simple solution would be to split the curve in two halfs (data = [[x] for x in data if x < 0]), mirror the left part (data.append([-data[d][0]])) and then do the regular Gaussian fit.

import numpy as np
from sklearn import mixture
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab

np.random.seed(seed=42)
n = 10000

clf = mixture.GaussianMixture(n_components=1, covariance_type='full')

#split the data and mirror it
data = np.random.randn(n)
data = [[x] for x in data if x < 0]
n = len(data)
for d in range(n):
    data.append([-data[d][0]])

clf.fit(data)
data = [item for sublist in data for item in sublist]
rangeMin = int(np.floor(np.min(data)))
rangeMax = int(np.ceil(np.max(data)))
h = plt.hist(data[0:n], bins=20, range=(rangeMin, rangeMax), normed=True);
plt.plot(np.linspace(rangeMin, rangeMax),
         mlab.normpdf(np.linspace(rangeMin, rangeMax),
                      clf.means_, np.sqrt(clf.covariances_[0]))[0] * 2)

plt.show()

回答2:

lhcgeneva the problem is once you have data that doesn't include the maximum of the curve more and more possible Gaussians can fit:

Black point represent the data, red points the fitted Gaussian

In the figure, black points represent the data to fit a curve, the red points the fitted results. This result was achieved by using A Simple Algorithm for Fitting a Gaussian Function

来源：https://stackoverflow.com/questions/41924857/fitting-partial-gaussian

标签

numpy

scipy

scikit-learn

curve-fitting

gaussian