Question
I'm trying to analyse the features of the Pima Indians Diabetes Data Set (follow the link to get the dataset) by plotting their probability density distributions. I haven't yet removed invalid 0 data, so the plots sometimes show a bias at the very left. For the most part, the distributions look accurate:
I have a problem with the look of the plot for DiabetesPedigree, which shows probabilities over 1.0 (for x ~ between 0.1 and 0.5). As I understand it, the combined probabilities should equal 1.0.
I've isolated the code for the DiabetesPedigree plot, but the same code will work for the other features by changing the dataset_index value:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
dataset_index = 6
feature_name = "DiabetesPedigree"
filename = 'pima-indians-diabetes.data.csv'
data = pd.read_csv(filename)
feature_data = data.iloc[:, dataset_index]  # .ix was removed in pandas 1.0; select the column by position
graph_min = feature_data.min()
graph_max = feature_data.max()
density = gaussian_kde(feature_data)
density.covariance_factor = lambda : .25
density._compute_covariance()
xs = np.arange(graph_min, graph_max, (graph_max - graph_min)/200)
ys = density(xs)
plt.xlim(graph_min, graph_max)
plt.title(feature_name)
plt.plot(xs,ys)
plt.show()
Answer 1:
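As was rightly pointed out, nothing requires the pdf of a continuous random variable to stay below 1. For a continuous random variable, the function p(x) is a probability density, not a probability. What must equal 1.0 is the area under the curve (the integral of p(x) over all x), so when the data are concentrated in a narrow range, as with DiabetesPedigree, the curve has to rise above 1 for the area to reach 1. For background, you can refer to any introduction to continuous random variables and their distributions.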
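To make this concrete, here is a minimal sketch that checks the claim numerically. It does not use the Pima dataset: the sample is synthetic (a gamma distribution chosen only because it is concentrated in a narrow range, roughly like DiabetesPedigree), and it uses the default bandwidth of gaussian_kde rather than the 0.25 factor from the question. The names `sample`, `area` and `prob_interval` are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic stand-in for a feature concentrated in a narrow range
# (an assumption for illustration, not the real DiabetesPedigree column).
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=0.15, size=768)

density = gaussian_kde(sample)

# Evaluate the density on a grid padded well beyond the data range.
xs = np.linspace(sample.min() - 1.0, sample.max() + 1.0, 2000)
ys = density(xs)

# Area under the curve via the trapezoidal rule.
area = np.sum((ys[1:] + ys[:-1]) / 2.0 * np.diff(xs))

# An actual probability: P(0.1 <= X <= 0.5) under the estimated density.
prob_interval = density.integrate_box_1d(0.1, 0.5)

print("peak density:", ys.max())
print("area under curve:", area)
print("P(0.1 <= X <= 0.5):", prob_interval)
```

The peak density comes out above 1, while the area under the curve stays at approximately 1 and the probability of any interval stays between 0 and 1. Reading a probability off a density plot therefore means looking at an area, not at the height of the curve.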
Source: https://stackoverflow.com/questions/46441481/why-does-this-kernel-density-estimation-have-values-over-1-0