How to determine the probability distribution function from a NumPy array?

旧时难觅i 2021-01-31 19:13

I have searched around and to my surprise it seems that this question has not been answered.

I have a NumPy array containing 10000 values from measurements. I have plotted a histogram of the data and would like to determine which probability distribution it follows.

2 Answers
  •  轮回少年
    2021-01-31 19:38

    Testing whether a large sample of measurements fits a given distribution is usually tricky: with so many points, any departure from the distribution is detected by the test, which then rejects the distribution.

    This is why I generally use the QQ-plot for this purpose. This is a graphical tool in which the X-axis plots the quantiles of the data and the Y-axis plots the quantiles of the fitted distribution. The graphical analysis lets you select which part of the distribution matters for your specific study: the central dispersion, the lower tail, or the upper tail.

    To do this, I use the DrawQQplot function from OpenTURNS.

    import openturns as ot
    import numpy as np

    # s is the 1-D NumPy array containing the 10000 measured values
    sample = ot.Sample(s, 1)

    # Fit a normal distribution to the sample and draw the QQ-plot against it
    tested_distribution = ot.NormalFactory().build(sample)
    QQ_plot = ot.VisualTest.DrawQQplot(sample, tested_distribution)
    

    This produces the following graph. [QQ-plot: sample quantiles against quantiles of the fitted normal distribution; the points lie close to the diagonal test line.]
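
    To actually render the graph on screen, the openturns.viewer module can wrap it in a Matplotlib figure. A minimal sketch, assuming Matplotlib is installed:

    from openturns.viewer import View

    # Wrap the OpenTURNS graph in a Matplotlib figure and display it
    View(QQ_plot).show()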

    The QQ-plot validates the distribution when the points lie on the test line. In the current situation the fit is excellent, although we notice that the extreme quantiles of the data do not fit as well (as we might expect, given the low probability density of such events).

    Just to show what often happens, I tried the BetaFactory, which is obviously a wrong choice here!

    # Fit a (deliberately ill-suited) beta distribution and draw its QQ-plot
    tested_distribution = ot.BetaFactory().build(sample)
    QQ_plot = ot.VisualTest.DrawQQplot(sample, tested_distribution)
    

    This produces: [QQ-plot: sample quantiles against quantiles of the fitted beta distribution; the points leave the test line in both tails.]

    The QQ-plot is now clear: the fit is acceptable in the central part of the distribution, but cannot be accepted for quantiles lower than -0.2 or greater than 0.2. Notice that the beta distribution, with its 4 parameters, is flexible enough to do a good job of fitting the data within the [-0.2, 0.2] interval.
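
    To put numbers on what the plot shows, you can compare the empirical quantiles of the data with the quantiles of the fitted distribution directly. This is a small check of my own, not part of the workflow above; the probability levels are arbitrary choices:

    # Compare empirical quantiles with quantiles of the fitted distribution
    for p in [0.01, 0.5, 0.99]:
        empirical_q = sample.computeQuantilePerComponent(p)[0]
        fitted_q = tested_distribution.computeQuantile(p)[0]
        print(p, empirical_q, fitted_q)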

    With a large sample size, I would rather use KernelSmoothing than a histogram. It is more accurate, i.e. closer to the true, unknown PDF: in terms of AMISE error, kernel smoothing can reach a convergence rate of n^{-4/5}, compared with n^{-2/3} for the histogram. It also produces a continuous distribution (your distribution seems continuous). If the sample is really large, binning can be activated, which reduces the CPU cost.
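
    A minimal sketch of that approach; the array s here is a simulated stand-in for your measurement data, and the second constructor argument activates binning:

    import openturns as ot
    import numpy as np

    # Stand-in for the 10000 measured values (replace with your own array)
    s = np.random.normal(0.0, 0.1, 10000)
    sample = ot.Sample(s.reshape(-1, 1))

    # Kernel smoothing with a Gaussian kernel; binned=True reduces CPU cost
    factory = ot.KernelSmoothing(ot.Normal(), True)
    fitted = factory.build(sample)

    # The result is a continuous Distribution: evaluate or draw its PDF
    pdf_graph = fitted.drawPDF()
    print(fitted.computePDF([0.0]))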
