Generate random data based on existing data

好久不见. Submitted on 2021-02-16 14:52:16

Question


Is there a way in Python to generate random data based on the distribution of already existing data?

Here are the statistical parameters of my dataset:

Data
count   209.000000
mean    1.280144
std     0.374602
min     0.880000
25%     1.060000
50%     1.150000
75%     1.400000
max     4.140000

Since the data is not normally distributed, np.random.normal is not an option. Any ideas?

Thank you.

Edit: Performing KDE:

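import seaborn as sns
from sklearn.neighbors import KernelDensity
# Gaussian KDE fitted to column 'y' of the existing DataFrame `data`
kde = KernelDensity(kernel='gaussian', bandwidth=0.525566).fit(data['y'].to_numpy().reshape(-1, 1))
# Draw 2400 samples from the fitted KDE and plot their distribution
sns.histplot(kde.sample(2400).ravel(), kde=True)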


Answer 1:


In general, real-world data doesn't exactly follow a "nice" distribution like the normal or Weibull distributions.

As in machine learning, sampling from a distribution of data points generally takes two steps:

  • Fit a data model to the data.

  • Then, predict a new data point based on that model, with the help of randomness.
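
The two steps above can be sketched with a Gaussian kernel density estimate, using only the Python standard library. This is a minimal illustration, not a definitive implementation: the names fit_kde and sample_kde are hypothetical, the bandwidth comes from a normal-reference rule of thumb (one common choice among several), and the data list is a made-up placeholder standing in for the real dataset.

```python
import random
import statistics

def fit_kde(points):
    """Step 1: 'fit' a Gaussian KDE, i.e. keep the points and pick a bandwidth."""
    n = len(points)
    # Normal-reference rule of thumb (an assumption; other bandwidths work too)
    bandwidth = 1.06 * statistics.stdev(points) * n ** (-1 / 5)
    return list(points), bandwidth

def sample_kde(model, size, rng):
    """Step 2: pick a random data point and add Gaussian noise scaled by the bandwidth."""
    points, bandwidth = model
    return [rng.choice(points) + rng.gauss(0.0, bandwidth) for _ in range(size)]

rng = random.Random(0)
# Placeholder data (right-skewed, 209 values), NOT the dataset from the question
data = [0.88 + rng.expovariate(2.5) for _ in range(209)]
model = fit_kde(data)
new_points = sample_kde(model, 2400, rng)
```

Drawing from a Gaussian KDE amounts to resampling the original points with a little noise added, so a large bandwidth blurs the shape of the data while a small one reproduces it almost exactly.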

There are several ways to estimate the distribution of data and sample from that estimate:

  • Kernel density estimation.
  • Gaussian mixture models.
  • Histograms.
  • Regression models.
  • Other machine learning models.
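
As one example from this list, histogram-based sampling might look like the sketch below. It assumes equal-width bins and at least two distinct values in the data; fit_histogram, sample_histogram, the bin count of 20, and the placeholder data are all illustrative choices, not part of the original answer.

```python
import random

def fit_histogram(points, bins):
    """Count how many points fall into each of `bins` equal-width bins."""
    lo, hi = min(points), max(points)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in points:
        # Clamp so that the maximum value lands in the last bin
        index = min(int((x - lo) / width), bins - 1)
        counts[index] += 1
    return lo, width, counts

def sample_histogram(model, size, rng):
    """Choose a bin with probability proportional to its count, then a uniform point inside it."""
    lo, width, counts = model
    chosen = rng.choices(range(len(counts)), weights=counts, k=size)
    return [lo + (i + rng.random()) * width for i in chosen]

rng = random.Random(1)
# Placeholder data (right-skewed, 209 values), NOT the dataset from the question
data = [0.88 + rng.expovariate(2.5) for _ in range(209)]
hist = fit_histogram(data, bins=20)
new_points = sample_histogram(hist, 2400, rng)
```

Unlike a Gaussian KDE, this approach never produces values outside the observed range, which may or may not be desirable depending on the use case.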

In addition, methods such as maximum likelihood estimation make it possible to fit a known distribution (such as the normal distribution) to data, but the resulting fit is generally a rougher approximation of the data than kernel density estimation or other machine learning models.
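
To illustrate fitting a known distribution, the sketch below uses a log-normal distribution, for which the maximum likelihood estimates have a closed form (the mean and population standard deviation of the logarithms). The choice of a log-normal is an assumption made for the example, as are the name fit_lognormal and the placeholder data; nothing in the question establishes that the real data follows this family.

```python
import math
import random
import statistics

def fit_lognormal(points):
    """Maximum likelihood estimates for a log-normal: mean and population std of log(x)."""
    logs = [math.log(x) for x in points]
    return statistics.fmean(logs), statistics.pstdev(logs)

rng = random.Random(2)
# Placeholder data (right-skewed, 209 positive values), NOT the dataset from the question
data = [0.88 + rng.expovariate(2.5) for _ in range(209)]
mu, sigma = fit_lognormal(data)
# Draw new values from the fitted log-normal distribution
new_points = [rng.lognormvariate(mu, sigma) for _ in range(2400)]
```

A fitted log-normal assigns probability to every positive value, so it can generate points below the smallest observation, which is one way a parametric fit can be rougher than a KDE or histogram.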

See also my section "Random Numbers from a Distribution of Data Points".



Source: https://stackoverflow.com/questions/60738292/generate-random-data-based-on-existing-data
