How can I subsample an array according to its density? (Remove frequent values, keep rare ones)

遇见更好的自我 2020-12-10 00:32

I want to plot a data distribution in which some values occur frequently while others are quite rare. There are around 30,000 points in total. R

2 Answers
  •  旧时难觅i
    2020-12-10 00:35

    One possible approach is to use kernel density estimation (KDE) to build an estimate of the data's probability density, then sample each point with probability proportional to the inverse of its estimated density (or to some other function that decreases as the estimated density increases). There are several tools for computing a KDE in Python; a simple one is scipy.stats.gaussian_kde. Here is an example of the idea:

    import numpy as np
    import scipy.stats
    import matplotlib.pyplot as plt
    
    np.random.seed(100)
    # Make some random Gaussian data
    data = np.random.multivariate_normal([1, 1], [[1, 0], [0, 1]], size=1000)
    # Compute KDE
    kde = scipy.stats.gaussian_kde(data.T)
    # Choice probabilities are computed from inverse probability density in KDE
    p = 1 / kde.pdf(data.T)
    # Normalize choice probabilities
    p /= np.sum(p)
    # Make sample using choice probabilities
    idx = np.random.choice(np.arange(len(data)), size=100, replace=False, p=p)
    sample = data[idx]
    # Plot
    plt.figure()
    plt.scatter(data[:, 0], data[:, 1], label='Data', s=10)
    plt.scatter(sample[:, 0], sample[:, 1], label='Sample', s=7)
    plt.legend()
    plt.show()
    

    Output: [scatter plot of the full data set with the 100-point sample overlaid; image not preserved]
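The "some other function" mentioned above can be made concrete with an exponent on the density. The following is only a minimal sketch of that variation, not part of the original answer: the helper name `density_subsample` and its `alpha` and `seed` parameters are hypothetical. With `alpha=0` every point is equally likely, `alpha=1` reproduces plain inverse-density sampling, and values in between thin out dense regions less aggressively. It also accepts a 1-D array, which may be closer to the case in the question.

```python
import numpy as np
import scipy.stats


def density_subsample(data, size, alpha=1.0, seed=None):
    """Pick `size` entries of `data`, favouring points in sparse regions.

    Hypothetical helper: choice weights are density ** (-alpha).
    """
    data = np.asarray(data, dtype=float)
    # gaussian_kde expects shape (n_dims, n_points); a 1-D array is used as is
    values = data if data.ndim == 1 else data.T
    density = scipy.stats.gaussian_kde(values).pdf(values)
    # Weights shrink as the estimated density grows; alpha sets how strongly
    weights = density ** (-alpha)
    weights /= weights.sum()
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(data), size=size, replace=False, p=weights)
    return data[idx]


# Same kind of test data as above: 1000 points from a 2-D standard Gaussian at (1, 1)
rng = np.random.default_rng(100)
data = rng.normal(loc=1.0, scale=1.0, size=(1000, 2))
sample = density_subsample(data, size=100, alpha=1.0, seed=0)
```

For around 30,000 points, evaluating the KDE at every point compares each point with every other, so it may take noticeably longer than in this small example; fitting the KDE on a random subset of the data is one possible way to reduce that cost.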
