Generate correlated data in Python (3.3)

臣服心动 2020-12-28 09:51

In R there is a function (cm.rnorm.cor, from package CreditMetrics) that takes the number of samples, the number of variables, and a correlation m

2 Answers
  •  粉色の甜心
    2020-12-28 10:27

    If you Cholesky-decompose a covariance matrix C into L L^T, and generate a random vector x whose components are independent with zero mean and unit variance, then Lx will be a random vector with covariance C.
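
    The reason this works can be written out in one line. This is a sketch that assumes x has zero mean and identity covariance, which holds for independent standard normal draws:

    ```latex
    \operatorname{Cov}(Lx) = \mathbb{E}\left[(Lx)(Lx)^T\right] = L\,\mathbb{E}\left[x x^T\right] L^T = L I L^T = L L^T = C
    ```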

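    import numpy as np
    import matplotlib.pyplot as plt
    linalg = np.linalg
    np.random.seed(1)  # fixed seed so the plot is reproducible
    
    num_samples = 1000
    num_variables = 2
    cov = [[0.3, 0.2], [0.2, 0.2]]  # target covariance (must be positive definite)
    
    # Lower-triangular factor L such that np.dot(L, L.T) equals cov
    L = linalg.cholesky(cov)
    # print(L.shape)
    # (2, 2)
    # Each column is one sample of independent standard normal variables
    uncorrelated = np.random.standard_normal((num_variables, num_samples))
    mean = [1, 1]
    # Apply L to introduce the covariance, then shift each row by its mean
    correlated = np.dot(L, uncorrelated) + np.array(mean).reshape(2, 1)
    # print(correlated.shape)
    # (2, 1000)
    plt.scatter(correlated[0, :], correlated[1, :], c='green')
    plt.show()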
    


    Reference: See Cholesky decomposition
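
    One way to check the construction numerically is to compare the sample covariance of the generated data with the target matrix. The following is a minimal sketch, not part of the original answer; the larger sample size is an assumption chosen only to make the estimate tight:

    ```python
    import numpy as np

    np.random.seed(1)

    num_samples = 100000  # assumption: larger than above so the estimate is tight
    cov = np.array([[0.3, 0.2], [0.2, 0.2]])

    L = np.linalg.cholesky(cov)
    uncorrelated = np.random.standard_normal((2, num_samples))
    correlated = np.dot(L, uncorrelated)

    # np.cov treats each row as a variable and each column as an observation
    sample_cov = np.cov(correlated)
    print(sample_cov)
    ```

    The printed matrix should be close to cov, with small deviations due to sampling noise.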


    If you want to generate two series, X and Y, with a particular (Pearson) correlation coefficient (e.g. 0.2):

    rho = cov(X,Y) / sqrt(var(X)*var(Y))
    

    you could choose the covariance matrix to be

    cov = [[1, 0.2],
           [0.2, 1]]
    

    This makes cov(X,Y) = 0.2 and the variances var(X) and var(Y) both equal to 1, so rho = 0.2 / sqrt(1*1) = 0.2.
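
    If the series should have variances other than 1, the same formula can be inverted to build the covariance matrix from a correlation matrix and per-variable standard deviations. A minimal sketch; the values in stds are hypothetical and chosen only for illustration:

    ```python
    import numpy as np

    corr = np.array([[1.0, 0.2],
                     [0.2, 1.0]])
    stds = np.array([2.0, 0.5])  # hypothetical standard deviations of X and Y

    # cov[i, j] = corr[i, j] * stds[i] * stds[j]
    cov = corr * np.outer(stds, stds)

    # Recover rho from the covariance matrix using the formula above
    rho_check = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
    print(cov)
    print(rho_check)
    ```

    The resulting cov can be passed to the Cholesky step unchanged; the correlation coefficient it encodes is still 0.2.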

    For example, below we generate pairs of correlated series, X and Y, 1000 times. Then we plot a histogram of the correlation coefficients:

    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.stats as stats
    linalg = np.linalg
    np.random.seed(1)
    
    num_samples = 1000
    num_variables = 2
    cov = [[1.0, 0.2], [0.2, 1.0]]  # unit variances, so cov(X, Y) equals rho
    
    L = linalg.cholesky(cov)
    
    rhos = []
    for _ in range(1000):
        # Draw a fresh pair of series, each of length num_samples
        uncorrelated = np.random.standard_normal((num_variables, num_samples))
        correlated = np.dot(L, uncorrelated)
        X, Y = correlated
        # Sample Pearson correlation of this pair
        rho, pval = stats.pearsonr(X, Y)
        rhos.append(rho)
    
    # Distribution of the 1000 sample correlation coefficients
    plt.hist(rhos)
    plt.show()
    


    As you can see, the sample correlation coefficients cluster around 0.2, but the correlation of any single sample will most likely not be exactly 0.2.
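
    To get closer to the interface described in the question, the steps above can be wrapped in a function that takes the number of samples, the number of variables, and a correlation matrix. This is only a sketch: the name rnorm_cor, the seed parameter, and the (num_samples, num_variables) output layout are assumptions made for illustration, not a port of the R function:

    ```python
    import numpy as np

    def rnorm_cor(num_samples, num_variables, corr, seed=None):
        """Hypothetical helper: standard normal samples with correlation matrix corr.

        Returns an array of shape (num_samples, num_variables).
        """
        corr = np.asarray(corr, dtype=float)
        if corr.shape != (num_variables, num_variables):
            raise ValueError("corr must be a num_variables x num_variables matrix")
        rng = np.random.RandomState(seed)
        # cholesky raises LinAlgError if corr is not positive definite
        L = np.linalg.cholesky(corr)
        uncorrelated = rng.standard_normal((num_variables, num_samples))
        # Transpose so that each row is one sample
        return np.dot(L, uncorrelated).T

    corr = [[1.0, 0.2], [0.2, 1.0]]
    samples = rnorm_cor(100000, 2, corr, seed=1)
    # rowvar=False: columns are variables, rows are observations
    print(np.corrcoef(samples, rowvar=False))
    ```

    NumPy also provides np.random.multivariate_normal(mean, cov, size), which draws from a multivariate normal distribution directly and may be simpler when no custom behavior is needed.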
