Generate correlated data in Python (3.3)

臣服心动 2020-12-28 09:51

In R there is a function (cm.rnorm.cor, from the package CreditMetrics) that takes the number of samples, the number of variables, and a correlation m

2 answers

粉色の甜心 (original poster)

2020-12-28 10:27

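If you Cholesky-decompose a covariance matrix C into L L^T and generate a random vector x whose components are independent with unit variance (for example, standard normal draws), then Lx is a random vector with covariance C, because cov(Lx) = L cov(x) L^T = L I L^T = C.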

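import numpy as np
import matplotlib.pyplot as plt
linalg = np.linalg
np.random.seed(1)

num_samples = 1000
num_variables = 2
# Target covariance matrix (must be symmetric positive definite)
cov = [[0.3, 0.2], [0.2, 0.2]]

# Lower-triangular factor L such that L L^T = cov
L = linalg.cholesky(cov)
# print(L.shape)
# (2, 2)
# Independent standard normal draws, one row per variable
uncorrelated = np.random.standard_normal((num_variables, num_samples))
mean = [1, 1]
# Apply L to induce the covariance, then shift each row by its mean
correlated = np.dot(L, uncorrelated) + np.array(mean).reshape(num_variables, 1)
# print(correlated.shape)
# (2, 1000)
plt.scatter(correlated[0, :], correlated[1, :], c='green')
plt.show()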

Reference: See Cholesky decomposition
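As a quick sanity check of the claim that Lx has covariance C, the sketch below compares the sample covariance of the transformed draws with the target matrix. The seed, the sample size, and the variable names are arbitrary choices made for illustration, not part of the original answer.

```python
import numpy as np

# Arbitrary seed and sample size, chosen only to make the check repeatable
rng = np.random.RandomState(0)
num_samples = 200000

cov = np.array([[0.3, 0.2], [0.2, 0.2]])
L = np.linalg.cholesky(cov)

# The factor should reproduce the target matrix: L L^T = cov
reconstruction_ok = np.allclose(np.dot(L, L.T), cov)

# Independent standard normal draws, transformed by L
uncorrelated = rng.standard_normal((2, num_samples))
correlated = np.dot(L, uncorrelated)

# np.cov treats each row as one variable
sample_cov = np.cov(correlated)
max_abs_error = np.abs(sample_cov - cov).max()

print(reconstruction_ok)
print(sample_cov)
print(max_abs_error)
```

With this many samples the entries of the sample covariance should land close to the target, although they will not match it exactly.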

If you want to generate two series, X and Y, with a particular (Pearson) correlation coefficient (e.g. 0.2):

rho = cov(X,Y) / sqrt(var(X)*var(Y))

you could choose the covariance matrix to be

cov = [[1, 0.2],
       [0.2, 1]]

This makes cov(X,Y) = 0.2 and sets both variances, var(X) and var(Y), to 1, so rho equals 0.2.
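The formula for rho can be evaluated directly from a covariance matrix. The sketch below uses a helper named correlation_from_cov, a hypothetical name introduced here for illustration, and applies it to both covariance matrices that appear in this answer.

```python
import numpy as np

def correlation_from_cov(cov):
    """Pearson correlation implied by a 2x2 covariance matrix.

    Hypothetical helper: rho = cov(X,Y) / sqrt(var(X) * var(Y)).
    """
    cov = np.asarray(cov, dtype=float)
    return cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

# Unit variances: the off-diagonal entry is the correlation itself
rho_unit = correlation_from_cov([[1.0, 0.2], [0.2, 1.0]])

# Covariance matrix from the first example, where the variances are not 1
rho_first = correlation_from_cov([[0.3, 0.2], [0.2, 0.2]])

print(rho_unit)
print(rho_first)
```

For the first example's matrix the implied correlation is 0.2 / sqrt(0.06), roughly 0.82, which is why the off-diagonal entry only equals rho when both variances are 1.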

For example, below we generate pairs of correlated series, X and Y, 1000 times. Then we plot a histogram of the correlation coefficients:

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
linalg = np.linalg
np.random.seed(1)

num_samples = 1000
num_variables = 2
# Unit variances, so the off-diagonal entry is the target correlation
cov = [[1.0, 0.2], [0.2, 1.0]]

L = linalg.cholesky(cov)

# Repeat the experiment 1000 times and record each sample correlation
rhos = []
for _ in range(1000):
    uncorrelated = np.random.standard_normal((num_variables, num_samples))
    correlated = np.dot(L, uncorrelated)
    X, Y = correlated
    rho, pval = stats.pearsonr(X, Y)
    rhos.append(rho)

plt.hist(rhos)
plt.show()

As the histogram shows, the sample correlation coefficients cluster around 0.2, but for any given sample the correlation will most likely not be exactly 0.2.
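The same approach extends to more than two variables. Below is a minimal sketch of a reusable function that takes a number of samples and a correlation matrix. The name correlated_normals, its signature, the seed, and the 3x3 matrix are all assumptions made for illustration; this is not a port of any existing library function.

```python
import numpy as np

def correlated_normals(num_samples, corr, rng=None):
    """Draw standard normal variables with the given correlation matrix.

    Hypothetical helper. Returns an array of shape
    (num_variables, num_samples), one row per variable.
    """
    if rng is None:
        rng = np.random.RandomState()
    corr = np.asarray(corr, dtype=float)
    # cholesky raises LinAlgError if corr is not positive definite
    L = np.linalg.cholesky(corr)
    uncorrelated = rng.standard_normal((corr.shape[0], num_samples))
    return np.dot(L, uncorrelated)

# Illustrative 3x3 correlation matrix (positive definite)
corr = [[1.0, 0.2, 0.5],
        [0.2, 1.0, 0.3],
        [0.5, 0.3, 1.0]]

samples = correlated_normals(50000, corr, rng=np.random.RandomState(1))
sample_corr = np.corrcoef(samples)
print(sample_corr)
```

Because every variable has unit variance here, the correlation matrix doubles as the covariance matrix. To get other variances or means, scale and shift each row afterwards.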
