I’m trying to generate simulated student grades in 4 subjects, where a student record is a single row of data. The code shown here will generate normally distributed random numbers with a mean of 60 and a standard deviation of 15.
df = pd.DataFrame(15 * np.random.randn(5, 4) + 60, columns=['Math', 'Science', 'History', 'Art'])
What I can’t figure out is how to make it so that a student’s Science mark is highly correlated to their Math mark, and that their History and Art marks are less so, but still somewhat correlated to the Math mark.
I’m neither a statistician or an expert programmer, so a less sophisticated but more easily understood solution is what I’m hoping for.
Let's put what has been suggested by @Daniel into code.
Step 1
Let's import multivariate_normal
:
import numpy as np
from scipy.stats import multivariate_normal as mvn
Step 2
Let's construct covariance data and generate data:
cov = np.array([[1, 0.8,.7, .6],[.8,1.,.5,.5],[0.7,.5,1.,.5],[0.6,.5,.5,1]])
cov
array([[ 1. , 0.8, 0.7, 0.6],
[ 0.8, 1. , 0.5, 0.5],
[ 0.7, 0.5, 1. , 0.5],
[ 0.6, 0.5, 0.5, 1. ]])
This is the key step. Note, that covariance matrix has 1's
in diagonal, and the covariances decrease as you step from left to right.
Now we are ready to generate data, let's sat 1'000 points:
scores = mvn.rvs(mean = [60.,60.,60.,60.], cov=cov, size = 1000)
Sanity check (from covariance matrix to simple correlations):
np.corrcoef(scores.T):
array([[ 1. , 0.78886583, 0.70198586, 0.56810058],
[ 0.78886583, 1. , 0.49187904, 0.45994833],
[ 0.70198586, 0.49187904, 1. , 0.4755558 ],
[ 0.56810058, 0.45994833, 0.4755558 , 1. ]])
Note, that np.corrcoef
expects your data in rows.
Finally, let's put your data into Pandas' DataFrame
:
df = pd.DataFrame(data = scores, columns = ["Math", "Science","History", "Art"])
df.head()
Math Science History Art
0 60.629673 61.238697 61.805788 61.848049
1 59.728172 60.095608 61.139197 61.610891
2 61.205913 60.812307 60.822623 59.497453
3 60.581532 62.163044 59.277956 60.992206
4 61.408262 59.894078 61.154003 61.730079
Step 3
Let's visualize some data that we've just generated:
ax = df.plot(x = "Math",y="Art", kind="scatter", color = "r", alpha = .5, label = "Art, $corr_{Math}$ = .6")
df.plot(x = "Math",y="Science", kind="scatter", ax = ax, color = "b", alpha = .2, label = "Science, $corr_{Math}$ = .8")
ax.set_ylabel("Art and Science");
Thank you guys for the responses; they were extremely useful. I adapted the code provided by Sergey to produce the result I was looking for, which was records with Math and Science marks that are relatively close most of the time, and History and Art marks that are more independent.
The following produced data that looks reasonable:
cov = np.array([[1, 0.5,.2, .1],[.5,1.,.1,.1],[0.2,.1,1,.3],[0.1,.1,.3,1]])
scores = mvn.rvs(mean = [0.,0.,0.,0.], cov=cov, size = 100)
df = pd.DataFrame(data = 15 * scores + 60, columns = ["Math","Science","History", "Art"])
df.head(10)
The next step would be to make it so that each subject has a different mean, but I have an idea of how to do that. Thanks again.
The statistical tool for that is the covariance matrix: https://en.wikipedia.org/wiki/Covariance. Each cell (i,j) is representing the dependecy between the variable i and the variable j, so in your case it can be between math and science. If there is no dependency the value would be 0.
What you did was assuming that the covariance was a diagonal matrix with the same values on the diagonal. So what you have to do is defines your covariance matrix and afterwards draw the samples from a gaussian with numpy.random.multivariate_normal
https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.multivariate_normal.html or any other distribution functions.
来源:https://stackoverflow.com/questions/45952895/generating-correlated-numbers-in-numpy-pandas