Linear regression in NumPy with very large matrices - how to save memory?


时光取名叫无心 2021-02-06 16:42

So I have these ginormous matrices X and Y. X and Y both have 100 million rows, and X has 10 columns. I'm trying to implement linear regression with these matrices, and I nee

3 answers

我寻月下人不归 (original poster)

2021-02-06 17:26
X is 100e6 x 10 and Y is 100e6 x 1,

so the final result (X^T*X)^-1 * X^T * Y is only 10 x 1.

You can calculate it in the following steps:
1. calculate a = X^T*X -> 10 x 10
2. calculate b = X^T*Y -> 10 x 1
3. calculate a^-1 * b
The matrices in step 3 are very small, so the only real work is computing steps 1 and 2 in pieces that fit in memory.
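
The three steps can be sketched as below. This is a minimal sketch, not a definitive implementation: it accumulates a and b over blocks of rows (a variation on reading one column at a time), the function name fit_normal_equations and the chunk iterator are hypothetical, and it uses numpy.linalg.solve instead of forming a^-1 explicitly. The small random arrays are a stand-in for the real 100e6-row data.

```python
import numpy as np

def fit_normal_equations(chunks, n_features):
    """Accumulate a = X^T X and b = X^T Y over row blocks, then solve a * beta = b."""
    a = np.zeros((n_features, n_features))  # step 1 accumulator, 10 x 10 in the question
    b = np.zeros((n_features, 1))           # step 2 accumulator, 10 x 1 in the question
    for X_chunk, Y_chunk in chunks:
        a += X_chunk.T @ X_chunk
        b += X_chunk.T @ Y_chunk
    # Step 3: solving the small system avoids computing the inverse explicitly.
    return np.linalg.solve(a, b)

# Stand-in data (assumption): 1000 rows instead of 100 million, no noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
true_beta = np.arange(1.0, 11.0).reshape(10, 1)
Y = X @ true_beta

# Feed the data in blocks of 100 rows, so only one block is needed at a time.
chunks = ((X[i:i + 100], Y[i:i + 100]) for i in range(0, 1000, 100))
beta = fit_normal_equations(chunks, n_features=10)
```

With real data the chunks would come from disk (for example from a memory-mapped file) rather than from an in-memory array.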

For example, you can read column 0 of X (call it X0) together with Y, and compute one entry of X^T*Y with numpy.dot(X0, Y).

With float64 dtype, X0 and Y together take about 1600 MB (800 MB each). If that doesn't fit in memory, you can call numpy.dot twice, once on the first half and once on the second half of X0 and Y, and add the two results.
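
The memory figures can be checked with a few lines of arithmetic. This sketch only restates the sizes from the question (100e6 rows, 10 columns, float64); the variable names are made up for illustration.

```python
import numpy as np

n_rows = 100_000_000
n_cols = 10
itemsize = np.dtype(np.float64).itemsize  # 8 bytes per float64 value

bytes_per_column = n_rows * itemsize       # one column of X, or Y
bytes_x0_and_y = 2 * bytes_per_column      # X0 and Y held together
bytes_full_x = n_cols * bytes_per_column   # all of X at once

print(bytes_per_column / 1e6, "MB per column")
print(bytes_x0_and_y / 1e6, "MB for X0 and Y")
print(bytes_full_x / 1e9, "GB for all of X")
```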

So computing X^T*Y takes 20 calls to numpy.dot (10 columns x 2 halves), and computing X^T*X takes 200 calls (100 column pairs x 2 halves).
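
A sketch of the column-by-column approach with the call counts made explicit. The helper split_dot and the counter are hypothetical names, and the small in-memory X and Y stand in for columns that would really be read from disk one at a time.

```python
import numpy as np

counter = {"calls": 0}  # counts numpy.dot calls, for illustration only

def split_dot(u, v, n_parts=2):
    """Inner product of u and v, computed piecewise so only one part is used at a time."""
    total = 0.0
    for u_part, v_part in zip(np.array_split(u, n_parts), np.array_split(v, n_parts)):
        total += np.dot(u_part, v_part)
        counter["calls"] += 1
    return total

# Stand-in data (assumption): 200 rows instead of 100 million.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
Y = rng.normal(size=200)
k = X.shape[1]

# b = X^T*Y: one split dot product per column of X.
b = np.array([split_dot(X[:, i], Y) for i in range(k)])
calls_for_b = counter["calls"]

# a = X^T*X: one split dot product per pair of columns.
counter["calls"] = 0
a = np.array([[split_dot(X[:, i], X[:, j]) for j in range(k)] for i in range(k)])
calls_for_a = counter["calls"]
```

Since X^T*X is symmetric, only the 55 entries with i <= j actually need to be computed, which would cut the second count from 200 to 110.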