Linear regression in NumPy with very large matrices - how to save memory?

时光取名叫无心 2021-02-06 16:42

So I have these ginormous matrices X and Y. X and Y both have 100 million rows, and X has 10 columns. I'm trying to implement linear regression with these matrices, and I nee

3 Answers
  •  我寻月下人不归
    2021-02-06 17:26

    X is 100e6 x 10 and Y is 100e6 x 1,

    so the final result (X^T*X)^-1 * X^T * Y is 10 x 1.

    You can calculate it with the following steps:

    1. calculate a = X^T*X -> 10 x 10
    2. calculate b = X^T*Y -> 10 x 1
    3. calculate a^-1 * b

    The matrices in step 3 are very small (10 x 10 and 10 x 1), so the only memory-heavy work is in steps 1 and 2, and both can be broken into smaller intermediate pieces.
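
    As a rough sketch of steps 1 to 3, the two small matrices can be accumulated over blocks of rows, so that only one block is in memory at a time. This splits by rows rather than by columns, which is a variation on the approach described here, not the only way to do it. The function name `normal_equations_chunked`, the chunk size, and the synthetic data are assumptions for illustration; Y is treated as a 1-D array, and in practice the chunks would come from disk.

```python
import numpy as np

def normal_equations_chunked(chunks, n_features):
    """Accumulate a = X^T X and b = X^T Y over row chunks, then solve a w = b."""
    a = np.zeros((n_features, n_features))  # step 1: 10 x 10 in the question
    b = np.zeros(n_features)                # step 2: 10 entries in the question
    for X_chunk, Y_chunk in chunks:
        a += X_chunk.T @ X_chunk
        b += X_chunk.T @ Y_chunk
    # step 3: solve the small system instead of forming the inverse explicitly
    return np.linalg.solve(a, b)

# Small synthetic stand-in for the real data (hypothetical sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
true_w = np.arange(1.0, 11.0)
Y = X @ true_w  # noise-free, so the fit should recover true_w

# Feed the data 100 rows at a time.
chunks = ((X[i:i + 100], Y[i:i + 100]) for i in range(0, 1000, 100))
w = normal_equations_chunked(chunks, n_features=10)
```

    Using `np.linalg.solve(a, b)` gives the same result as a^-1 * b for a well-conditioned `a`, without computing the inverse.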

    For example, you can read column 0 of X (call it X0) together with Y, and compute the first entry of X^T*Y with numpy.dot(X0, Y).

    For float64 dtype, X0 and Y together take about 1600 MB (100e6 values x 8 bytes x 2 arrays). If that can't fit in memory, you can call numpy.dot twice, once on the first halves of X0 and Y and once on the second halves, and add the two results.

    So computing X^T*Y takes 20 calls to numpy.dot (10 columns x 2 halves), and computing X^T*X takes 200 calls (10 x 10 column pairs x 2 halves).
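
    The column-by-column idea might look like the sketch below. The helper `dot_in_parts` is a hypothetical name, and X is held in memory here only as a small stand-in; with the real data each column would be loaded from disk as needed.

```python
import numpy as np

def dot_in_parts(u, v, n_parts=2):
    """Dot product of two long 1-D arrays, computed slice by slice and summed."""
    bounds = np.linspace(0, len(u), n_parts + 1, dtype=int)
    return sum(np.dot(u[s:e], v[s:e]) for s, e in zip(bounds[:-1], bounds[1:]))

# Small synthetic stand-in for the real data (hypothetical sizes).
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10))
Y = rng.normal(size=1000)
n_cols = X.shape[1]

# b = X^T Y: one column of X against Y at a time.
b = np.array([dot_in_parts(X[:, j], Y) for j in range(n_cols)])

# a = X^T X: one pair of columns at a time; a is symmetric,
# so each off-diagonal entry is computed once and mirrored.
a = np.empty((n_cols, n_cols))
for i in range(n_cols):
    for j in range(i, n_cols):
        a[i, j] = a[j, i] = dot_in_parts(X[:, i], X[:, j])
```

    Because X^T*X is symmetric, only 55 of its 100 entries are distinct, so with two halves per entry this needs 110 calls to numpy.dot rather than 200.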
