I have a CSV that is 100,000 rows x 27,000 columns and I am trying to run PCA on it to produce a 100,000 rows x 300 columns matrix. The CSV is 9 GB. Here is currently what
Try dividing your data up, or loading it into the script in batches, and fitting your PCA incrementally with IncrementalPCA, calling its partial_fit method on every batch.
from sklearn.decomposition import IncrementalPCA
import sys

import numpy as np
import pandas as pd

dataset = sys.argv[1]        # path to the CSV passed on the command line
chunksize_ = 5 * 25000       # rows read per batch
dimensions = 300

reader = pd.read_csv(dataset, sep=',', chunksize=chunksize_)
sklearn_pca = IncrementalPCA(n_components=dimensions)

# First pass: fit the PCA incrementally, one batch at a time
for chunk in reader:
    y = chunk.pop("Y")               # set the target column aside
    sklearn_pca.partial_fit(chunk)

# Per-feature mean computed during fitting
mean = sklearn_pca.mean_
# and standard deviation
stddev = np.sqrt(sklearn_pca.var_)
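After fitting you can also sanity-check how much variance the 300 components actually keep; IncrementalPCA exposes explained_variance_ratio_ once it has been fit. A minimal check, assuming the fit loop above has already run:

# Fraction of the total variance captured by the 300 kept components
print("variance retained:", sklearn_pca.explained_variance_ratio_.sum())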
# Second pass: project every batch onto the 300 components
Xtransformed = None
for chunk in pd.read_csv(dataset, sep=',', chunksize=chunksize_):
    y = chunk.pop("Y")
    Xchunk = sklearn_pca.transform(chunk)
    if Xtransformed is None:         # comparing an array to None with == is ambiguous; use `is None`
        Xtransformed = Xchunk
    else:
        Xtransformed = np.vstack((Xtransformed, Xchunk))
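Note that calling np.vstack inside the loop copies the whole accumulated array on every iteration. A sketch of an alternative second pass that collects the projected batches in a list and stacks them once at the end (the output filename "transformed.npy" is just a placeholder):

# Collect projected batches, stack once, and save the result to disk
chunks = []
for chunk in pd.read_csv(dataset, sep=',', chunksize=chunksize_):
    chunk.pop("Y")
    chunks.append(sklearn_pca.transform(chunk))
Xtransformed = np.vstack(chunks)
np.save("transformed.npy", Xtransformed)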
Useful link