I have a CSV that is 100,000 rows x 27,000 columns and I am trying to run PCA on it to produce a 100,000 rows x 300 columns matrix. The CSV is 9 GB. Here is currently what
Try dividing your data up, or loading it into the script in batches, and fitting your PCA incrementally with IncrementalPCA, calling its partial_fit method on every batch.
from sklearn.decomposition import IncrementalPCA
import sys

import numpy as np
import pandas as pd

dataset = sys.argv[1]        # path to the CSV passed on the command line
chunksize_ = 5 * 25000       # rows read per batch
dimensions = 300

reader = pd.read_csv(dataset, sep=',', chunksize=chunksize_)
sklearn_pca = IncrementalPCA(n_components=dimensions)

# First pass: fit the PCA incrementally, one batch at a time
for chunk in reader:
    y = chunk.pop("Y")               # set the target column aside
    sklearn_pca.partial_fit(chunk)

# Per-feature mean computed during fitting
mean = sklearn_pca.mean_
# and standard deviation
stddev = np.sqrt(sklearn_pca.var_)
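After fitting you can also sanity-check how much variance the 300 components actually keep; IncrementalPCA exposes explained_variance_ratio_ once it has been fit. A minimal check, assuming the fit loop above has already run:

# Fraction of the total variance captured by the 300 kept components
print("variance retained:", sklearn_pca.explained_variance_ratio_.sum())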
# Second pass: project every batch onto the 300 components
Xtransformed = None
for chunk in pd.read_csv(dataset, sep=',', chunksize=chunksize_):
    y = chunk.pop("Y")
    Xchunk = sklearn_pca.transform(chunk)
    if Xtransformed is None:         # comparing an array to None with == is ambiguous; use `is None`
        Xtransformed = Xchunk
    else:
        Xtransformed = np.vstack((Xtransformed, Xchunk))
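Note that calling np.vstack inside the loop copies the whole accumulated array on every iteration. A sketch of an alternative second pass that collects the projected batches in a list and stacks them once at the end (the output filename "transformed.npy" is just a placeholder):

# Collect projected batches, stack once, and save the result to disk
chunks = []
for chunk in pd.read_csv(dataset, sep=',', chunksize=chunksize_):
    chunk.pop("Y")
    chunks.append(sklearn_pca.transform(chunk))
Xtransformed = np.vstack(chunks)
np.save("transformed.npy", Xtransformed)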
Useful link