I want to calculate all-pairs document similarity using cosine similarity on TF-IDF features in Python. My basic approach is the following:

```python
from sklearn
```
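The snippet is cut off above; a minimal sketch of what that approach typically looks like (with `docs` standing in for the real corpus) is:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["first document ...", "second document ..."]  # placeholder corpus

vec = TfidfVectorizer()
X = vec.fit_transform(docs)          # sparse (n_docs, n_features) TF-IDF matrix
similarities = cosine_similarity(X)  # dense (n_docs, n_docs) matrix of all pairs
```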
You may want to look at the `random_projection` module in scikit-learn. The Johnson-Lindenstrauss lemma says that a random projection matrix is guaranteed to preserve pairwise distances up to some tolerance `eps`, which is the hyperparameter used when calculating the number of random projections needed.
To cut a long story short, the scikit-learn class `SparseRandomProjection` is a transformer that does this for you. If you run it on `X` after `vec.fit_transform`, you should see a fairly large reduction in feature size.
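A minimal sketch of how that might look (the function name, `eps=0.1`, and the random seed are illustrative choices, not from the question):

```python
from sklearn.random_projection import SparseRandomProjection

def reduce_dimensionality(X, eps=0.1, seed=42):
    """Project a wide (n_docs, n_features) TF-IDF matrix down to the
    dimensionality implied by the Johnson-Lindenstrauss bound for eps.

    With the default n_components='auto', scikit-learn derives the target
    dimensionality from eps and the number of rows in X.
    """
    srp = SparseRandomProjection(eps=eps, random_state=seed)
    return srp.fit_transform(X)  # output stays sparse (dense_output=False by default)

# Usage: X_small = reduce_dimensionality(X)   # X = vec.fit_transform(docs)
```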
The formula behind `sklearn.random_projection.johnson_lindenstrauss_min_dim` shows that to preserve distances up to a 10% tolerance, you only need `johnson_lindenstrauss_min_dim(350363, .1) = 10942` features. This is a conservative upper bound, so you may be able to get away with much less. Even a 1% tolerance would only need `johnson_lindenstrauss_min_dim(350363, .01) = 1028192` features, which is still significantly fewer than you have right now.
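To reproduce those numbers yourself (passing `eps` by keyword):

```python
from sklearn.random_projection import johnson_lindenstrauss_min_dim

# Minimum number of components needed to preserve pairwise distances
# for 350363 samples at a given tolerance (figures quoted above):
print(johnson_lindenstrauss_min_dim(350363, eps=0.10))  # ~10942
print(johnson_lindenstrauss_min_dim(350363, eps=0.01))  # ~1028192
```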
EDIT: A simple thing to try first: if your data is `dtype='float64'`, switch to `'float32'`. That alone halves the memory used for the stored values, and you usually do not need double precision for similarity scores.
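For example, with a stand-in sparse matrix:

```python
import numpy as np
from scipy import sparse

X64 = sparse.random(10_000, 50_000, density=0.001, dtype=np.float64)  # stand-in for the TF-IDF matrix
X32 = X64.astype(np.float32)

print(X64.data.nbytes, "->", X32.data.nbytes)  # memory for the nonzero values is halved
```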
If the issue is that you cannot store the "final matrix" in memory either, I would recommend working with the data in an `HDFStore` (pandas on top of PyTables). The pandas HDF5 documentation has some good starter code, and you could iteratively calculate chunks of your dot product and write them to disk. I have been using this extensively in a recent project on a 45 GB dataset, and can provide more help if you decide to go this route.
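A rough sketch of that pattern (the function name, chunk size, and file path are illustrative; `X_small` is assumed to be the reduced matrix from the random projection step):

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def similarity_to_hdf(X, path="similarity.h5", chunk_size=1000):
    """Compute the all-pairs cosine similarity matrix in row blocks and
    write each block to an HDF5 file, so the full matrix never has to
    fit in memory at once. Pick chunk_size so a dense
    (chunk_size, n_docs) float32 block fits comfortably in RAM."""
    n = X.shape[0]
    with pd.HDFStore(path, mode="w", complevel=5, complib="blosc") as store:
        for start in range(0, n, chunk_size):
            stop = min(start + chunk_size, n)
            # dense (chunk, n_docs) block of pairwise similarities
            block = cosine_similarity(X[start:stop], X).astype(np.float32)
            store.put(f"sim/rows_{start}_{stop}",
                      pd.DataFrame(block, index=range(start, stop)))

# Usage: similarity_to_hdf(X_small)
```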