Python: computing pariwise distances causes memory error

做~自己de王妃 提交于 2020-01-02 09:36:45

问题


I want to compute the pairwise distances of 57832 vectors. Each vector has 200 dimensions. I am using pdist to compute the distances.

from scipy.spatial.distance import pdist
pairwise_distances = pdist(X, 'cosine')
# pdist is supposed to return a numpy array with shape (57832*57831,).

However, this causes a memory error.

   Traceback (most recent call last):
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/main.py", line 101, in <module>
    result_clustering = clf_clustering.getCVResult(shuffle)
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 158, in getCVResult
    self.centroids_of_categories(X_train, y_train)
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 103, in centroids_of_categories
    cat_centroids.append( self.dpc.centroids(X_in_this_cat) )
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 23, in centroids
    distance_dict, rho_dict = self.compute_distances_and_rhos(X)
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 59, in compute_distances_and_rhos
    pairwise_distances = pdist(X, 'cosine')
  File "/usr/local/lib/python2.7/dist-packages/scipy/spatial/distance.py", line 1185, in pdist
    dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
MemoryError

The RAM of my laptop is 16GB. How should I fix it? Or is there any better way?


回答1:


Doing matrix-based algorithms on large data sets is prohibitive.

The memory requirements are straightforward to estimate. Even with exploiting symmetry, many implementations will max out at about 65000 instances. But even 64 bit implementations and big machines will eventually give up. A 1000000x1000000 matrix with double precision and exploiting symmetry needs 4 TB of RAM.

Use better algorithms that don't need O(n^2) memory and runtime.



来源:https://stackoverflow.com/questions/31320409/python-computing-pariwise-distances-causes-memory-error

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!