pdist for theano tensor


花落未央 2020-12-17 02:08

I have a theano symbolic matrix

x = T.fmatrix('input')

x will later be populated with n vectors of dimension d.

2 answers

心在旅途 (original poster) 2020-12-17 02:51

SciPy's pdist covers a whole family of distance metrics, and Theano has no single equivalent for all of them. However, each specific distance is a closed-form mathematical expression, so it can be written down as a Theano expression and then compiled.

Take as an example the Minkowski p-norm distance (copy-and-pasteable):

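import theano
import theano.tensor as T

X = T.fmatrix('X')  # shape (nX, d)
Y = T.fmatrix('Y')  # shape (nY, d)
P = T.scalar('P')   # Minkowski exponent p
# Broadcast (nX, 1, d) against (1, nY, d) to get all pairwise row differences
translation_vectors = X.reshape((X.shape[0], 1, -1)) - Y.reshape((1, Y.shape[0], -1))
# Sum |difference| ** p over the feature axis, then take the p-th root
minkowski_distances = (abs(translation_vectors) ** P).sum(2) ** (1. / P)
f_minkowski = theano.function([X, Y, P], minkowski_distances)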
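
For readers who want to inspect the intermediate shapes without compiling anything, the same broadcasting trick can be sketched in plain NumPy. This is only an illustration, assuming NumPy broadcasting behaves like the symbolic reshape above; the name minkowski_cdist and the toy inputs are made up for this example and are not part of the answer's code.

```python
import numpy as np

def minkowski_cdist(x, y, p):
    """Pairwise Minkowski distances between rows of x (nX, d) and y (nY, d).

    Hypothetical helper name, used only for this illustration.
    """
    # Broadcasting (nX, 1, d) against (1, nY, d) yields every row difference
    # in a single (nX, nY, d) array.
    diff = x[:, np.newaxis, :] - y[np.newaxis, :, :]
    # Sum |diff| ** p over the feature axis, then take the p-th root.
    return (np.abs(diff) ** p).sum(axis=2) ** (1.0 / p)

# Toy inputs chosen so the distances are easy to check by hand.
x = np.array([[0.0, 0.0], [3.0, 4.0]])
y = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]])

dist_l1 = minkowski_cdist(x, y, 1.0)  # Manhattan distances
dist_l2 = minkowski_cdist(x, y, 2.0)  # Euclidean distances
print(dist_l2.shape)
```

The result has one row per row of x and one column per row of y, which is the same layout f_minkowski returns.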


Note that Python's built-in abs calls the tensor's __abs__ method, so abs(...) here also builds a symbolic Theano expression. We can now compare this to pdist:

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.RandomState(42)
d = 20 # dimension
nX = 10
nY = 30
x = rng.randn(nX, d).astype(np.float32)
y = rng.randn(nY, d).astype(np.float32)

ps = [1., 3., 2.]

for p in ps:
    d_theano = f_minkowski(x, x, p)[np.triu_indices(nX, 1)]
    d_scipy = pdist(x, p=p, metric='minkowski')
    print("Testing p=%1.2f, discrepancy %1.3e" % (p, np.sqrt(((d_theano - d_scipy) ** 2).sum())))


This yields

Testing p=1.00, discrepancy 1.322e-06
Testing p=3.00, discrepancy 4.277e-07
Testing p=2.00, discrepancy 4.789e-07


As you can see, the results agree, but f_minkowski is slightly more general, since it compares the rows of two possibly different arrays. If the same array is passed twice, f_minkowski returns the full square distance matrix, whereas pdist returns a condensed vector that contains each pair only once. If that behaviour is desired, it can also be implemented fully dynamically, but I will stick to the general case here.
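
To make the relation between the square matrix and the condensed form concrete, here is a small NumPy sketch. It relies on the same np.triu_indices selection that the comparison loop above uses; the helper name condensed_from_square and the sample points are assumptions introduced only for this illustration.

```python
import numpy as np

def condensed_from_square(dist):
    """Keep each unordered pair (i, j) with i < j exactly once.

    Hypothetical helper: selects the strict upper triangle in row-major
    order, i.e. pairs (0,1), (0,2), ..., (1,2), ...
    """
    n = dist.shape[0]
    rows, cols = np.triu_indices(n, k=1)
    return dist[rows, cols]

# Three collinear points, so the distances are easy to verify by hand.
x = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
diff = x[:, np.newaxis, :] - x[np.newaxis, :, :]
square = np.sqrt((diff ** 2).sum(axis=2))   # full (3, 3) symmetric matrix
condensed = condensed_from_square(square)   # n * (n - 1) / 2 = 3 entries
print(condensed)
```

The square matrix stores every distance twice plus a zero diagonal, so the condensed vector needs a bit less than half the memory.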

One possible specialization should be noted, though. The general Minkowski distance, as implemented above, builds a 3D array of shape (nX, nY, d) in order to avoid for-loops and cumulative summation, which can be prohibitive in memory depending on d, nX and nY. For p=2, expanding the square with the binomial formula, ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, simplifies the calculation and avoids that array, so we can write

squared_euclidean_distances = (X ** 2).sum(1).reshape((X.shape[0], 1)) + (Y ** 2).sum(1).reshape((1, Y.shape[0])) - 2 * X.dot(Y.T)
f_euclidean = theano.function([X, Y], T.sqrt(squared_euclidean_distances))


which only uses O(nX * nY) space instead of O(nX * nY * d). We check for correspondence, this time on the general problem:

d_eucl = f_euclidean(x, y)
d_minkowski2 = f_minkowski(x, y, 2.)
print("Comparing f_minkowski, p=2 and f_euclidean: l2-discrepancy %1.3e" % ((d_eucl - d_minkowski2) ** 2).sum())


yielding

Comparing f_minkowski, p=2 and f_euclidean: l2-discrepancy 1.464e-11
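
One caveat with the expanded form: in floating point, the sum of squared norms minus twice the dot product can come out slightly negative when two rows are identical or nearly so, and the square root of that would be NaN. The NumPy sketch below clamps at zero before taking the root. The clamp is my addition and not part of the code above, and euclidean_cdist is a made-up name; in Theano, T.maximum should play the same role, though I have not tested that here.

```python
import numpy as np

def euclidean_cdist(x, y):
    """Pairwise Euclidean distances via ||x||^2 + ||y||^2 - 2 x.y.

    Hypothetical helper for illustration. Only (nX, nY)-sized arrays are
    created; no (nX, nY, d) intermediate is needed.
    """
    x_sq = (x ** 2).sum(axis=1)[:, np.newaxis]   # (nX, 1)
    y_sq = (y ** 2).sum(axis=1)[np.newaxis, :]   # (1, nY)
    squared = x_sq + y_sq - 2.0 * x.dot(y.T)     # (nX, nY)
    # Rounding can push tiny values below zero; clamp before the sqrt.
    return np.sqrt(np.maximum(squared, 0.0))

rng = np.random.RandomState(0)
x = rng.randn(5, 4).astype(np.float32)
y = rng.randn(7, 4).astype(np.float32)

d_fast = euclidean_cdist(x, y)
# Reference result through the explicit 3D difference array.
d_ref = np.sqrt(((x[:, np.newaxis, :] - y[np.newaxis, :, :]) ** 2).sum(axis=2))
max_gap = np.abs(d_fast - d_ref).max()

# Self-distances: the diagonal is where negative rounding errors can appear.
d_self = euclidean_cdist(x, x)
print(max_gap, np.isnan(d_self).any())
```

Note that the clamped diagonal of d_self is not guaranteed to be exactly zero, only free of NaN; if exact zeros matter, they have to be set explicitly.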

    
             
                                                        
            
            
              
                