How to convert RDD of dense vector into DataFrame in pyspark?

Asked 2020-12-09 20:22 by 离开以前

I have a DenseVector RDD like this:

>>> frequencyDenseVectors.collect()
[DenseVector(...), ...]
2 Answers
  • 死守一世寂寞 · 2020-12-09 20:58

    You cannot convert an RDD[Vector] directly. It has to be mapped to an RDD of objects that can be interpreted as structs, for example RDD[Tuple[Vector]]:

    frequencyDenseVectors.map(lambda x: (x, )).toDF(["rawfeatures"])
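
    A minimal end-to-end sketch, assuming an existing SparkSession named spark (the vector values here are made up for illustration):

    from pyspark.ml.linalg import DenseVector

    rdd = spark.sparkContext.parallelize([
        DenseVector([1.0, 2.0, 3.0]),
        DenseVector([4.0, 5.0, 6.0]),
    ])

    # Wrap each vector in a one-element tuple so each row becomes a struct
    df = rdd.map(lambda x: (x, )).toDF(["rawfeatures"])
    df.printSchema()
    # root
    #  |-- rawfeatures: vector (nullable = true)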
    

    Otherwise Spark will try to convert the object's __dict__ and use an unsupported NumPy array as a field.

    from pyspark.ml.linalg import DenseVector  
    from pyspark.sql.types import _infer_schema
    
    v = DenseVector([1, 2, 3])
    _infer_schema(v)
    
    TypeError                                 Traceback (most recent call last)
    ... 
    TypeError: not supported type: <class 'numpy.ndarray'>
    

    vs.

    _infer_schema((v, ))
    
    StructType(List(StructField(_1,VectorUDT,true)))
    

    Notes:

    • In Spark 2.0 you have to use the correct local types:

      • pyspark.ml.linalg when working with the DataFrame-based pyspark.ml API.
      • pyspark.mllib.linalg when working with the RDD-based pyspark.mllib API.

      These two namespaces are no longer compatible and require explicit conversions (for example How to convert from org.apache.spark.mllib.linalg.VectorUDT to ml.linalg.VectorUDT), as sketched below.
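
      A minimal conversion sketch, assuming a DataFrame df that already holds an old-style pyspark.mllib vector column named "features" (both names are made up for illustration):

      from pyspark.mllib.util import MLUtils

      # Rewrite every matching column from mllib VectorUDT to ml VectorUDT
      df_ml = MLUtils.convertVectorColumnsToML(df, "features")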

    • The code provided in the edit is not equivalent to the one from the original question. You should be aware that tuples and lists don't have the same semantics. If you map a vector to a pair, use a tuple and convert directly to a DataFrame:

      tfidf.rdd.map(
          lambda row: (row[0], DenseVector(row[1].toArray()))
      ).toDF()
      

      Using a tuple (product type) would work for a nested structure as well, but I doubt this is what you want:

      (tfidf.rdd
          .map(lambda row: (row[0], DenseVector(row[1].toArray())))
          .map(lambda x: (x, ))
          .toDF())
      

      A list at any place other than the top-level row is interpreted as an ArrayType.
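
      For example, checking with the same _infer_schema helper used above:

      from pyspark.sql.types import _infer_schema

      _infer_schema(([1.0, 2.0], ))
      # StructType(List(StructField(_1,ArrayType(DoubleType,true),true)))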

    • It is much cleaner to use a UDF for the conversion (Spark Python: Standard scaler error "Do not support ... SparseVector"); a sketch follows below.
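
      A minimal UDF sketch, assuming a DataFrame with an array<double> column named "values" (the DataFrame and column names here are made up for illustration):

      from pyspark.ml.linalg import Vectors, VectorUDT
      from pyspark.sql.functions import udf

      # Hypothetical helper: turn an array<double> column into an ml vector column
      to_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())

      df = spark.createDataFrame([(0, [1.0, 2.0, 3.0])], ["id", "values"])
      df.withColumn("features", to_vector("values")).printSchema()
      # root
      #  |-- id: long (nullable = true)
      #  |-- values: array (nullable = true)
      #  |    |-- element: double (containsNull = true)
      #  |-- features: vector (nullable = true)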
