Spark DataFrame: collect() vs select()

情话喂你 2020-12-13 06:33

Calling collect() on an RDD returns the entire dataset to the driver, which can cause out-of-memory errors, so it should be avoided.

Will collect()

6 answers
  •  自闭症患者
    2020-12-13 07:10

    Actions vs Transformations

    • Collect (Action) - Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

    spark-sql doc

    select(*cols) (transformation) - Projects a set of expressions and returns a new DataFrame.

    Parameters: cols – list of column names (string) or expressions (Column). If one of the column names is ‘*’, that column is expanded to include all columns in the current DataFrame.

    >>> df.select('*').collect()
    [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
    >>> df.select('name', 'age').collect()
    [Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
    >>> df.select(df.name, (df.age + 10).alias('age')).collect()
    [Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]
    

    Executing the select(column-name1, column-name2, etc.) method on a DataFrame returns a new DataFrame that holds only the columns selected in the select() call.

    e.g. assuming df has several columns, including "name" and "value", among others:

    df2 = df.select("name","value")
    

    df2 will hold only those two columns ("name" and "value") out of all the columns of df.

    df2, being the result of select(), remains distributed across the executors rather than being brought to the driver (as happens when you use collect()).

    sql-programming-guide

    df.printSchema()
    # root
    # |-- age: long (nullable = true)
    # |-- name: string (nullable = true)
    
    # Select only the "name" column
    df.select("name").show()
    # +-------+
    # |   name|
    # +-------+
    # |Michael|
    # |   Andy|
    # | Justin|
    # +-------+
    

    You can also run collect() on a DataFrame (spark docs):

    >>> l = [('Alice', 1)]
    >>> spark.createDataFrame(l).collect()
    [Row(_1=u'Alice', _2=1)]
    >>> spark.createDataFrame(l, ['name', 'age']).collect()
    [Row(name=u'Alice', age=1)]
    

    spark docs

    To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use take(): rdd.take(100).foreach(println).
