Dynamically rename multiple columns in PySpark DataFrame


I have a dataframe in pyspark which has 15 columns.

The column names are id, name, emp.dno, emp.sal, state


                      
4 answers

北恋
2020-12-14 05:15

You can use something similar to this great solution from @zero323:

df.toDF(*(c.replace('.', '_') for c in df.columns))


Alternatively:

from pyspark.sql.functions import col

replacements = {c: c.replace('.', '_') for c in df.columns if '.' in c}

# Backticks make Spark treat the dot as part of the name, not as struct field access.
df.select([col(f"`{c}`").alias(replacements.get(c, c)) for c in df.columns])


The replacement dictionary then would look like:

{'emp.city': 'emp_city', 'emp.dno': 'emp_dno', 'emp.sal': 'emp_sal'}
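
To make the dictionary approach concrete, here is a minimal sketch that builds the mapping from the column names in the question. The helper names build_replacements, quote and rename_with_select are made up for illustration, and the backtick quoting assumes the dotted names are flat columns rather than struct fields.

```python
def build_replacements(columns):
    """Map each dotted column name to its underscore version."""
    return {c: c.replace('.', '_') for c in columns if '.' in c}


def quote(name):
    """Wrap a column name in backticks so Spark reads the dot literally."""
    return f"`{name}`"


def rename_with_select(df):
    """Rename dotted columns in a single select (needs a real PySpark DataFrame)."""
    # Imported lazily so the two helpers above work without Spark installed.
    from pyspark.sql.functions import col

    replacements = build_replacements(df.columns)
    return df.select(
        [col(quote(c)).alias(replacements.get(c, c)) for c in df.columns]
    )


columns = ['id', 'name', 'emp.dno', 'emp.sal', 'state']
print(build_replacements(columns))
# {'emp.dno': 'emp_dno', 'emp.sal': 'emp_sal'}
```

Columns without a dot are absent from the dictionary, so replacements.get(c, c) leaves them under their original name.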


UPDATE:


  If I have a DataFrame with spaces in the column names as well, how do I replace
  both '.' and spaces with '_'?


import re

df.toDF(*(re.sub(r'[\.\s]+', '_', c) for c in df.columns))
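
As a quick check of what that pattern does, the sketch below applies it to a few made-up column names (the sample names are assumptions, not from the question). Inside a character class the dot needs no escaping, so [.\s]+ is equivalent to the pattern above, and the + collapses a run of dots and whitespace into a single underscore.

```python
import re


def normalize(name):
    """Replace each run of dots and/or whitespace with one underscore."""
    return re.sub(r'[.\s]+', '_', name)


samples = ['id', 'emp.dno', 'emp sal', 'home. city']
print([normalize(s) for s in samples])
# ['id', 'emp_dno', 'emp_sal', 'home_city']
```

With a real DataFrame the same function plugs into the call above: df.toDF(*(normalize(c) for c in df.columns)).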

                                                                        
                                                        
            
            
              
                
面向向阳花
2020-12-14 05:20

Wrote an easy & fast function for you to use. Enjoy! :)

def rename_cols(rename_df):
    for column in rename_df.columns:
        new_column = column.replace('.','_')
        rename_df = rename_df.withColumnRenamed(column, new_column)
    return rename_df
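
If you are on PySpark 3.4 or later, the same renaming can probably be done in one call instead of one call per column. This is a sketch under that version assumption: DataFrame.withColumnsRenamed takes a dict of old to new names, and the helper names dot_mapping and rename_cols_single_call are made up here.

```python
def dot_mapping(columns):
    """Old-name to new-name mapping for every column that contains a dot."""
    return {c: c.replace('.', '_') for c in columns if '.' in c}


def rename_cols_single_call(rename_df):
    """Rename all dotted columns with a single withColumnsRenamed call."""
    mapping = dot_mapping(rename_df.columns)
    if not mapping:
        # Nothing to rename, so return the DataFrame unchanged.
        return rename_df
    # Assumes PySpark >= 3.4, where withColumnsRenamed accepts a dict.
    return rename_df.withColumnsRenamed(mapping)


print(dot_mapping(['id', 'emp.dno', 'emp.sal']))
# {'emp.dno': 'emp_dno', 'emp.sal': 'emp_sal'}
```

On older versions the loop above still works; it just adds one rename step per column.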

                                                                        
                                                        
            
            
              
                
故里飘歌
2020-12-14 05:29

MaxU's answer is good and efficient.  This post outlines another approach that's also efficient and helps keep your codebase clean (using the quinn library).
Suppose you have the following DataFrame:
+---+-----+--------+-------+
| id| name|emp.city|emp.sal|
+---+-----+--------+-------+
| 12|  bob|New York|     80|
| 99|alice| Atlanta|     90|
+---+-----+--------+-------+

Here's how you can replace the dots with underscores in all the columns.

import quinn

def dots_to_underscores(s):
    return s.replace('.', '_')

actual_df = df.transform(quinn.with_columns_renamed(dots_to_underscores))
actual_df.show()

Here's the resulting actual_df:
+---+-----+--------+-------+
| id| name|emp_city|emp_sal|
+---+-----+--------+-------+
| 12|  bob|New York|     80|
| 99|alice| Atlanta|     90|
+---+-----+--------+-------+

Let's use explain() to verify that this function executes efficiently:

actual_df.explain(True)

Here are the logical plans that are output:
== Parsed Logical Plan ==
'Project ['id AS id#50, 'name AS name#51, '`emp.city` AS emp_city#52, '`emp.sal` AS emp_sal#53]
+- LogicalRDD [id#29, name#30, emp.city#31, emp.sal#32], false

== Analyzed Logical Plan ==
id: string, name: string, emp_city: string, emp_sal: string
Project [id#29 AS id#50, name#30 AS name#51, emp.city#31 AS emp_city#52, emp.sal#32 AS emp_sal#53]
+- LogicalRDD [id#29, name#30, emp.city#31, emp.sal#32], false

== Optimized Logical Plan ==
Project [id#29, name#30, emp.city#31 AS emp_city#52, emp.sal#32 AS emp_sal#53]
+- LogicalRDD [id#29, name#30, emp.city#31, emp.sal#32], false

== Physical Plan ==
*(1) Project [id#29, name#30, emp.city#31 AS emp_city#52, emp.sal#32 AS emp_sal#53]

You can see that the parsed logical plan is almost identical to the physical plan, so the Catalyst optimizer doesn't need to do much optimization work. It converts id AS id#50 to id#29, but that's not much work.

The with_some_columns_renamed function generates an even more efficient parsed plan.

def dots_to_underscores(s):
    return s.replace('.', '_')

def change_col_name(s):
    return '.' in s

actual_df = df.transform(quinn.with_some_columns_renamed(dots_to_underscores, change_col_name))
actual_df.explain(True)

This parsed plan only aliases the columns with dots.
== Parsed Logical Plan ==
'Project [unresolvedalias('id, None), unresolvedalias('name, None), '`emp.city` AS emp_city#42, '`emp.sal` AS emp_sal#43]
+- LogicalRDD [id#34, name#35, emp.city#36, emp.sal#37], false

== Analyzed Logical Plan ==
id: string, name: string, emp_city: string, emp_sal: string
Project [id#34, name#35, emp.city#36 AS emp_city#42, emp.sal#37 AS emp_sal#43]
+- LogicalRDD [id#34, name#35, emp.city#36, emp.sal#37], false

== Optimized Logical Plan ==
Project [id#34, name#35, emp.city#36 AS emp_city#42, emp.sal#37 AS emp_sal#43]
+- LogicalRDD [id#34, name#35, emp.city#36, emp.sal#37], false

== Physical Plan ==
*(1) Project [id#34, name#35, emp.city#36 AS emp_city#42, emp.sal#37 AS emp_sal#43]

Read this blog post for more information on why looping over the DataFrame and calling withColumnRenamed multiple times creates overly complex parsed plans and should be avoided.
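
If you'd rather not add a dependency, the idea behind the predicate-based rename can be sketched in plain PySpark. This is not quinn's actual implementation, only a hypothetical stand-in (plan_renames and with_some_columns_renamed_sketch are made-up names) that builds one select and aliases only the columns the predicate picks.

```python
def plan_renames(columns, rename, should_rename):
    """Return (source, target) pairs; only columns passing should_rename change."""
    return [(c, rename(c) if should_rename(c) else c) for c in columns]


def with_some_columns_renamed_sketch(rename, should_rename):
    """Build a function suitable for df.transform that issues a single select."""
    def inner(df):
        # Imported lazily so plan_renames can be tried without Spark installed.
        from pyspark.sql.functions import col

        selected = []
        for source, target in plan_renames(df.columns, rename, should_rename):
            column = col(f"`{source}`")  # backticks keep the dot literal
            # Alias only when the name actually changes.
            selected.append(column.alias(target) if source != target else column)
        return df.select(selected)
    return inner


pairs = plan_renames(
    ['id', 'name', 'emp.city', 'emp.sal'],
    lambda s: s.replace('.', '_'),
    lambda s: '.' in s,
)
print(pairs)
# [('id', 'id'), ('name', 'name'), ('emp.city', 'emp_city'), ('emp.sal', 'emp_sal')]
```

Usage would mirror the quinn call: df.transform(with_some_columns_renamed_sketch(dots_to_underscores, change_col_name)).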
                                                                        
                                                        
            
            
              
                
日久生厌
2020-12-14 05:31

The easiest way to do this is as follows:

Explanation:

1. Get all columns in the PySpark DataFrame using df.columns.
2. Build a list by looping through each column from step 1.
3. Each element of the list is an expression such as col("`col.1`").alias("col_1"), where the new name comes from c.replace('.', '_'). The backticks are needed because Spark otherwise reads the dot as struct field access. Do this only for the columns that need it. str.replace can substitute any literal substring, and you can also exclude some columns from being renamed.
4. *[list] unpacks the list into the arguments of select in PySpark.

from pyspark.sql import functions as F

(df
 .select(*[F.col(f"`{c}`").alias(c.replace('.', '_')) for c in df.columns])
 .toPandas().head()
)
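
To show the "exclude a few columns" part, here is a small sketch that swaps F.col for SQL expression strings passed to selectExpr, so the expressions can be inspected before touching Spark. The helper name build_select_exprs and the exclude parameter are made up for this example, and it assumes column names contain no backticks.

```python
def build_select_exprs(columns, exclude=()):
    """Build Spark SQL expressions that rename dotted columns, skipping `exclude`."""
    skip = set(exclude)
    exprs = []
    for c in columns:
        quoted = f"`{c}`"  # backticks keep Spark from reading the dot as struct access
        if '.' in c and c not in skip:
            exprs.append(f"{quoted} AS `{c.replace('.', '_')}`")
        else:
            # Excluded or dot-free columns pass through under their original name.
            exprs.append(quoted)
    return exprs


exprs = build_select_exprs(['id', 'emp.dno', 'emp.sal'], exclude=['emp.sal'])
print(exprs)
# ['`id`', '`emp.dno` AS `emp_dno`', '`emp.sal`']
```

With a real DataFrame you would then call df.selectExpr(*exprs).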

Hope this helps
                                                                        
                                                        
            
            
              
                