Dynamically rename multiple columns in PySpark DataFrame


Question


I have a DataFrame in PySpark which has 15 columns.

The column names are id, name, emp.dno, emp.sal, state, emp.city, zip, ...

Now I want to replace the '.' in the column names with '_'.

For example, 'emp.dno' becomes 'emp_dno'.

I would like to do it dynamically.

How can I achieve that in PySpark?


Answer 1:


You can use something similar to this great solution from @zero323:

df.toDF(*(c.replace('.', '_') for c in df.columns))

Alternatively:

from pyspark.sql.functions import col

# map each dotted column name to its underscore version
replacements = {c: c.replace('.', '_') for c in df.columns if '.' in c}

# the backticks are needed so Spark treats a dotted name as a single
# top-level column instead of as nested-field access
df.select([col('`{}`'.format(c)).alias(replacements.get(c, c)) for c in df.columns])

The replacements dictionary would then look like:

{'emp.city': 'emp_city', 'emp.dno': 'emp_dno', 'emp.sal': 'emp_sal'}

UPDATE:

What if the DataFrame also has spaces in the column names? How do I replace both '.' and spaces with '_'?

import re

df.toDF(*(re.sub(r'[\.\s]+', '_', c) for c in df.columns))
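For instance, here is how that regex behaves on a few made-up column names (for illustration only):

import re

# made-up column names containing dots and spaces
for name in ['emp.dno', 'emp sal', 'emp. city']:
    print(re.sub(r'[\.\s]+', '_', name))
# emp_dno
# emp_sal
# emp_city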



Answer 2:


I wrote an easy & fast function for you to use. Enjoy! :)

def rename_cols(rename_df):
    # replace '.' with '_' in every column name, one rename at a time
    for column in rename_df.columns:
        new_column = column.replace('.', '_')
        rename_df = rename_df.withColumnRenamed(column, new_column)
    return rename_df
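A quick usage sketch, assuming a DataFrame df with the dotted column names from the question:

renamed_df = rename_cols(df)
renamed_df.printSchema()  # 'emp.dno' is now 'emp_dno', 'emp.sal' is 'emp_sal', etc.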



Answer 3:


The easiest way to do this is as follows:

Explanation:

  1. Get all the columns in the PySpark DataFrame using df.columns.
  2. Build a list by looping through each column from step 1.
  3. Each element of the list is col("`col.1`").alias("col_1"). Do this only for the required columns; replace() can handle any pattern, and you can also exclude a few columns from being renamed (see the sketch after the code below).
  4. *[list] unpacks the list for the select statement in PySpark.

from pyspark.sql import functions as F

# backticks escape the dots so Spark reads each name as one top-level column
(df
 .select(*[F.col('`{}`'.format(c)).alias(c.replace('.', '_')) for c in df.columns])
 .toPandas()
 .head())
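And if you want to exclude a few columns from being renamed (step 3 above), one possible sketch, using a hypothetical keep_as_is list:

from pyspark.sql import functions as F

keep_as_is = ['emp.sal']  # hypothetical: columns to leave untouched

df.select(*[
    F.col('`{}`'.format(c)).alias(c if c in keep_as_is else c.replace('.', '_'))
    for c in df.columns
])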

Hope this helps




Answer 4:


MaxU's answer is good and efficient. This post outlines another approach that's also efficient and helps keep your codebase clean (using the quinn library).

Suppose you have the following DataFrame:

+---+-----+--------+-------+
| id| name|emp.city|emp.sal|
+---+-----+--------+-------+
| 12|  bob|New York|     80|
| 99|alice| Atlanta|     90|
+---+-----+--------+-------+

Here's how you can replace the dots with underscores in all the columns.

import quinn

def dots_to_underscores(s):
    return s.replace('.', '_')

actual_df = df.transform(quinn.with_columns_renamed(dots_to_underscores))
actual_df.show()

Here's the resulting actual_df:

+---+-----+--------+-------+
| id| name|emp_city|emp_sal|
+---+-----+--------+-------+
| 12|  bob|New York|     80|
| 99|alice| Atlanta|     90|
+---+-----+--------+-------+

Let's use explain() to verify that this function is executing efficiently:

actual_df.explain(True)

Here are the query plans that are output:

== Parsed Logical Plan ==
'Project ['id AS id#50, 'name AS name#51, '`emp.city` AS emp_city#52, '`emp.sal` AS emp_sal#53]
+- LogicalRDD [id#29, name#30, emp.city#31, emp.sal#32], false

== Analyzed Logical Plan ==
id: string, name: string, emp_city: string, emp_sal: string
Project [id#29 AS id#50, name#30 AS name#51, emp.city#31 AS emp_city#52, emp.sal#32 AS emp_sal#53]
+- LogicalRDD [id#29, name#30, emp.city#31, emp.sal#32], false

== Optimized Logical Plan ==
Project [id#29, name#30, emp.city#31 AS emp_city#52, emp.sal#32 AS emp_sal#53]
+- LogicalRDD [id#29, name#30, emp.city#31, emp.sal#32], false

== Physical Plan ==
*(1) Project [id#29, name#30, emp.city#31 AS emp_city#52, emp.sal#32 AS emp_sal#53]

You can see that the parsed logical plan is almost identical to the physical plan, so the Catalyst optimizer doesn't need to do much optimization work. It converts id AS id#50 to id#29, but that's not much work.

The with_some_columns_renamed function generates an even more efficient parsed plan.

def dots_to_underscores(s):
    return s.replace('.', '_')

def change_col_name(s):
    return '.' in s

actual_df = df.transform(quinn.with_some_columns_renamed(dots_to_underscores, change_col_name))
actual_df.explain(True)

This parsed plan only aliases the columns with dots.

== Parsed Logical Plan ==
'Project [unresolvedalias('id, None), unresolvedalias('name, None), '`emp.city` AS emp_city#42, '`emp.sal` AS emp_sal#43]
+- LogicalRDD [id#34, name#35, emp.city#36, emp.sal#37], false

== Analyzed Logical Plan ==
id: string, name: string, emp_city: string, emp_sal: string
Project [id#34, name#35, emp.city#36 AS emp_city#42, emp.sal#37 AS emp_sal#43]
+- LogicalRDD [id#34, name#35, emp.city#36, emp.sal#37], false

== Optimized Logical Plan ==
Project [id#34, name#35, emp.city#36 AS emp_city#42, emp.sal#37 AS emp_sal#43]
+- LogicalRDD [id#34, name#35, emp.city#36, emp.sal#37], false

== Physical Plan ==
*(1) Project [id#34, name#35, emp.city#36 AS emp_city#42, emp.sal#37 AS emp_sal#43]

Read this blog post for more information on why looping over the DataFrame and calling withColumnRenamed multiple times creates overly complex parsed plans and should be avoided.
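To see this for yourself, here's a minimal sketch (assuming the same df as above) that does the rename with a withColumnRenamed loop and prints its plans; the parsed plan nests one Project node per call:

looped_df = df
for c in df.columns:
    # each withColumnRenamed call wraps the plan in another Project node
    looped_df = looped_df.withColumnRenamed(c, c.replace('.', '_'))
looped_df.explain(True)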



Source: https://stackoverflow.com/questions/41655158/dynamically-rename-multiple-columns-in-pyspark-dataframe
