I have a dataframe in pyspark which has 15 columns.
The column name are id, name, emp.dno, emp.sal, state
You can use something similar to this great solution from @zero323:
df.toDF(*(c.replace('.', '_') for c in df.columns))
alternatively:
from pyspark.sql.functions import col
replacements = {c:c.replace('.','_') for c in df.columns if '.' in c}
df.select([col(c).alias(replacements.get(c, c)) for c in df.columns])
The replacement dictionary then would look like:
{'emp.city': 'emp_city', 'emp.dno': 'emp_dno', 'emp.sal': 'emp_sal'}
UPDATE:
if I have dataframe with space in column names also how do replace both
'.'and space with'_'
import re
df.toDF(*(re.sub(r'[\.\s]+', '_', c) for c in df.columns))
Wrote an easy & fast function for you to use. Enjoy! :)
def rename_cols(rename_df):
for column in rename_df.columns:
new_column = column.replace('.','_')
rename_df = rename_df.withColumnRenamed(column, new_column)
return rename_df
MaxU's answer is good and efficient. This post outlines another approach that's also efficient and helps keep your codebase clean (using the quinn library).
Suppose you have the following DataFrame:
+---+-----+--------+-------+
| id| name|emp.city|emp.sal|
+---+-----+--------+-------+
| 12| bob|New York| 80|
| 99|alice| Atlanta| 90|
+---+-----+--------+-------+
Here's how you can replace the dots with underscores in all the columns.
import quinn
def dots_to_underscores(s):
return s.replace('.', '_')
actual_df = df.transform(quinn.with_columns_renamed(dots_to_underscores))
actual_df.show()
Here's the resulting actual_df:
+---+-----+--------+-------+
| id| name|emp_city|emp_sal|
+---+-----+--------+-------+
| 12| bob|New York| 80|
| 99|alice| Atlanta| 90|
+---+-----+--------+-------+
Let's use explain() to verify that this function is executing efficiently:
actual_df.explain(True)
Here's the logical plans that are outputted:
== Parsed Logical Plan ==
'Project ['id AS id#50, 'name AS name#51, '`emp.city` AS emp_city#52, '`emp.sal` AS emp_sal#53]
+- LogicalRDD [id#29, name#30, emp.city#31, emp.sal#32], false
== Analyzed Logical Plan ==
id: string, name: string, emp_city: string, emp_sal: string
Project [id#29 AS id#50, name#30 AS name#51, emp.city#31 AS emp_city#52, emp.sal#32 AS emp_sal#53]
+- LogicalRDD [id#29, name#30, emp.city#31, emp.sal#32], false
== Optimized Logical Plan ==
Project [id#29, name#30, emp.city#31 AS emp_city#52, emp.sal#32 AS emp_sal#53]
+- LogicalRDD [id#29, name#30, emp.city#31, emp.sal#32], false
== Physical Plan ==
*(1) Project [id#29, name#30, emp.city#31 AS emp_city#52, emp.sal#32 AS emp_sal#53]
You can see that the parsed logical plan is almost identical to the physical plan, so the Catalyst optimizer doesn't need to do much optimization work. It's converting id AS id#50 to id#29, but that's not too much work.
The with_some_columns_renamed method generates an even more efficient parsed plan.
def dots_to_underscores(s):
return s.replace('.', '_')
def change_col_name(s):
return '.' in s
actual_df = df.transform(quinn.with_some_columns_renamed(dots_to_underscores, change_col_name))
actual_df.explain(True)
This parsed plan only aliases the columns with dots.
== Parsed Logical Plan ==
'Project [unresolvedalias('id, None), unresolvedalias('name, None), '`emp.city` AS emp_city#42, '`emp.sal` AS emp_sal#43]
+- LogicalRDD [id#34, name#35, emp.city#36, emp.sal#37], false
== Analyzed Logical Plan ==
id: string, name: string, emp_city: string, emp_sal: string
Project [id#34, name#35, emp.city#36 AS emp_city#42, emp.sal#37 AS emp_sal#43]
+- LogicalRDD [id#34, name#35, emp.city#36, emp.sal#37], false
== Optimized Logical Plan ==
Project [id#34, name#35, emp.city#36 AS emp_city#42, emp.sal#37 AS emp_sal#43]
+- LogicalRDD [id#34, name#35, emp.city#36, emp.sal#37], false
== Physical Plan ==
*(1) Project [id#34, name#35, emp.city#36 AS emp_city#42, emp.sal#37 AS emp_sal#43]
Read this blog post for more information why looping over the DataFrame and calling withColumnRenamed multiple times creates overly complex parsed plans and should be avoided.
Easiest way to do this is as follows:
Explanation:
from pyspark.sql import functions as F
(df
.select(*[F.col(c).alias(c.replace('.',"_")) for c in df.columns])
.toPandas().head()
)
Hope this helps