I'm processing events using DataFrames converted from a stream of JSON events, which eventually get written out in Parquet format.
However, some of the JSON events contain spaces in their key names, and the Parquet write fails because characters such as ` ,;{}()\n\t=` are considered invalid in Parquet column names. How can I clean up these column names before writing?
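A minimal sketch of the failure (the column name and output path here are illustrative, not my real data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# a key containing a space, as it arrives from the JSON stream
df = spark.createDataFrame([("click", 1)], ["event name", "count"])

# fails with something like:
# AnalysisException: Attribute name "event name" contains invalid character(s) among " ,;{}()\n\t="
df.write.parquet("/tmp/events.parquet")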
Try using a regular expression to replace the bad symbols. Check my answer below.
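A minimal sketch of the idea (the example name is made up; the character class is the set of characters Parquet rejects):

import re

# replace every character Parquet rejects with an underscore
clean = re.sub(r"[ ,;{}()\n\t=]", "_", "event name (raw)")
# clean == "event_name__raw_"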
I had the same problem with column names containing spaces.
The first part of the solution was to put the names in backquotes.
The second part of the solution was to replace the spaces with underscores.
Sorry, but I only have the PySpark code ready:
from pyspark.sql import functions as F
df_tmp.select(*(F.col("`" + c + "`").alias(c.replace(' ', '_')) for c in df_tmp.columns))
For everyone experiencing this in pyspark: this even happened to me after renaming the columns. One way I managed to get it to work after some iterations is this:
file = "/opt/myfile.parquet"
df = spark.read.parquet(file)
# strip the spaces out of every column name
for c in df.columns:
    df = df.withColumnRenamed(c, c.replace(" ", ""))
# re-read the file, this time applying the sanitized schema from the start
df = spark.read.schema(df.schema).parquet(file)
For PySpark users who encounter this error, you can use this function to normalize column names before writing to a Parquet file:
import re
# invalid characters in parquet column names are replaced by _
def canonical(x): return re.sub("[ ,;{}()\n\t=]+", '_', x.lower())
renamed_cols = [canonical(c) for c in df.columns]
df = df.toDF(*renamed_cols)
If you want to remove accents from column names too, you can use:
import unicodedata
# strips accents from string value
def strip_accent(x): return unicodedata.normalize('NFKD', x).encode('ASCII', 'ignore').decode()
Putting it all together:
renamed_cols = [strip_accent(canonical(c)) for c in df.columns]
df = df.toDF(*renamed_cols)
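You can then verify the names and write the result out as usual (the output path here is just an example):

df.printSchema()  # all column names should now be lowercase ASCII with underscores
df.write.parquet("/opt/myfile_clean.parquet")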
This is my solution, using a regex to rename all of the DataFrame's columns following the Parquet convention:
df.columns.foldLeft(df) {
  case (currentDf, oldColumnName) =>
    currentDf.withColumnRenamed(oldColumnName, oldColumnName.replaceAll("[ ,;{}()\n\t=]", ""))
}
I hope it helps.
Use alias to change your field names to names without those special characters, for example:
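A minimal sketch in PySpark (the column name is made up; the backticks keep the name with a space treated as one quoted identifier):

from pyspark.sql import functions as F

df = df.select(F.col("`event name`").alias("event_name"))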