I'm processing events using DataFrames converted from a stream of JSON events, which eventually get written out in Parquet format.
However, some of the JSON events contain spaces in their key names, and the Parquet write fails because characters such as ` ,;{}()\n\t=` are considered invalid in Parquet column names. How can I clean up these column names before writing?
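A minimal sketch of the failure (the column name and output path here are illustrative, not my real data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# a key containing a space, as it arrives from the JSON stream
df = spark.createDataFrame([("click", 1)], ["event name", "count"])

# fails with something like:
# AnalysisException: Attribute name "event name" contains invalid character(s) among " ,;{}()\n\t="
df.write.parquet("/tmp/events.parquet")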
Try using a regular expression to replace the bad symbols. Check my answer below.
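A minimal sketch of the idea (the example name is made up; the character class is the set of characters Parquet rejects):

import re

# replace every character Parquet rejects with an underscore
clean = re.sub(r"[ ,;{}()\n\t=]", "_", "event name (raw)")
# clean == "event_name__raw_"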
I had the same problem with column names containing spaces.
The first part of the solution was to put the names in backquotes.
The second part of the solution was to replace the spaces with underscores.
Sorry, but I only have the PySpark code ready:
from pyspark.sql import functions as F
df_tmp.select(*(F.col("`" + c + "`").alias(c.replace(' ', '_')) for c in df_tmp.columns))
For everyone experiencing this in pyspark: this even happened to me after renaming the columns. One way I managed to get it to work after some iterations is this:
file = "/opt/myfile.parquet"
df = spark.read.parquet(file)
# strip the spaces out of every column name
for c in df.columns:
    df = df.withColumnRenamed(c, c.replace(" ", ""))
# re-read the file, this time applying the sanitized schema from the start
df = spark.read.schema(df.schema).parquet(file)
For PySpark users who encounter this error, you can use this function to normalize column names before writing to a Parquet file:
import re
# invalid characters in parquet column names are replaced by _
def canonical(x): return re.sub("[ ,;{}()\n\t=]+", '_', x.lower())
renamed_cols = [canonical(c) for c in df.columns]
df = df.toDF(*renamed_cols)
If you want to remove accents from column names too, you can use:
import unicodedata
# strips accents from string value
def strip_accent(x): return unicodedata.normalize('NFKD', x).encode('ASCII', 'ignore').decode()
Putting it all together:
renamed_cols = [strip_accent(canonical(c)) for c in df.columns]
df = df.toDF(*renamed_cols)
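You can then verify the names and write the result out as usual (the output path here is just an example):

df.printSchema()  # all column names should now be lowercase ASCII with underscores
df.write.parquet("/opt/myfile_clean.parquet")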
This is my solution, using a regex to rename all of the DataFrame's columns following the Parquet convention:
df.columns.foldLeft(df) {
  case (currentDf, oldColumnName) =>
    currentDf.withColumnRenamed(oldColumnName, oldColumnName.replaceAll("[ ,;{}()\n\t=]", ""))
}
I hope it helps.
Use alias to change your field names to names without those special characters, for example:
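A minimal sketch in PySpark (the column name is made up; the backticks keep the name with a space treated as one quoted identifier):

from pyspark.sql import functions as F

df = df.select(F.col("`event name`").alias("event_name"))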