Spark Dataframe validating column names for parquet writes (scala)

纵饮孤独 提交于 2019-11-27 06:54:33

问题


I'm processing events using Dataframes converted from a stream of JSON events which eventually gets written out as as Parquet format.

However, some of the JSON events contains spaces in the keys which I want to log and filter/drop such events from the data frame before converting it to Parquet because ,;{}()\n\t= are considered special characters in Parquet schema (CatalystSchemaConverter) as listed in [1] below and thus should not be allowed in the column names.

How can I do such validations in Dataframe on the column names and drop such an event altogether without erroring out the Spark Streaming job.

[1] Spark's CatalystSchemaConverter

def checkFieldName(name: String): Unit = {
    // ,;{}()\n\t= and space are special characters in Parquet schema
    checkConversionRequirement(
      !name.matches(".*[ ,;{}()\n\t=].*"),
      s"""Attribute name "$name" contains invalid character(s) among " ,;{}()\\n\\t=".
         |Please use alias to rename it.
       """.stripMargin.split("\n").mkString(" ").trim)
  }

回答1:


For everyone experiencing this in pyspark: this even happened to me after renaming the columns. One way I could get this to work after some iterations is this:

file = "/opt/myfile.parquet"
df = spark.read.parquet(file)
for c in df.columns:
    df = df.withColumnRenamed(c, c.replace(" ", ""))

df = spark.read.schema(df.schema).parquet(file)




回答2:


I had the same problem with column names containing spaces.
The first part of the solution was to put the names in backquotes.
The second part of the solution was to replace the spaces with underscores.

Sorry but I have only the pyspark code ready:

from pyspark.sql import functions as F

df_tmp.select(*(F.col("`" + c+ "`").alias(c.replace(' ', '_')) for c in df_tmp.columns)



回答3:


Using alias to change your field names without those special characters.




回答4:


Try to use regular expression for replacing of bad symbols. Check my answer.



来源:https://stackoverflow.com/questions/38191157/spark-dataframe-validating-column-names-for-parquet-writes-scala

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!