Spark Dataframe validating column names for parquet writes (scala)

I'm processing events using Dataframes converted from a stream of JSON events which eventually gets written out as as Parquet format.

However, some of the JSON events contains spaces in the keys which I want to log and filter/drop such events from the data frame before converting it to Parquet because ,;{}()\n\t= are considered special characters in Parquet schema (CatalystSchemaConverter) as listed in [1] below and thus should not be allowed in the column names.

How can I do such validations in Dataframe on the column names and drop such an event altogether without erroring out the Spark Streaming job.

[1] Spark's CatalystSchemaConverter

def checkFieldName(name: String): Unit = {
    // ,;{}()\n\t= and space are special characters in Parquet schema
    checkConversionRequirement(
      !name.matches(".*[ ,;{}()\n\t=].*"),
      s"""Attribute name "$name" contains invalid character(s) among " ,;{}()\\n\\t=".
         |Please use alias to rename it.
       """.stripMargin.split("\n").mkString(" ").trim)
  }

I had the same problem with column names containing spaces.
The first part of the solution was to put the names in backquotes.
The second part of the solution was to replace the spaces with underscores.

Sorry but I have only the pyspark code ready:

from pyspark.sql import functions as F

df_tmp.select(*(F.col("`" + c+ "`").alias(c.replace(' ', '_')) for c in df_tmp.columns)

For everyone experiencing this in pyspark: this even happened to me after renaming the columns. One way I could get this to work after some iterations is this:

file = "/opt/myfile.parquet"
df = spark.read.parquet(file)
for c in df.columns:
    df = df.withColumnRenamed(c, c.replace(" ", ""))

df = spark.read.schema(df.schema).parquet(file)

Using alias to change your field names without those special characters.

Try to use regular expression for replacing of bad symbols. Check my answer.

来源：https://stackoverflow.com/questions/38191157/spark-dataframe-validating-column-names-for-parquet-writes-scala

标签

apache-spark

apache-spark-sql

spark-streaming

spark-dataframe

parquet