Exporting nested fields with invalid characters from Spark 2 to Parquet [duplicate]


Question


I am trying to use Spark 2.0.2 to convert a JSON file into Parquet.

  • The JSON file comes from an external source, therefore the schema can't be changed before it arrives.
  • The file contains a map of attributes. The attribute names aren't known before I receive the file.
  • The attribute names contain characters that can't be used in Parquet column names.
{
    "id" : 1,
    "name" : "test",
    "attributes" : {
        "name=attribute" : 10,
        "name=attribute with space" : 100,
        "name=something else" : 10
    }
}
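
For reference, a minimal reproduction sketch (the paths and the spark session are assumptions, not from the original post):

// Schema inference picks up "name=attribute" etc. as struct field names;
// the write then fails with the AnalysisException shown below.
val df = spark.read.json("/tmp/input.json")
df.write.parquet("/tmp/output.parquet")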

Neither the space nor the equals character can be used in a Parquet column name; I get the following error:

 org.apache.spark.sql.AnalysisException: Attribute name "name=attribute" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.; 
  • As these are nested fields, I can't rename them using an alias. Is this true?
  • I have tried renaming the fields within the schema as suggested here: How to rename fields in a DataFrame corresponding to nested JSON. This works for some files; however, I now get the following StackOverflowError:
java.lang.StackOverflowError 

at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:65) 
at org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:258) 
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1563) 
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1579) 
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1578) 
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1578) 
at scala.collection.immutable.List.foreach(List.scala:381) 
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1578) 
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1576) 
at scala.collection.immutable.List.foreach(List.scala:381) 
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1576) 
... 
(the getPreferredLocsInternal and List.foreach frames above repeat until the stack overflows)

I want to do one of the following:

  • Strip invalid characters from the field names as I load the data into Spark
  • Change the column names in the schema without causing stack overflows
  • Somehow change the schema so that it loads the original data but internally uses the following key/value representation (a schema sketch follows the example):
{
    "id" : 1,
    "name" : "test",
    "attributes" : [
        {"key":"name=attribute", "value" : 10},
        {"key":"name=attribute with space", "value"  : 100},
        {"key":"name=something else", "value" : 10}
    ]
}
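
A minimal sketch of that third option in Scala: reading the original JSON with an explicit schema whose attributes field is a map, so the problem names become map keys (data) rather than column names. The path and the spark session here are assumptions:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", LongType, nullable = true),
  StructField("name", StringType, nullable = true),
  // Keys such as "name=attribute" land in the map as data, which Parquet accepts:
  StructField("attributes", MapType(StringType, LongType, valueContainsNull = true), nullable = true)
))
val df = spark.read.schema(schema).json("/tmp/input.json")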

Answer 1:


I solved the problem this way:

df.toDF(df
    .schema
    .fieldNames
    .map(name => "[ ,;{}()\\n\\t=]+".r.replaceAllIn(name, "_")): _*)

where I replaced every invalid character with "_".
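
Note that toDF only renames top-level columns. For nested struct fields, a sketch of a recursive variant (an untested assumption, reusing the same replacement regex) might look like:

import org.apache.spark.sql.types._

// Rewrite every field name in the schema, descending into structs, arrays and maps:
def sanitize(dt: DataType): DataType = dt match {
  case st: StructType => StructType(st.fields.map(f =>
    f.copy(name = "[ ,;{}()\\n\\t=]+".r.replaceAllIn(f.name, "_"),
           dataType = sanitize(f.dataType))))
  case at: ArrayType => at.copy(elementType = sanitize(at.elementType))
  case mt: MapType   => mt.copy(valueType = sanitize(mt.valueType))
  case other         => other
}

// The row data is unchanged, so the cleaned schema can simply be reapplied:
val cleaned = spark.createDataFrame(df.rdd, sanitize(df.schema).asInstanceOf[StructType])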




Answer 2:


The only solution I have found to work, so far, is to reload the data with a modified schema. The new schema will load the attributes into a map.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.MapType;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// First pass infers the schema; the "attributes" struct is then rewritten as
// a map and the file is re-read with the modified schema. `sql` is the
// SparkSession and `path` the location of the JSON file.
Dataset<Row> newData = sql.read().json(path);
StructType newSchema = (StructType) toMapType(newData.schema(), null, "attributes");
newData = sql.read().schema(newSchema).json(path);

// Walk the schema, tracking the full dotted column name of each field, until
// the target column is found.
private DataType toMapType(DataType dataType, String fullColName, String col) {
    if (dataType instanceof StructType) {
        StructType structType = (StructType) dataType;
        List<StructField> renamed = Arrays.stream(structType.fields())
            .map(f -> toMapType(f, fullColName == null ? f.name() : fullColName + "." + f.name(), col))
            .collect(Collectors.toList());
        return new StructType(renamed.toArray(new StructField[renamed.size()]));
    }
    return dataType;
}

// Replace the target field's struct type with a string-to-long map; recurse
// into structs that lie on the path to the target column.
private StructField toMapType(StructField structField, String fullColName, String col) {
    if (fullColName.equals(col)) {
        return new StructField(col, new MapType(DataTypes.StringType, DataTypes.LongType, true), true, Metadata.empty());
    } else if (col.startsWith(fullColName)) {
        return new StructField(structField.name(), toMapType(structField.dataType(), fullColName, col), structField.nullable(), structField.metadata());
    }
    return structField;
}
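
With attributes loaded as a map, the attribute names are values rather than column names, so Parquet accepts them. A follow-up sketch (in Scala for brevity; newData and the output path are assumptions):

import org.apache.spark.sql.functions.{col, explode}

newData.write.parquet("/tmp/output.parquet")  // no more invalid column names

// Individual attributes remain queryable by exploding the map into key/value rows:
newData.select(col("id"), explode(col("attributes"))).show()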



Answer 3:


I have the same problem with the @ and : characters.

In our case, we solved it by flattening the DataFrame.

import scala.util.matching.Regex

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

  // Collapse every run of problem characters to "_" and drop the leading "_"
  // left behind by a leading "@".
  val ALIAS_RE: Regex = "[_.:@]+".r
  val FIRST_AT_RE: Regex = "^_".r

  def getFieldAlias(field_name: String): String = {
    FIRST_AT_RE.replaceAllIn(ALIAS_RE.replaceAllIn(field_name, "_"), "")
  }

  // Select each (possibly nested) field under a sanitized top-level alias,
  // preserving the order of the requested fields.
  def selectFields(df: DataFrame, fields: List[String]): DataFrame = {
    val fields_to_select: List[Column] =
      fields.map(field => col(field).alias(getFieldAlias(field)))
    df.select(fields_to_select: _*)
  }

So the following JSON:

{
  "object": "blabla",
  "schema": {
    "@type": "blabla",
    "name@id": "blabla"
  }
}

That will be transformed into [object, schema.@type, schema.name@id]. The @ and the dots (in your case, =) create problems for Spark SQL.

So after our selectFields you end up with [object, schema_type, schema_name_id]: a flattened DataFrame.
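
A hypothetical usage with the JSON above (the field list is spelled out by hand here):

val flat = selectFields(df, List("object", "schema.@type", "schema.name@id"))
// flat.columns: Array(object, schema_type, schema_name_id)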



Source: https://stackoverflow.com/questions/41640260/exporting-nested-fields-with-invalid-characters-from-spark-2-to-parquet
