Question
I am trying to use Spark 2.0.2 to convert a JSON file into Parquet.
- The JSON file comes from an external source, and therefore the schema can't be changed before it arrives.
- The file contains a map of attributes. The attribute names aren't known before I receive the file.
- The attribute names contain characters that can't be used in Parquet.
{
  "id" : 1,
  "name" : "test",
  "attributes" : {
    "name=attribute" : 10,
    "name=attribute with space" : 100,
    "name=something else" : 10
  }
}
Neither the space nor the equals character can be used in Parquet column names; I get the following error:
org.apache.spark.sql.AnalysisException: Attribute name "name=attribute" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
- As these are nested fields, I can't rename them using an alias. Is this true?
- I have tried renaming the fields within the schema as suggested here: How to rename fields in a DataFrame corresponding to nested JSON (a sketch of the approach appears after the stack trace below). This works for some files; however, I now get the following StackOverflowError:
java.lang.StackOverflowError
    at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:65)
    at org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:258)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1563)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1579)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1578)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1578)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1578)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1576)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1576)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1579)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1578)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1578)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1578)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1576)
    at scala.collection.immutable.List.foreach(List.scala:381)
    ... repeat ...
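For reference, the schema rewrite I attempted is roughly shaped like this (my own sketch, not necessarily the linked answer's exact code; `spark` and `path` stand in for my session and input file):

import org.apache.spark.sql.types._

// Recursively rebuild the schema, replacing the characters Parquet rejects in field names.
def sanitize(dataType: DataType): DataType = dataType match {
  case st: StructType =>
    StructType(st.fields.map(f =>
      f.copy(name = f.name.replaceAll("[ ,;{}()\\n\\t=]", "_"),
             dataType = sanitize(f.dataType))))
  case ArrayType(elementType, containsNull) => ArrayType(sanitize(elementType), containsNull)
  case other => other
}

val df = spark.read.json(path)
// Reapply the cleaned schema without touching the row data.
val renamed = spark.createDataFrame(df.rdd, sanitize(df.schema).asInstanceOf[StructType])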
I want to do one of the following:
- Strip invalid characters from the field names as I load the data into Spark
- Change the column names in the schema without causing stack overflows
- Somehow change the schema to load the original data, but use the following structure internally:
{
  "id" : 1,
  "name" : "test",
  "attributes" : [
    {"key":"name=attribute", "value" : 10},
    {"key":"name=attribute with space", "value" : 100},
    {"key":"name=something else", "value" : 10}
  ]
}
Answer 1:
I solved the problem this way:
df.toDF(df
  .schema
  .fieldNames
  .map(name => "[ ,;{}()\\n\\t=]+".r.replaceAllIn(name, "_")): _*)
where I replaced all invalid characters with "_".
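For illustration, a minimal usage sketch (the variable names `spark`, `path` and `cleaned` are mine); it rewrites the top-level column names in a single pass:

val df = spark.read.json(path)

val cleaned = df.toDF(df.schema.fieldNames
  .map(name => "[ ,;{}()\\n\\t=]+".r.replaceAllIn(name, "_")): _*)

cleaned.printSchema()  // rejected characters in the top-level names are now underscores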
Answer 2:
The only solution I have found to work, so far, is to reload the data with a modified schema. The new schema will load the attributes into a map.
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.*;

// `sql` is the SparkSession (or SQLContext) used to read the file.
Dataset<Row> newData = sql.read().json(path);
StructType newSchema = (StructType) toMapType(newData.schema(), null, "attributes");
newData = sql.read().schema(newSchema).json(path);

// Walk the schema and replace the target column's struct with a map type.
private DataType toMapType(DataType dataType, String fullColName, String col) {
    if (dataType instanceof StructType) {
        StructType structType = (StructType) dataType;
        List<StructField> renamed = Arrays.stream(structType.fields())
                .map(f -> toMapType(f, fullColName == null ? f.name() : fullColName + "." + f.name(), col))
                .collect(Collectors.toList());
        return new StructType(renamed.toArray(new StructField[renamed.size()]));
    }
    return dataType;
}

private StructField toMapType(StructField structField, String fullColName, String col) {
    if (fullColName.equals(col)) {
        // Replace the matched column with map<string,bigint>, so the arbitrary
        // attribute names become map keys (data) instead of field names.
        return new StructField(col, new MapType(DataTypes.StringType, DataTypes.LongType, true), true, Metadata.empty());
    } else if (col.startsWith(fullColName)) {
        return new StructField(structField.name(), toMapType(structField.dataType(), fullColName, col),
                structField.nullable(), structField.metadata());
    }
    return structField;
}
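As a possible follow-up (a sketch in Scala for brevity; `outputPath` is illustrative): once attributes is a map<string,bigint>, the problematic names are map keys, i.e. data values rather than column names, so Parquet accepts them, and individual attributes can still be reached by exploding the map.

import org.apache.spark.sql.functions.{col, explode}

newData.write.parquet(outputPath)  // map<string,bigint> is a legal Parquet type

// explode on a map column produces one row per entry, with "key" and "value" columns
newData.select(col("id"), explode(col("attributes"))).show(false)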
Answer 3:
I have the same problem with @ in the field names.
In our case, we solved it by flattening the DataFrame.
import scala.util.matching.Regex

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

val ALIAS_RE: Regex = "[_.:@]+".r
val FIRST_AT_RE: Regex = "^_".r

// Collapse runs of problem characters to "_" and drop a leading underscore.
def getFieldAlias(field_name: String): String = {
  FIRST_AT_RE.replaceAllIn(ALIAS_RE.replaceAllIn(field_name, "_"), "")
}

// Select the given (possibly nested) fields, aliased to flat, Parquet-safe names.
def selectFields(df: DataFrame, fields: List[String]): DataFrame = {
  var fields_to_select = List[Column]()
  for (field <- fields) {
    val alias = getFieldAlias(field)
    fields_to_select +:= col(field).alias(alias)
  }
  df.select(fields_to_select: _*)
}
So the following JSON:
{
  "object": "blabla",
  "schema": {
    "@type": "blabla",
    "name@id": "blabla"
  }
}
That will be transformed into the field list [object, schema.@type, schema.name@id]. The @ and the dots (in your case, =) create problems for Spark SQL.
So after our selectFields you end up with [object, schema_type, schema_name_id]: a flattened DataFrame.
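A minimal usage sketch with the field names from the JSON above (`df` is assumed to be the result of spark.read.json on that document, and the output path is illustrative):

val flattened = selectFields(df, List("object", "schema.@type", "schema.name@id"))
flattened.columns                            // Array(object, schema_type, schema_name_id)
flattened.write.parquet("/tmp/flattened")    // the flat names are now Parquet-safe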
Source: https://stackoverflow.com/questions/41640260/exporting-nested-fields-with-invalid-characters-from-spark-2-to-parquet