Question
I am trying to load a Parquet file in Spark as a DataFrame:
val df = spark.read.parquet(path)
I am getting:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 12, 10.250.2.32): java.lang.UnsupportedOperationException: Complex types not supported.
While going through the code, I realized there is a check in Spark's VectorizedParquetRecordReader.java (initializeInternal):
Type t = requestedSchema.getFields().get(i);
if (!t.isPrimitive() || t.isRepetition(Type.Repetition.REPEATED)) {
throw new UnsupportedOperationException("Complex types not supported.");
}
So I think it is failing on the isRepetition check. Can anybody suggest a way to solve the issue?
My Parquet data looks like this:
Key1 = value1
Key2 = value1
Key3 = value1
Key4:
.list:
..element:
...key5:
....list:
.....element:
......certificateSerialNumber = dfsdfdsf45345
......issuerName = CN=Microsoft Windows Verification PCA, O=Microsoft Corporation, L=Redmond, S=Washington, C=US
......subjectName = CN=Microsoft Windows, OU=MOPR, O=Microsoft Corporation, L=Redmond, S=Washington, C=US
......thumbprintAlgorithm = Sha1
......thumbprintContent = sfdasf42dsfsdfsdfsd
......validFrom = 2009-12-07 21:57:44.000000
......validTo = 2011-03-07 21:57:44.000000
....list:
.....element:
......certificateSerialNumber = dsafdsafsdf435345
......issuerName = CN=Microsoft Root Certificate Authority, DC=microsoft, DC=com
......subjectName = CN=Microsoft Windows Verification PCA, O=Microsoft Corporation, L=Redmond, S=Washington, C=US
......thumbprintAlgorithm = Sha1
......thumbprintContent = sdfsdfdsf43543
......validFrom = 2005-09-15 21:55:41.000000
......validTo = 2016-03-15 22:05:41.000000
I suspect Key4 may be raising the issue because of its nested structure. The input data is JSON, so maybe Parquet doesn't handle those nested levels the way JSON does.
I found a related bug report, https://issues.apache.org/jira/browse/HIVE-13744, but it describes a Hive complex-type issue. I'm not sure whether it will fix the issue with Parquet or not.
Update 1: After exploring the Parquet files further, I concluded the following:
spark.write created 5 Parquet files. Two of them are empty, so in those files the schema for a column that was supposed to be ArrayType comes out as StringType, and when I try to read the whole directory at once I see the above exception.
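A hypothetical way to confirm this diagnosis, assuming path is the output directory from spark.write: read each part file on its own and compare the inferred schemas; the empty files are expected to show StringType for the column that should be an ArrayType.
import org.apache.hadoop.fs.{FileSystem, Path}
// List the individual part files in the Parquet output directory
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val partFiles = fs.listStatus(new Path(path))
  .map(_.getPath.toString)
  .filter(_.endsWith(".parquet"))
// Print each file's schema separately to spot the mismatching ones
partFiles.foreach { file =>
  println(s"=== $file ===")
  spark.read.parquet(file).printSchema()
}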
Answer 1:
Take 1
SPARK-12854 "Vectorize Parquet reader" indicates that "ColumnarBatch supports structs and arrays" (cf. GitHub pull request 10820), starting with Spark 2.0.0.
And SPARK-13518 "Enable vectorized parquet reader by default", also starting with Spark 2.0.0, deals with the property spark.sql.parquet.enableVectorizedReader (cf. GitHub commit e809074).
My 2 cents: disable that "VectorizedReader" optimization and see what happens.
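A minimal sketch of that experiment, assuming the same spark session and path as in the question; the setting only affects the current session:
// Fall back to the non-vectorized (row-based) Parquet reader
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
val df = spark.read.parquet(path)
df.printSchema()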
Take 2
Since the problem has been narrowed down to some empty files that do not display the same schema as "real" files, my 3 cents: experiment with spark.sql.parquet.mergeSchema
to see if the schema from real files takes precedence after merging.
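A sketch of what that could look like; mergeSchema is a standard Parquet data source option, but whether it resolves the ArrayType/StringType conflict coming from the empty files has to be verified against the actual data:
// Merge the footers of all part files instead of taking the schema
// from a single (possibly empty) file
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet(path)
merged.printSchema()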
Other than that, you might try to eradicate the empty files at write time, with some kind of re-partitioning e.g. coalesce(1)
(OK, 1 is a bit caricatural, but you see the point).
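And a sketch of the write-side workaround, with a hypothetical sourceDf and outputPath; collapsing to a small number of partitions before writing avoids tasks that emit empty part files:
// Reduce the partition count so no task writes an empty Parquet file;
// 1 is the extreme case mentioned above, a small number also works
sourceDf.coalesce(1)
  .write
  .mode("overwrite")
  .parquet(outputPath)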
Source: https://stackoverflow.com/questions/40305526/spark-exception-complex-types-not-supported-while-loading-parquet