Spark SQL fails because “Constant pool has grown past JVM limit of 0xFFFF”

Submitted by 故事扮演 on 2019-12-03 14:40:45
Andrew

This is due to a known JVM limitation: the constant pool of a generated Java class cannot grow beyond 64K entries (0xFFFF).

This limitation was worked around in SPARK-18016, which is fixed in Spark 2.3, scheduled for release in January 2018.
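A hedged sketch of the kind of job that can hit this limit on Spark versions before 2.3. The column count and expressions here are illustrative assumptions, not a guaranteed repro; the exact threshold depends on the query:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object WideSchemaRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("wide-schema-repro")
      .master("local[*]")
      .getOrCreate()

    // Build a very wide schema: a single row with a few thousand columns.
    // On Spark < 2.3, codegen for such a plan can push the generated
    // class's constant pool past the 0xFFFF entry limit.
    val cols = (1 to 4000).map(i => lit(i).as(s"c$i"))  // 4000 is an arbitrary illustrative count
    val wide = spark.range(1).select(cols: _*)

    // A projection touching every column may fail with
    // "Constant pool has grown past JVM limit of 0xFFFF".
    wide.select(wide.columns.map(c => (lit(1) + wide(c)).as(c)): _*).show()

    spark.stop()
  }
}
```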

Nhan Trinh

I solved this problem by dropping all the unused columns in the DataFrame, or by selecting only the columns you actually need.

It turns out that Spark DataFrames cannot handle extremely wide schemas. There is no specific number of columns at which Spark breaks with "Constant pool has grown past JVM limit of 0xFFFF", since it depends on the kind of query, but reducing the number of columns can help to work around this issue. A sketch of the workaround follows.
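A minimal sketch of this workaround, assuming a hypothetical wide Parquet table at `/path/to/wide_table` and illustrative column names:

```scala
import org.apache.spark.sql.SparkSession

object NarrowSchemaWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("narrow-schema-workaround")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical wide input with hundreds or thousands of columns.
    val df = spark.read.parquet("/path/to/wide_table")

    // Option 1: select only the columns the query actually needs, so the
    // generated code (and its constant pool) stays small.
    val narrow = df.select("id", "event_time", "amount")
    narrow.groupBy("id").sum("amount").show()

    // Option 2: drop columns you know are unused before running the query.
    val trimmed = df.drop("raw_payload", "debug_blob")
    trimmed.createOrReplaceTempView("trimmed")
    spark.sql("SELECT id, COUNT(*) FROM trimmed GROUP BY id").show()

    spark.stop()
  }
}
```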

The underlying root cause is the JVM's 64K constant pool limit for generated Java classes; see also Andrew's answer.

For future reference, this issue was fixed in Spark 2.3 (as Andrew noted).

If you encounter this issue on Amazon EMR, upgrade to release 5.13 or later.
