Question
I am using Spark 2.3.1 and Python 3.6.5 on Ubuntu. When I run the DataFrame.describe() function, I get the error below in a Jupyter Notebook.
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-19-ea8415b8a3ee> in <module>()
----> 1 df.describe()
~/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/dataframe.py in describe(self, *cols)
1052 if len(cols) == 1 and isinstance(cols[0], list):
1053 cols = cols[0]
-> 1054 jdf = self._jdf.describe(self._jseq(cols))
1055 return DataFrame(jdf, self.sql_ctx)
1056
~/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
~/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
~/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o132.describe.
: java.lang.IllegalArgumentException
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:449)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:432)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:103)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:432)
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:262)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:261)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:261)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2299)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2073)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
at org.apache.spark.sql.execution.stat.StatFunctions$.aggResult$lzycompute$1(StatFunctions.scala:273)
at org.apache.spark.sql.execution.stat.StatFunctions$.org$apache$spark$sql$execution$stat$StatFunctions$$aggResult$1(StatFunctions.scala:273)
at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$summary$2.apply$mcVI$sp(StatFunctions.scala:286)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at org.apache.spark.sql.execution.stat.StatFunctions$.summary(StatFunctions.scala:285)
at org.apache.spark.sql.Dataset.summary(Dataset.scala:2473)
at org.apache.spark.sql.Dataset.describe(Dataset.scala:2412)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:844)
This is the test code I am using:
import findspark
findspark.init('/home/pathirippilly/spark-2.3.1-bin-hadoop2.7')
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType,StructType,StructField,IntegerType
spark=SparkSession.builder.appName('Basics').getOrCreate()
df=spark.read.json('people.json')
df.describe() #not working
df.describe().show #not working
These are the versions of Java, Scala, Python, and Spark that I have installed:
pathirippilly@sparkBox:/usr/lib/jvm$ java -version
openjdk version "10.0.1" 2018-04-17
OpenJDK Runtime Environment (build 10.0.1+10-Ubuntu-3ubuntu1)
OpenJDK 64-Bit Server VM (build 10.0.1+10-Ubuntu-3ubuntu1, mixed mode)
pathirippilly@sparkBox:/usr/lib/jvm$ scala -version
Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL
Python: 3.6.5
Spark: spark-2.3.1-bin-hadoop2.7
And my environment variable setup is as below. I have saved all of these variables in /etc/environment and source that file from /etc/bash.bashrc:
JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
PYSPARK_DRIVER_OPTION="jupyter"
PYSPARK_DRIVER_PYTHON_OPTS="notebook"
PYSPARK_PYTHON=python3
SPARK_HOME='/home/pathirippilly/spark-2.3.1-bin-hadoop2.7/'
PATH=$SPARK_HOME:$PATH
PYTHONPATH=$SPARK_HOME/python/
Also, I have not configured spark-env.sh. Is it necessary to have spark-env.sh configured?
Is this a compatibility issue, or am I doing something wrong here?
It would be really helpful if someone could point me in the right direction.
Note: df.show() works perfectly on the same DataFrame.
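The traceback dies inside org.apache.xbean.asm5.ClassReader while Spark's ClosureCleaner inspects a closure, so the Java version that the Py4J gateway JVM is actually running on is worth checking. Below is a minimal diagnostic sketch (not part of the original post), assuming the spark session created in the snippet above:
# Hedged diagnostic: ask the JVM behind the SparkSession which Java it runs;
# Spark 2.3.x only documents support for Java 8, so a newer version here is suspect.
sc = spark.sparkContext
print(sc._jvm.java.lang.System.getProperty("java.version"))
print(sc._jvm.java.lang.System.getProperty("java.home"))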
Answer 1:
This issue is now fixed for me. I reconfigured the entire setup from scratch and prepared my /etc/environment file as below:
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/$
export SPARK_HOME='/home/pathirippilly/spark-2.3.1-bin-hadoop2.7'
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_PYTHONPATH=python3
export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-amd64"
export PATH=$SPARK_HOME:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
And I have added the line below to /etc/bash.bashrc:
source /etc/environment
Notes:
* pyspark is available on my PYTHONPATH, and every time I open a terminal session /etc/bash.bashrc runs source /etc/environment, which in turn exports all of the environment variables above.
* I used java-1.8.0-openjdk-amd64 instead of Java 10 or 11. I think 10 or 11 should also work according to the PySpark 2.3.1 release documentation, but I am not sure.
* I used Scala 2.11.12 only.
* My py4j module is also available on my PYTHONPATH.
I am not sure where I had messed up before, but with the setup above my PySpark 2.3.1 works fine with Java 1.8, Scala 2.11.12, and Python 3.6.5 (and without the findspark module).
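As a quick sanity check after the reconfiguration, here is a minimal sketch (assuming the same people.json file from the question and that the exported variables above are in effect):
from pyspark.sql import SparkSession

# Re-run the snippet that originally failed; findspark is no longer needed
# because SPARK_HOME and PYTHONPATH are exported from /etc/environment.
spark = SparkSession.builder.appName('Basics').getOrCreate()
df = spark.read.json('people.json')
df.describe().show()  # should print the summary table instead of raising Py4JJavaError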
Answer 2:
OP, I had exactly the same setup as you. In fact, we are following the same Spark course on Udemy (setting up everything they say to the letter), and I hit the same error at the same place. The only thing I changed to make it work was the Java version. When the course was made,
$ sudo apt-get install default-jre
installed Java 8, but now it installs Java 11. I uninstalled that Java and ran
$ sudo apt-get install openjdk-8-jre
then changed the JAVA_HOME path to point to it, and now it works.
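For reference, a hedged sketch of what the relevant /etc/environment entries might look like after that switch; the JVM directory name is an assumption based on a default Ubuntu openjdk-8 install, so verify it with ls /usr/lib/jvm on your machine:
# assumed install path for openjdk-8 on Ubuntu; Answer 1 used java-1.8.0-openjdk-amd64,
# which typically points to the same install
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
export PATH=$JAVA_HOME/bin:$PATH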
Source: https://stackoverflow.com/questions/51601819/dataframe-describe-function-on-spark-2-3-1-is-throwing-py4jjavaerror