Use a method inside a UDF function Spark Scala

Anonymous (unverified), posted on 2019-12-03 02:38:01

Question:

I want to call a method defined in another class from inside a user-defined function (UDF), but it's not working.

I have a method:

    def traitementDataFrameEleve(sc: SparkSession, dfRedis: DataFrame, domainMail: String, dir: String): Boolean = {
      def loginUDF = udf((sn: String, givenName: String) => {
        LoginClass.GenerateloginPersone(sn, givenName, dfr)
      })
      dfEleve.withColumn("ENTPersonLogin", loginUDF(dfEleve("sn"), dfEleve("givenName")))
    }

LoginClass is a class that contains the GenerateloginPersone method.

Error output:

    org.apache.spark.SparkException: Failed to execute user defined function(anonfun$loginUDF$1$1: (string, string) => string)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
    Caused by: java.lang.NullPointerException
        at org.apache.spark.sql.Dataset.schema(Dataset.scala:410)
        at org.apache.spark.sql.Dataset.printSchema(Dataset.scala:419)
        at IntegrationDonneesENTLea_V1_AcBordeaux.LoginClass$.GenerateloginPersone(LoginClass.scala:16)
        at IntegrationDonneesENTLea_V1_AcBordeaux.Eleve$$anonfun$loginUDF$1$1.apply(Eleve.scala:25)
        at IntegrationDonneesENTLea_V1_AcBordeaux.Eleve$$anonfun$loginUDF$1$1.apply(Eleve.scala:23)
        ... 16 more

Thank you.

Answer 1:

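Inside a Spark task (a transformation or a UDF application) you cannot access:

  • distributed data structures (such as Dataset or RDD)
  • SparkContext / SparkSession

These are usable only on the driver. Your UDF passes a DataFrame (dfr) to GenerateloginPersone, which calls printSchema on it; the stack trace shows that call failing in Dataset.schema while the task runs. This is why you get an NPE.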
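
One common way around this is to pull the lookup data out of the DataFrame on the driver and hand the UDF a plain Scala collection instead. The following is only a rough sketch under several assumptions: the body of GenerateloginPersone is not shown in the question, so generateLogin here is a hypothetical stand-in, the "login" column of dfRedis is an invented name, the names are assumed to be non-null, and the Redis data is assumed to be small enough to collect and broadcast.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object LoginSketch {
  // Hypothetical stand-in for LoginClass.GenerateloginPersone. It takes a plain
  // Scala Set instead of a DataFrame, so it is safe to call inside a task.
  // Assumes sn and givenName are non-null.
  def generateLogin(sn: String, givenName: String, taken: Set[String]): String = {
    val base = (givenName.take(1) + sn).toLowerCase
    if (!taken.contains(base)) base
    // Otherwise append the first numeric suffix that is not already taken.
    else Iterator.from(1).map(i => s"$base$i").find(c => !taken.contains(c)).get
  }

  def addLogins(spark: SparkSession, dfEleve: DataFrame, dfRedis: DataFrame): DataFrame = {
    // Driver side: materialize the lookup data as a local collection.
    // Assumption: dfRedis has a string column named "login".
    val taken: Set[String] =
      dfRedis.select("login").collect().map(_.getString(0)).toSet
    val takenBc = spark.sparkContext.broadcast(taken)

    // Task side: the closure captures only the broadcast handle, not a DataFrame.
    val loginUDF = udf((sn: String, givenName: String) =>
      generateLogin(sn, givenName, takenBc.value))

    dfEleve.withColumn("ENTPersonLogin", loginUDF(col("sn"), col("givenName")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("login-sketch")
      .getOrCreate()
    import spark.implicits._

    // Tiny made-up inputs, for illustration only.
    val dfRedis = Seq("jdoe").toDF("login")
    val dfEleve = Seq(("Doe", "John"), ("Martin", "Alice")).toDF("sn", "givenName")

    addLogins(spark, dfEleve, dfRedis).show()
    spark.stop()
  }
}
```

Note that this sketch only checks against logins that already exist in dfRedis; two new rows with the same name would still receive the same login. If the lookup data is too large to collect, joining dfEleve with dfRedis is probably the better option, since a join keeps everything in DataFrame operations and needs no UDF access to another Dataset.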


