Problem: Spark UDF function cannot be serialized

Submitted by 泄露秘密 on 2019-12-16 23:27:03

When implementing a Spark UDF as follows:

import scala.collection.mutable.ArrayBuffer
import scala.util.Random

val randomNew = (arra: Seq[String], n: Int) => {
      if (arra.size < n) {
        return arra.toSeq              // `return` inside a lambda -- this is the bug
      }
      var arr = ArrayBuffer[String]()
      arr ++= arra
      var outList: List[String] = Nil
      var border = arr.length          // range of the random index
      for (i <- 0 to n - 1) {          // draw n elements
        val index = (new Random).nextInt(border)
        outList = outList ::: List(arr(index))
        arr(index) = arr.last          // move the last element into the slot just taken
        arr = arr.dropRight(1)         // drop the last element
        border -= 1
      }
      outList.toSeq
    }
sqlContext.udf.register("randomNew", randomNew)

Executing it fails with the following error:

Caused by: org.apache.spark.SparkException: Task not serializable
	at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
	at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
	at org.apache.spark.SparkContext.clean(SparkContext.scala:2067)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
	at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706)
	at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.apply(TungstenAggregate.scala:86)
	at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.apply(TungstenAggregate.scala:80)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
	... 28 more
Caused by: java.io.NotSerializableException: java.lang.Object
Serialization stack:

The culprit is the `return arra.toSeq` line. In Scala, `return` inside an anonymous function does not simply exit the lambda: the compiler implements it as a non-local return that throws `scala.runtime.NonLocalReturnControl`, keyed by a plain `java.lang.Object` which gets captured in the closure. That captured `Object` is exactly what `java.io.NotSerializableException: java.lang.Object` is complaining about. Drop the `return` and express the early exit as a value instead, for example with pattern matching; otherwise the error above will keep appearing.
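The non-local-return behavior is easy to see outside Spark. A minimal sketch (Scala 2 semantics; `NonLocalReturnDemo` and `demo` are names invented for this illustration): the `return` inside the lambda does not return from the lambda, it unwinds all the way out of the enclosing method.

```scala
object NonLocalReturnDemo {
  def demo(): String = {
    // `return` here is a *non-local* return: the compiler throws a
    // NonLocalReturnControl keyed by a fresh java.lang.Object, catches it
    // at the boundary of demo(), and returns its value from demo() itself.
    val f: Int => String = (x: Int) => {
      if (x > 0) return "escaped the enclosing method"
      x.toString
    }
    f(1)            // throws the control exception; the line below never runs
    "never reached"
  }

  def main(args: Array[String]): Unit = {
    println(demo()) // prints "escaped the enclosing method"
  }
}
```

The `Object` key generated for that control exception is what Spark's `ClosureCleaner` finds unserializable when the lambda is shipped to executors.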

The corrected code:

import scala.collection.mutable.ArrayBuffer
import scala.util.Random

val randomNew = (arra: Seq[String], n: Int) => {
      val routeKey = arra.size <= n
      routeKey match {
        case true => arra
        case _ => {
          var arr = ArrayBuffer[String]()
          arr ++= arra
          var outList: List[String] = Nil
          var border = arr.length          // range of the random index
          for (i <- 0 to n - 1) {          // draw n elements
            val index = (new Random).nextInt(border)
            outList = outList ::: List(arr(index))
            arr(index) = arr.last          // move the last element into the slot just taken
            arr = arr.dropRight(1)         // drop the last element
            border -= 1
          }
          outList
        }
      }
    }
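Pattern matching is not the only way to avoid `return`: since every Scala block is an expression, a plain `if`/`else` works just as well. A sketch of the same sampler in that style (`randomSample` is a name invented here, not from the original post):

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

// Same sampler expressed as an if/else expression -- no `return` needed,
// so nothing unserializable is captured in the closure.
val randomSample = (arra: Seq[String], n: Int) => {
  if (arra.size <= n) arra
  else {
    val arr = ArrayBuffer(arra: _*)
    var outList: List[String] = Nil
    var border = arr.length             // range of the random index
    for (_ <- 0 until n) {              // draw n elements
      val index = Random.nextInt(border)
      outList = outList ::: List(arr(index))
      arr(index) = arr.last             // move the last element into the taken slot
      arr.remove(arr.length - 1)        // shrink the buffer by one
      border -= 1
    }
    outList
  }
}
```

Both branches yield a `Seq[String]`, so the lambda's type is unchanged and it can be registered with `sqlContext.udf.register` exactly as before.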