State management not serializable


Question


In my application, I want to keep track of multiple states. Thus I tried to encapsulate the whole state management logic within a class StateManager as follows:

@SerialVersionUID(xxxxxxxL)
class StateManager(
    inputStream: DStream[(String, String)],
    initialState: RDD[(String, String)]
) extends Serializable {
  lazy val state = inputStream.mapWithState(stateSpec).map(_.get)
  lazy val stateSpec = StateSpec
    .function(trackStateFunc _)
    .initialState(initialState)
    .timeout(Seconds(30))
  def trackStateFunc(key: String, value: Option[String], state: State[String]): Option[(String, String)] = {
    // state update logic elided
    None
  }
}

object StateManager {
  def apply(dstream: DStream[(String, String)], initialstate: RDD[(String, String)]) =
    new StateManager(dstream, initialstate)
}

The @SerialVersionUID(xxxxxxxL) ... extends Serializable is an attempt to solve my problem.

But when calling StateManager from my main class like the following:

val lStreamingEnvironment = StreamingEnvironment(streamingWindow, checkpointDirectory)
val stateManager = StateManager(lStreamingEnvironment.sparkContext, 1, None)
val state = stateManager.state(lKafkaStream)

state.foreachRDD(_.foreach(println))

(see below for StreamingEnvironment), I get:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
[...]
Caused by: java.io.NotSerializableException: Object of org.apache.spark.streaming.kafka.DirectKafkaInputDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects.

The error is clear, but I still don't understand at what point it is triggered.

Where is it triggered? What could I do to solve this and keep a reusable class?


The StreamingEnvironment class, which might be useful here:

class StreamingEnvironment(mySparkConf: SparkConf, myKafkaConf: KafkaConf, myStreamingWindow: Duration, myCheckPointDirectory: String) {
  val sparkContext = SparkContext.getOrCreate(mySparkConf)
  lazy val streamingContext = new StreamingContext(sparkContext, myStreamingWindow)

  streamingContext.checkpoint(myCheckPointDirectory)
  streamingContext.remember(Minutes(1))

  def stream() = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](streamingContext, myKafkaConf.mBrokers, myKafkaConf.mTopics)
}

object StreamingEnvironment {
  def apply(streamingWindow: Duration, checkpointDirectory: String) = {
    //setup sparkConf and kafkaConf

    new StreamingEnvironment(sparkConf, kafkaConf, streamingWindow, checkpointDirectory)
  }
}

Answer 1:


When we lift a method into a function with trackStateFunc _, the resulting function value keeps a reference to its enclosing instance, so function(trackStateFunc _) pulls the whole StateManager, including its DStream, into the closure. Declaring trackStateFunc directly as a function (i.e. as a val) will probably take care of the problem.
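A minimal sketch of what that could look like, keeping the shape of the StateManager from the question (the state update logic is still elided, and the imports assume a standard Spark Streaming setup):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

class StateManager(
    inputStream: DStream[(String, String)],
    initialState: RDD[(String, String)]
) extends Serializable {

  // A function value instead of a method: passing it to StateSpec.function
  // does not drag a reference to the enclosing StateManager instance along.
  val trackStateFunc: (String, Option[String], State[String]) => Option[(String, String)] =
    (key, value, state) => {
      // actual state update logic elided, as in the question
      None
    }

  lazy val stateSpec = StateSpec
    .function(trackStateFunc)
    .initialState(initialState)
    .timeout(Seconds(30))

  lazy val state = inputStream.mapWithState(stateSpec).map(_.get)
}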

Also note that marking a class Serializable does not magically make it so. DStream is not serializable, so the DStream reference should be annotated @transient, which will probably solve the issue as well.
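For the @transient variant, a sketch under the same assumptions: the DStream and RDD are kept as transient fields, so even if a StateManager instance ends up being serialized as part of a closure, Java serialization skips them:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

@SerialVersionUID(1L)
class StateManager(
    @transient private val inputStream: DStream[(String, String)],
    @transient private val initialState: RDD[(String, String)]
) extends Serializable {
  // The transient fields are excluded when the instance is serialized,
  // so the non-serializable DStream no longer bloats the task closure.
  // stateSpec / state definitions as in the sketch above ...
}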



Source: https://stackoverflow.com/questions/41460046/state-management-not-serializable
