How do I fix Dataflow being unable to serialize my DoFn?

抹茶落季 2020-12-17 20:16

When I run my Dataflow pipeline, I get the exception below complaining that my DoFn can't be serialized. How do I fix this?

Here's the stack trace:



        
2 answers
  • 2020-12-17 21:02

    If you scroll through the stack trace, one of the causes clearly identifies the data that isn't serializable.

    Caused by: java.io.NotSerializableException: org.apache.hadoop.mapred.JobConf
    

    The problem was that my DoFn took a JobConf instance in its constructor and stored it in an instance variable. I had assumed JobConf was serializable, but it turns out it isn't.

    To solve this, I did the following:

    • I marked the JobConf member variable as transient so that it wouldn't be serialized.
    • I created a separate instance variable of type byte[] to hold a serialized version of the JobConf.
    • In my constructor, I serialized the JobConf to a byte[] and stored it in that variable.
    • I overrode startBundle and deserialized the JobConf from the byte[] there.

    Here's a gist with my DoFn.
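
Independent of that gist, the four steps above can be sketched in plain Java without the Dataflow SDK or Hadoop on the classpath. Everything named here is hypothetical: `FakeConf` stands in for JobConf, its `toBytes`/`fromBytes` helpers are invented for illustration (with a real JobConf you would more likely go through its Writable methods `write(DataOutput)` and `readFields(DataInput)`), and `ConfCarryingFn.startBundle()` is an ordinary method that only mimics where a real DoFn would rebuild the object.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a non-serializable config class such as JobConf.
class FakeConf {
    final String value;

    FakeConf(String value) { this.value = value; }

    // Invented encoding for this sketch only.
    byte[] toBytes() { return value.getBytes(StandardCharsets.UTF_8); }

    static FakeConf fromBytes(byte[] bytes) {
        return new FakeConf(new String(bytes, StandardCharsets.UTF_8));
    }
}

// Hypothetical stand-in for the DoFn; only the serialization-relevant parts.
class ConfCarryingFn implements Serializable {
    private static final long serialVersionUID = 1L;

    private transient FakeConf conf;   // step 1: skipped by Java serialization
    private final byte[] confBytes;    // step 2: serializable copy of the config

    ConfCarryingFn(FakeConf conf) {
        this.conf = conf;
        this.confBytes = conf.toBytes();  // step 3: serialize in the constructor
    }

    // Step 4: a real DoFn would do this in its startBundle override.
    void startBundle() {
        if (conf == null) {
            conf = FakeConf.fromBytes(confBytes);
        }
    }

    boolean hasLiveConf() { return conf != null; }

    String confValue() { return conf.value; }
}

class RoundTripDemo {
    // Serializes and deserializes the object, roughly what happens when a
    // function object is shipped to a worker.
    static ConfCarryingFn roundTrip(ConfCarryingFn fn) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
                out.writeObject(fn);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray()))) {
                return (ConfCarryingFn) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ConfCarryingFn copy = roundTrip(new ConfCarryingFn(new FakeConf("example.key=1")));
        System.out.println(copy.hasLiveConf());  // transient field is null after the round trip
        copy.startBundle();
        System.out.println(copy.confValue());
    }
}
```

The transient field is empty on the deserialized copy until `startBundle()` runs, which is why the rebuild has to happen on the worker side rather than in the constructor.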

  • 2020-12-17 21:04

    To add to what Jeremy says...

    Another common cause of serialization issues is using an anonymous DoFn within a non-static context. An anonymous inner class holds an implicit reference to its enclosing instance, so that instance gets serialized along with the DoFn; if the enclosing class isn't serializable, serialization fails.
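
A minimal sketch of that effect in plain Java, with no Dataflow dependency. `SketchFn` is a hypothetical stand-in for DoFn, and `PipelineBuilderSketch` and `SerializableCheck` are likewise invented names. The anonymous class reads a field of the enclosing instance so that the captured reference is unambiguous; the static nested class receives the same value through its constructor instead.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical stand-in for DoFn: an abstract, Serializable function object.
abstract class SketchFn implements Serializable {
    private static final long serialVersionUID = 1L;

    abstract String apply(String input);
}

// Deliberately NOT Serializable, like a typical class that assembles a pipeline.
class PipelineBuilderSketch {
    private final String suffix;

    PipelineBuilderSketch(String suffix) { this.suffix = suffix; }

    // Anonymous class created in an instance method: it keeps a reference to
    // the enclosing PipelineBuilderSketch, which cannot be serialized.
    SketchFn anonymousFn() {
        return new SketchFn() {
            @Override
            String apply(String input) { return input.toUpperCase() + suffix; }
        };
    }

    // Static nested class: no enclosing-instance reference; state is passed in.
    static class UpperFn extends SketchFn {
        private final String suffix;

        UpperFn(String suffix) { this.suffix = suffix; }

        @Override
        String apply(String input) { return input.toUpperCase() + suffix; }
    }

    static SketchFn staticFn(String suffix) { return new UpperFn(suffix); }
}

class SerializableCheck {
    // Returns false if Java serialization rejects any object in the graph.
    static boolean isSerializable(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PipelineBuilderSketch builder = new PipelineBuilderSketch("!");
        System.out.println("anonymous: " + isSerializable(builder.anonymousFn()));
        System.out.println("static nested: " + isSerializable(PipelineBuilderSketch.staticFn("!")));
    }
}
```

Both functions behave identically when called; only the anonymous one drags the enclosing object into the serialized graph. Moving the DoFn into a static nested or top-level class, or creating it from a static method, avoids the capture.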
