When I run my Dataflow pipeline, I get the exception below complaining that my DoFn can't be serialized. How do I fix this?
Here's the stack trace:
If you scroll through the stack trace, one of the causes clearly identifies the data that isn't serializable.
Caused by: java.io.NotSerializableException: org.apache.hadoop.mapred.JobConf
The problem was that my DoFn took a JobConf instance in its constructor and stored it in an instance variable. I had assumed JobConf was serializable, but it turns out it isn't.
To solve this I did the following
Here's a gist with my DoFn.
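For readers who can't open the gist, here is a minimal sketch of one common way around this kind of failure. It is not necessarily what the gist does. It assumes the settings you need can be carried as plain serializable data and the configuration object rebuilt on the worker. `FakeJobConf`, `BrokenFn`, and `FixedFn` are hypothetical stand-ins (for `JobConf` and for a DoFn) so the example runs with only the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class TransientConfigSketch {

    /** Hypothetical stand-in for JobConf: deliberately NOT Serializable. */
    static class FakeJobConf {
        private final Map<String, String> settings = new HashMap<>();
        void set(String key, String value) { settings.put(key, value); }
        String get(String key) { return settings.get(key); }
    }

    /** Stand-in for the failing DoFn: stores the config object directly. */
    static class BrokenFn implements Serializable {
        private static final long serialVersionUID = 1L;
        private final FakeJobConf conf; // non-serializable field breaks serialization
        BrokenFn(FakeJobConf conf) { this.conf = conf; }
    }

    /** One possible fix: ship plain settings and rebuild the config lazily. */
    static class FixedFn implements Serializable {
        private static final long serialVersionUID = 1L;
        private final HashMap<String, String> settings; // serializable data only
        private transient FakeJobConf conf;             // skipped by Java serialization

        FixedFn(Map<String, String> settings) { this.settings = new HashMap<>(settings); }

        FakeJobConf conf() {
            if (conf == null) { // null after deserialization, so rebuild it
                conf = new FakeJobConf();
                for (Map.Entry<String, String> e : settings.entrySet()) {
                    conf.set(e.getKey(), e.getValue());
                }
            }
            return conf;
        }
    }

    /** Serializes and deserializes a value with plain Java serialization. */
    static <T extends Serializable> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T copy = (T) in.readObject();
                return copy;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns false only when the failure is a NotSerializableException. */
    static boolean canSerialize(Serializable value) {
        try {
            roundTrip(value);
            return true;
        } catch (IllegalStateException e) {
            if (e.getCause() instanceof NotSerializableException) {
                return false;
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        FakeJobConf conf = new FakeJobConf();
        conf.set("input.path", "/tmp/in"); // hypothetical setting
        System.out.println("BrokenFn serializable: " + canSerialize(new BrokenFn(conf)));

        Map<String, String> settings = new HashMap<>();
        settings.put("input.path", "/tmp/in");
        FixedFn copy = roundTrip(new FixedFn(settings));
        System.out.println("FixedFn value after round trip: " + copy.conf().get("input.path"));
    }
}
```

In a real DoFn you would probably rebuild the object in a setup or start-bundle hook rather than a lazy getter, but which hook exists depends on the SDK version you're on, so treat that part as an assumption.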
To add to what Jeremy says...
Another common cause of serialization issues is using an anonymous DoFn within a non-static context. Anonymous inner classes hold an implicit reference to the enclosing instance, so that instance gets serialized along with the DoFn, and serialization fails if the enclosing class isn't serializable.
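To make the implicit reference concrete, here is a small sketch using only the JDK. `Fn` is a hypothetical stand-in for a DoFn and `AnonymousCaptureSketch` plays the role of a non-serializable pipeline class. It compares an anonymous class created in an instance method, one created in a static method, and a static nested class.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class AnonymousCaptureSketch { // note: NOT Serializable

    /** Hypothetical stand-in for a DoFn: a function object that must be Serializable. */
    interface Fn extends Serializable {
        String apply(String input);
    }

    private final String prefix;

    AnonymousCaptureSketch(String prefix) {
        this.prefix = prefix;
    }

    /** Anonymous class in an instance method: keeps a hidden reference to `this`. */
    Fn capturingFn() {
        return new Fn() {
            @Override
            public String apply(String input) {
                return prefix + input; // reads a field of the enclosing instance
            }
        };
    }

    /** Anonymous class in a static method: there is no enclosing instance to capture. */
    static Fn staticContextFn(final String prefix) {
        return new Fn() {
            @Override
            public String apply(String input) {
                return prefix + input; // captures only the local String
            }
        };
    }

    /** Static nested class: carries only the fields it declares. */
    static class NamedFn implements Fn {
        private static final long serialVersionUID = 1L;
        private final String prefix;

        NamedFn(String prefix) {
            this.prefix = prefix;
        }

        @Override
        public String apply(String input) {
            return prefix + input;
        }
    }

    /** Returns false when Java serialization hits a non-serializable object. */
    static boolean canSerialize(Object value) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        AnonymousCaptureSketch enclosing = new AnonymousCaptureSketch("> ");
        System.out.println("anonymous, instance context: " + canSerialize(enclosing.capturingFn()));
        System.out.println("anonymous, static context:   " + canSerialize(staticContextFn("> ")));
        System.out.println("static nested class:         " + canSerialize(new NamedFn("> ")));
    }
}
```

Only the first case fails, because its hidden reference drags the enclosing object into the serialized graph. Moving the anonymous class into a static method or turning it into a static nested class, and passing in whatever values it needs, avoids that.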