Handling Writables fully qualified name changes in Hadoop SequenceFile

Question

I have a bunch of Hadoop SequenceFiles that have been written with some Writable subclass I wrote. Let's call it FishWritable.

This Writable worked out well for a while, until I decided there was a need for a package renaming for clarity. So now the fully qualified name of FishWritable is com.vertebrates.fishes.FishWritable instead of com.mammals.fishes.FishWritable. It was a reasonable change given how the scope of the package in question had evolved.

Then I discovered that none of my MapReduce jobs will run, as they crash when attempting to initialize the SequenceFileRecordReader:

java.lang.RuntimeException: java.io.IOException: WritableName can't load class: com.mammals.fishes.FishWritable
at org.apache.hadoop.io.SequenceFile$Reader.getKeyClass(SequenceFile.java:1949)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1899)
...

A couple of options for dealing with this are immediately apparent. I can simply rerun all my previous jobs to regenerate the output with the up-to-date key class name, running any dependent jobs in sequence. This can obviously be quite time-consuming and is sometimes not even possible.

Another possibility might be to write a simple job that reads the SequenceFile as text and replaces any instances of the class name with the new one. This is basically method #1 with a tweak that makes it less complicated to do. If I have a lot of big files it's still quite impractical.

Is there a better way to deal with refactorings of fully qualified class names used in SequenceFiles? Ideally, I'm looking for some way to specify a new fallback class name if the specified one is not found, to allow for running against both dated and updated types of this SequenceFile.

Answer 1:

The org.apache.hadoop.io.WritableName class mentioned in the exception stack trace has some useful methods.

From the doc:

Utility to permit renaming of Writable implementation classes without invalidating files that contain their class name.

// Add an alternate name for a class.
public static void addName(Class writableClass, String name)

In your case you could call this before reading from your SequenceFiles:

WritableName.addName(com.vertebrates.fishes.FishWritable.class, "com.mammals.fishes.FishWritable");

This way, when attempting to read a com.mammals.fishes.FishWritable from an old SequenceFile, the new com.vertebrates.fishes.FishWritable class will be used.
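To see why registering an alias helps, here is a minimal, self-contained sketch of the same idea: a name-to-class map that is consulted before falling back to normal class loading. This is a simplified stand-in and not Hadoop's actual WritableName source; AliasRegistry, resolve and the nested FishWritable are hypothetical names used only for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AliasRegistry {
    // Maps an alternate (e.g. old) class name to the class that should be used for it.
    private static final Map<String, Class<?>> NAME_TO_CLASS = new ConcurrentHashMap<>();

    // Register an alternate name for a class.
    public static void addName(Class<?> cls, String alias) {
        NAME_TO_CLASS.put(alias, cls);
    }

    // Resolve a stored class name: check aliases first, then try to load it by name.
    public static Class<?> resolve(String name) {
        Class<?> aliased = NAME_TO_CLASS.get(name);
        if (aliased != null) {
            return aliased;
        }
        try {
            return Class.forName(name);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("can't load class: " + name, e);
        }
    }

    // Hypothetical stand-in for the renamed com.vertebrates.fishes.FishWritable.
    public static class FishWritable {}

    public static void main(String[] args) {
        String oldName = "com.mammals.fishes.FishWritable";
        try {
            resolve(oldName);
        } catch (IllegalStateException e) {
            // Without an alias, the old name cannot be resolved.
            System.out.println("before alias: " + e.getMessage());
        }
        addName(FishWritable.class, oldName);
        // With the alias registered, the old name maps to the new class.
        System.out.println("after alias: " + resolve(oldName).getName());
    }
}
```

The alias lives in a static map, so it only exists in the JVM where it was registered. In a real job that presumably means the addName call has to run in every JVM that opens the old files, before the reader is created.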

PS: Why was the fish in the mammals package in the first place? ;)

Answer 2:

Looking at the SequenceFile spec, it seems clear that there is no provision for alternative class names.

If I weren't in a position to rewrite the data, one more option would be to keep a com.mammals.fishes.FishWritable class that extends com.vertebrates.fishes.FishWritable, and annotate it as deprecated so nobody accidentally adds code to the empty wrapper. After a long enough time, the data written with the old class will become obsolete, and you'll be able to safely delete the mammals class.
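A rough sketch of that shim idea, assuming nothing about Hadoop itself: the class names below are hypothetical nested stand-ins (a single file cannot hold two packages), with OldFishWritable playing the role of com.mammals.fishes.FishWritable and NewFishWritable the role of com.vertebrates.fishes.FishWritable. It only shows that an object instantiated by the old name can be used as the new type.

```java
public class ShimDemo {
    // Stand-in for the renamed class that holds all the real logic.
    public static class NewFishWritable {
        public String describe() {
            return "fish";
        }
    }

    // Stand-in for the class left behind at the old name: empty and deprecated.
    @Deprecated
    public static class OldFishWritable extends NewFishWritable {}

    // Instantiate a class by name, the way a reader would from a stored class name.
    public static Object instantiate(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("can't load class: " + className, e);
        }
    }

    public static void main(String[] args) {
        // The name recorded in old files still resolves, because the shim exists.
        String nameInOldFile = OldFishWritable.class.getName();
        // The resulting object is usable wherever the new type is expected.
        NewFishWritable fish = (NewFishWritable) instantiate(nameInOldFile);
        System.out.println(fish.describe());
    }
}
```

Because the shim adds no fields or methods, all serialization logic stays in the new class, and deleting the shim later touches no other code.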

Source: https://stackoverflow.com/questions/18884666/handling-writables-fully-qualified-name-changes-in-hadoop-sequencefile

Tags

serialization

Hadoop

sequencefile