Why is the SequenceFile truncated?

Posted by 我的梦境 on 2020-01-15 03:01:05

Question


I am learning Hadoop, and this problem has baffled me for a while. Basically I am writing a SequenceFile to disk and then reading it back, but every time I get an EOFException while reading. A deeper look reveals that the sequence file is prematurely truncated during the write: the truncation always happens after writing index 962, and the file always ends up at a fixed size of 45056 bytes.

I am using Java 8 and Hadoop 2.5.1 on a MacBook Pro. In fact, I tried the same code on another Linux machine under Java 7, but the same thing happens.

I can rule out the writer and reader not being closed properly. I tried the old-style try/catch with an explicit writer.close() as shown in the code, and also the newer try-with-resources approach. Neither works.

Any help will be highly appreciated.

Following is the code I am using:

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.util.ReflectionUtils;

import static org.apache.hadoop.io.SequenceFile.Writer.keyClass;
import static org.apache.hadoop.io.SequenceFile.Writer.stream;
import static org.apache.hadoop.io.SequenceFile.Writer.valueClass;

public class SequenceFileDemo {

private static final String[] DATA = { "One, two, buckle my shoe",
    "Three, four, shut the door",
    "Five, six, pick up sticks",
    "Seven, eight, lay them straight",
    "Nine, ten, a big fat hen" };

public static void main(String[] args) throws Exception {
    String uri = "file:///Users/andy/Downloads/puzzling.seq";
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    Path path = new Path(uri);      
    IntWritable key = new IntWritable();
    Text value = new Text();

    // API change: using the newer options-based createWriter
    try {
        SequenceFile.Writer writer = SequenceFile.createWriter(conf, 
            stream(fs.create(path)),
            keyClass(IntWritable.class),
            valueClass(Text.class));

        for (int i = 0; i < 1024; i++) {
            key.set(i);
            value.clear();
            value.set(DATA[i % DATA.length]);

            writer.append(key, value);
            if ((i - 1) % 100 == 0) writer.hflush();
            System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
        }

        writer.close();

    } catch (Exception e) {
        e.printStackTrace();
    }


    try {
        SequenceFile.Reader reader = new SequenceFile.Reader(conf, 
                SequenceFile.Reader.file(path));
        Class<?> keyClass = reader.getKeyClass();
        Class<?> valueClass = reader.getValueClass();

        boolean isWritableSerialization = false;
        try {
            keyClass.asSubclass(WritableComparable.class);
            isWritableSerialization = true;
        } catch (ClassCastException e) {
            // key class is not a WritableComparable; fall through
        }

        if (isWritableSerialization) {
            WritableComparable<?> rKey = (WritableComparable<?>) ReflectionUtils.newInstance(keyClass, conf);
            Writable rValue = (Writable) ReflectionUtils.newInstance(valueClass, conf);
            while(reader.next(rKey, rValue)) {
                System.out.printf("[%s] %d %s=%s\n",reader.syncSeen(), reader.getPosition(), rKey, rValue);
            }
        } else {
            // make sure io.serializations includes the serialization that was used when the sequence file was written
        }

        reader.close();
    } catch(IOException e) {
        e.printStackTrace();
    }
}

}

Answer 1:


I actually found the error: you are never closing the stream you create in Writer.stream(fs.create(path)).

For some reason close() doesn't propagate down to the stream you just created there. I suppose this is a bug, but I'm too lazy to look it up in Jira right now.

One way to fix the problem is simply to use Writer.file(path) instead.
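
For example, a minimal sketch of the Writer.file(path) variant, reusing the conf, path, and DATA from the question; since the writer opens the stream itself, closing the writer also closes the file:

    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            Writer.file(path),
            Writer.keyClass(IntWritable.class),
            Writer.valueClass(Text.class))) {
        for (int i = 0; i < 1024; i++) {
            // the writer owns the underlying stream here, so close()
            // flushes and closes the file; all 1024 records survive
            writer.append(new IntWritable(i), new Text(DATA[i % DATA.length]));
        }
    }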

Obviously, you can also just close the created stream explicitly. Find my corrected example below:

    Path path = new Path("file:///tmp/puzzling.seq");

    try (FSDataOutputStream stream = fs.create(path)) {
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf, Writer.stream(stream),
                Writer.keyClass(IntWritable.class), Writer.valueClass(NullWritable.class))) {

            for (int i = 0; i < 1024; i++) {
                writer.append(new IntWritable(i), NullWritable.get());
            }
        }
    }

    try (SequenceFile.Reader reader = new SequenceFile.Reader(conf, Reader.file(path))) {
        Class<?> keyClass = reader.getKeyClass();
        Class<?> valueClass = reader.getValueClass();

        WritableComparable<?> rKey = (WritableComparable<?>) ReflectionUtils.newInstance(keyClass, conf);
        Writable rValue = (Writable) ReflectionUtils.newInstance(valueClass, conf);
        while (reader.next(rKey, rValue)) {
            System.out.printf("%s = %s\n", rKey, rValue);
        }

    }



Answer 2:


I think you are missing writer.close() after the write loop. That should guarantee a final flush before you start reading.
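
A minimal sketch of a close() that is guaranteed to run even if append() throws (using Writer.file(path) so the writer also owns and closes the underlying stream; conf, path, and DATA as in the question):

    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            Writer.file(path),
            Writer.keyClass(IntWritable.class),
            Writer.valueClass(Text.class));
    try {
        for (int i = 0; i < 1024; i++) {
            writer.append(new IntWritable(i), new Text(DATA[i % DATA.length]));
        }
    } finally {
        writer.close(); // final flush happens here, even if append() throws
    }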




Answer 3:


Thanks to Thomas.

It boils down to whether the writer "owns" the stream or not. When creating the writer, if we pass in the option Writer.file(path), the writer "owns" the underlying stream it creates internally and will close it when close() is called. But if we pass in Writer.stream(aStream), the writer assumes someone else is responsible for that stream and won't close it when close() is called. In short, it is not a bug; I just did not understand it well enough.
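
To summarize the ownership rule in code (an illustrative sketch; conf, fs, and path as above):

    // Case 1: Writer.file(path) -- the writer opens the stream itself,
    // owns it, and closes it inside writer.close().
    try (SequenceFile.Writer owned = SequenceFile.createWriter(conf,
            Writer.file(path),
            Writer.keyClass(IntWritable.class),
            Writer.valueClass(Text.class))) {
        owned.append(new IntWritable(1), new Text("owned stream"));
    } // file is fully flushed and closed here

    // Case 2: Writer.stream(out) -- the caller owns the stream, so
    // writer.close() does NOT close it; the caller must do that.
    try (FSDataOutputStream out = fs.create(path);
         SequenceFile.Writer borrowed = SequenceFile.createWriter(conf,
                 Writer.stream(out),
                 Writer.keyClass(IntWritable.class),
                 Writer.valueClass(Text.class))) {
        borrowed.append(new IntWritable(2), new Text("borrowed stream"));
    } // writer is closed first, then the stream -- both guaranteed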



Source: https://stackoverflow.com/questions/27916872/why-the-sequencefile-is-truncated
