How do I get last modified date from a Hadoop Sequence File?

Submitted by 房东的猫 on 2019-12-22 04:19:16

Question


I am using a mapper that converts BinaryFiles (jpegs) to a Hadoop Sequence File (HSF):

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {

    String uri = value.toString().replace(" ", "%20");
    Configuration conf = new Configuration();

    FSDataInputStream in = null;
    try {
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        in = fs.open(new Path(uri));
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024 * 1024];
        int bytesRead;
        // write only the bytes actually read; writing the whole buffer
        // each pass would append stale bytes from earlier reads
        while ((bytesRead = in.read(buffer, 0, buffer.length)) >= 0) {
            bout.write(buffer, 0, bytesRead);
        }
        context.write(value, new BytesWritable(bout.toByteArray()));
    } finally {
        IOUtils.closeStream(in);
    }
}

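The copy loop must write only the bytes each `read()` call actually returns, never the whole buffer. A self-contained sketch of that pattern, using in-memory streams so it runs without Hadoop (the class name `CopyLoop` is just for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class CopyLoop {
    // copy an InputStream fully, writing only the bytes each read() returns
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024 * 1024];
        int bytesRead;
        while ((bytesRead = in.read(buffer, 0, buffer.length)) >= 0) {
            bout.write(buffer, 0, bytesRead);
        }
        return bout.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello sequence file".getBytes(StandardCharsets.UTF_8);
        byte[] copy = readFully(new ByteArrayInputStream(data));
        System.out.println(copy.length); // prints 19
    }
}
```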
I then have a second mapper that reads the HSF, thus:

public class ImagePHashMapper extends Mapper<Text, BytesWritable, Text, Text> {

    public void map(Text key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // get the PHash for this specific file
        // note: BytesWritable.getBytes() may return a padded backing buffer;
        // copyBytes() is safer if the image size matters
        String PHashStr;
        try {
            PHashStr = calculatePhash(value.getBytes());

and calculatePhash is:

static String calculatePhash(byte[] imageData) throws NoSuchAlgorithmException {
    // get the PHash for this specific data
    // PHash requires an InputStream rather than a byte array
    InputStream is = new ByteArrayInputStream(imageData);
    String ph;
    try {
        ImagePHash ih = new ImagePHash();
        ph = ih.getHash(is);
        System.out.println("file: " + is.toString() + " phash: " + ph);
    } catch (Exception e) {
        e.printStackTrace();
        return "Internal error with ImagePHash.getHash";
    }

    return ph;
}

This all works fine, but I want calculatePhash to write out each jpeg's last modified date. I know I can use file.lastModified() to get the last modified date of a file, but is there any way to get this in either map or calculatePhash? I'm a noob at Java. TIA!


Answer 1:


Hi, I think what you want is the modification time of each input file that enters your mapper. If so, you just have to add a few lines to mpkorstanje's solution:

FileSystem fs = FileSystem.get(URI.create(uri), conf);
long modificationTime = fs
    .getFileStatus(((FileSplit) context.getInputSplit()).getPath())
    .getModificationTime();

With these few changes you can get the FileStatus of each InputSplit, and you can add it to your key for use later in your process, or write it somewhere else in your reduce phase using MultipleOutputs.

I hope this is useful.




Answer 2:


Haven't used Hadoop much, but I don't think you should use file.lastModified(). Hadoop abstracts the file system somewhat.

Have you tried using FileSystem.getFileStatus(path) in map? It gets you a FileStatus object that has a modification time. Something like:

FileSystem fs = FileSystem.get(URI.create(uri), conf);
long modificationTime = fs.getFileStatus(new Path(uri)).getModificationTime();



Answer 3:


Use the following code snippet to get a map of all the files, with their modification times, under the directory path you provide:

private static HashMap<Path, Long> lastModifiedFileList(FileSystem fs, Path rootDir) {
    HashMap<Path, Long> modifiedList = new HashMap<>();
    try {
        FileStatus[] status = fs.listStatus(rootDir);
        for (FileStatus file : status) {
            modifiedList.put(file.getPath(), file.getModificationTime());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return modifiedList;
}
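A standalone sketch of how the returned map might be consumed, here picking the most recently modified entry (plain Strings stand in for Hadoop Path objects, and the class name `NewestFile` and the paths are just illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class NewestFile {
    // find the path with the largest modification time in the map
    static String newestPath(Map<String, Long> modifiedList) {
        String newest = null;
        long newestTime = Long.MIN_VALUE;
        for (Map.Entry<String, Long> e : modifiedList.entrySet()) {
            if (e.getValue() > newestTime) {
                newestTime = e.getValue();
                newest = e.getKey();
            }
        }
        return newest;
    }

    public static void main(String[] args) {
        // stand-in for lastModifiedFileList()'s result: path -> epoch millis
        Map<String, Long> modifiedList = new HashMap<>();
        modifiedList.put("/data/a.jpg", 1416700800000L);
        modifiedList.put("/data/b.jpg", 1416787200000L);
        System.out.println(newestPath(modifiedList)); // prints /data/b.jpg
    }
}
```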



Answer 4:


In Hadoop, each file is made up of blocks. The Hadoop FileSystem classes live in the package org.apache.hadoop.fs; if your input files are in HDFS, you need to import that package:

FileSystem fs = FileSystem.get(URI.create(uri), conf);
in = fs.open(new Path(uri));

org.apache.hadoop.fs.FileStatus fileStatus = fs.getFileStatus(new Path(uri));
long modificationDate = fileStatus.getModificationTime();

Date date = new Date(modificationDate);
SimpleDateFormat df2 = new SimpleDateFormat("dd/MM/yy HH:mm:ss");
String dateText = df2.format(date);

I hope this helps.
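Since getModificationTime() returns epoch milliseconds, the formatting step can be tried standalone without Hadoop. A minimal sketch (the class name `ModTimeFormat` and the sample timestamp are just for illustration; the time zone is pinned to UTC so the output is deterministic):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class ModTimeFormat {
    // format epoch millis, as returned by FileStatus.getModificationTime()
    static String format(long modificationTime) {
        SimpleDateFormat df = new SimpleDateFormat("dd/MM/yy HH:mm:ss");
        // pin the zone so the result does not depend on the local machine
        df.setTimeZone(TimeZone.getTimeZone("UTC"));
        return df.format(new Date(modificationTime));
    }

    public static void main(String[] args) {
        // 1416700800000L = 2014-11-23 00:00:00 UTC
        System.out.println(format(1416700800000L)); // prints 23/11/14 00:00:00
    }
}
```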



Source: https://stackoverflow.com/questions/26936932/how-do-i-get-last-modified-date-from-a-hadoop-sequence-file
