Hadoop: how to access (many) photo images to be processed by map/reduce?

悲哀的现实 2020-12-08 05:22

I have 10M+ photos saved on the local file system. Now I want to go through each of them to analyze the binary of the photo to see if it's a dog. I basically want to do the

3 Answers
  •  北荒 (OP)
     2020-12-08 06:03

    I have 10M+ photos saved on the local file system.

    Assuming it takes about a second to write each file into the sequence file, converting 10M individual files would take roughly 10M seconds, i.e. ~115 days. Parallelizing on a single machine won't help much either, because disk I/O is the bottleneck: reading the photo files and writing the sequence file. Check this Cloudera article on the small files problem; it also references a script that converts a tar file into a sequence file and reports how long the conversion took.
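
    To make that conversion step concrete, here is a minimal single-machine sketch (not code from the question) of packing local photos into one SequenceFile, keyed by file name with the raw bytes as the value; the /photos source directory and the HDFS output path are hypothetical:

    ```java
    import java.io.File;
    import java.nio.file.Files;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PhotosToSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical output location for the packed photos.
            Path out = new Path("hdfs:///user/photos/photos.seq");

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(out),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                // Assumes /photos contains only regular image files.
                for (File photo : new File("/photos").listFiles()) {
                    byte[] bytes = Files.readAllBytes(photo.toPath());
                    // key = file name, value = raw image bytes
                    writer.append(new Text(photo.getName()), new BytesWritable(bytes));
                }
            }
        }
    }
    ```

    Even this loop makes the point of the estimate above: it is a serial pass over 10M files, dominated by disk reads.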

    Basically, the photos have to be processed in a distributed way just to convert them into a sequence file. Back to Hadoop :)

    According to Hadoop: The Definitive Guide:

    As a rule of thumb, each file, directory, and block takes about 150 bytes. So, for example, if you had one million files, each taking one block, you would need at least 300 MB of memory.

    So, directly loading 10M files (each taking one block, i.e. ~20M namespace objects at ~150 bytes each) will require around 3,000 MB of memory just to hold the namespace on the NameNode, never mind streaming the photos across nodes during the execution of the job.

    There should be a better way of solving this problem.


    Another approach is to load the files into HDFS as-is and use CombineFileInputFormat, which packs many small files into a single input split and takes data locality into account when computing the splits. The advantage of this approach is that the files go into HDFS without any conversion, and there is little data shuffling across nodes. A rough sketch of such an input format follows.
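
    Here is a rough sketch under the new MapReduce API. The class names CombinedPhotoInputFormat and WholeFileReader are mine, not from any library; each map record is (file name, raw image bytes), and individual images are never split across mappers:

    ```java
    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

    public class CombinedPhotoInputFormat extends CombineFileInputFormat<Text, BytesWritable> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // never split a single image across mappers
        }

        @Override
        public RecordReader<Text, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) throws IOException {
            // Delegates to WholeFileReader once per file in the combined split.
            return new CombineFileRecordReader<>((CombineFileSplit) split, context, WholeFileReader.class);
        }

        /** Emits one (file name, file bytes) record for its assigned file in the split. */
        public static class WholeFileReader extends RecordReader<Text, BytesWritable> {
            private final Path path;
            private final long length;
            private final TaskAttemptContext context;
            private final Text key = new Text();
            private final BytesWritable value = new BytesWritable();
            private boolean done = false;

            // CombineFileRecordReader instantiates this via reflection with exactly this signature.
            public WholeFileReader(CombineFileSplit split, TaskAttemptContext context, Integer index) {
                this.path = split.getPath(index);
                this.length = split.getLength(index);
                this.context = context;
            }

            @Override public void initialize(InputSplit split, TaskAttemptContext ctx) { }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (done) return false;
                byte[] bytes = new byte[(int) length];
                FileSystem fs = path.getFileSystem(context.getConfiguration());
                try (FSDataInputStream in = fs.open(path)) {
                    IOUtils.readFully(in, bytes, 0, bytes.length); // read the whole image
                }
                key.set(path.getName());
                value.set(bytes, 0, bytes.length);
                done = true;
                return true;
            }

            @Override public Text getCurrentKey() { return key; }
            @Override public BytesWritable getCurrentValue() { return value; }
            @Override public float getProgress() { return done ? 1.0f : 0.0f; }
            @Override public void close() { }
        }
    }
    ```

    In the driver you would set job.setInputFormatClass(CombinedPhotoInputFormat.class) and cap the combined split size (e.g. via the mapreduce.input.fileinputformat.split.maxsize property) so each mapper gets a manageable batch of files.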
