Why can't Hadoop split up a large text file and then compress the splits using gzip?

独厮守ぢ 2020-12-17 01:01

I've recently been looking into Hadoop and HDFS. When you load a file into HDFS, it will normally split the file into 64 MB chunks and distribute these chunks around your cluster.

2 Answers
  •  情话喂你
    2020-12-17 01:20

    HDFS has a deliberately limited scope: it is only a distributed file-system service and does not perform heavy-lifting operations such as compressing the data it stores. The actual work of compression is delegated to distributed execution frameworks like MapReduce, Spark, Tez, etc. Compressing data/files is therefore the concern of the execution framework, not of the file system.
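
    As an illustration of that delegation, here is a minimal sketch (the input/output paths and job name are hypothetical) of an identity MapReduce job that rewrites a text file as gzip-compressed part files; the compression is requested from the execution framework through FileOutputFormat, while HDFS simply stores whatever bytes the job writes:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.compress.GzipCodec;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class CompressWithFramework {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                Job job = Job.getInstance(conf, "gzip-output");  // hypothetical job name
                job.setJarByClass(CompressWithFramework.class);

                // No mapper/reducer set: the default identity classes pass
                // each (byte offset, line) record through unchanged.
                job.setOutputKeyClass(LongWritable.class);
                job.setOutputValueClass(Text.class);

                FileInputFormat.addInputPath(job, new Path("/data/large.txt"));        // hypothetical path
                FileOutputFormat.setOutputPath(job, new Path("/data/large-gzipped"));  // hypothetical path

                // Compression is configured on the job, i.e. on the execution
                // framework; HDFS never compresses anything on its own.
                FileOutputFormat.setCompressOutput(job, true);
                FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }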

    Additionally, the presence of container file formats like SequenceFile, Parquet, etc. removes the need for HDFS to compress data blocks automatically, as suggested in the question: these formats compress their contents internally while still remaining splittable.
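
    For example, a block-compressed SequenceFile can be written as sketched below (the path and record contents are hypothetical); the container format compresses groups of records internally and keeps sync markers between blocks, so the file stays splittable without HDFS knowing anything about codecs:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IOUtils;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.compress.CompressionCodec;
        import org.apache.hadoop.io.compress.GzipCodec;
        import org.apache.hadoop.util.ReflectionUtils;

        public class WriteBlockCompressedSequenceFile {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                Path path = new Path("/data/records.seq");  // hypothetical path
                CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

                SequenceFile.Writer writer = null;
                try {
                    writer = SequenceFile.createWriter(conf,
                            SequenceFile.Writer.file(path),
                            SequenceFile.Writer.keyClass(LongWritable.class),
                            SequenceFile.Writer.valueClass(Text.class),
                            // BLOCK compression: groups of records are compressed
                            // together, and sync markers between blocks keep the
                            // file splittable for MapReduce/Spark.
                            SequenceFile.Writer.compression(
                                    SequenceFile.CompressionType.BLOCK, codec));

                    for (long i = 0; i < 1000; i++) {
                        writer.append(new LongWritable(i), new Text("record-" + i));  // dummy records
                    }
                } finally {
                    IOUtils.closeStream(writer);
                }
            }
        }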

    So, to summarize: for design-philosophy reasons, any compression of data must be done by the execution engine, not by the file-system service.
