Hadoop gzip compressed files

攒了一身酷 2020-12-09 10:37

I am new to Hadoop and am trying to process the Wikipedia dump. It's a 6.7 GB gzip compressed XML file. I read that Hadoop supports gzip compressed files, but that they can only be processed by a single mapper because gzip is not splittable. Is there a way around this?

4 Answers
  •  猫巷女王i
    2020-12-09 10:47

    GZIP files cannot be split, due to a limitation of the codec, so a single mapper would have to read the whole file. 6.7 GB really isn't that big, though, so just decompress it on a single machine (it will take less than an hour) and copy the uncompressed XML up to HDFS. Then you can process the Wikipedia XML in Hadoop.
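
    If you would rather not keep an uncompressed copy on local disk, one option is to stream the decompression straight into HDFS with Hadoop's FileSystem API. Below is a minimal sketch of that idea; the class name and argument layout are made up for illustration, and it assumes `fs.defaultFS` in your Configuration already points at the cluster.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Decompress a local .gz file and write the plain XML directly into HDFS,
// so no intermediate uncompressed file is needed on the local machine.
public class GunzipToHdfs {
  public static void main(String[] args) throws Exception {
    // args[0]: local path to the gzipped dump (e.g. enwiki-pages-articles.xml.gz)
    // args[1]: HDFS destination path for the uncompressed XML
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    try (InputStream in = new GZIPInputStream(new FileInputStream(args[0]));
         OutputStream out = fs.create(new Path(args[1]))) {
      // Copy with a 64 KB buffer; streams are closed by try-with-resources.
      IOUtils.copyBytes(in, out, 64 * 1024, false);
    }
  }
}
```

    Alternatively, as the answer suggests, simply gunzip the file locally and copy the resulting XML to HDFS with the usual `hadoop fs -put`.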

    Cloud9 contains a WikipediaPageInputFormat class that you can use to read the XML in Hadoop.
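
    Below is a rough sketch of how a job might wire up WikipediaPageInputFormat once the uncompressed XML is in HDFS. The package name, key type, and the WikipediaPage accessors shown here are assumptions about Cloud9's API (it has historically shipped old- and new-API variants), so check the Cloud9 source for the exact signatures before using this.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Assumed Cloud9 classes -- verify the package and key/value types in the Cloud9 docs.
import edu.umd.cloud9.collection.wikipedia.WikipediaPage;
import edu.umd.cloud9.collection.wikipedia.WikipediaPageInputFormat;

public class WikipediaTitles {
  // Map-only job that emits the title of every article page.
  public static class TitleMapper
      extends Mapper<LongWritable, WikipediaPage, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, WikipediaPage page, Context context)
        throws IOException, InterruptedException {
      // isArticle()/getTitle() are assumed WikipediaPage accessors.
      if (page.isArticle()) {
        context.write(new Text(page.getTitle()), ONE);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "wikipedia titles");
    job.setJarByClass(WikipediaTitles.class);
    job.setInputFormatClass(WikipediaPageInputFormat.class);
    job.setMapperClass(TitleMapper.class);
    job.setNumReduceTasks(0);                      // map-only: write mapper output directly
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // uncompressed XML in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

    The job is map-only because each page can be handled independently; the input format takes care of splitting the (now uncompressed) XML into individual Wikipedia pages across mappers.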
