Hadoop gzip compressed files

攒了一身酷 2020-12-09 10:37

I am new to Hadoop and am trying to process the Wikipedia dump. It's a 6.7 GB gzip compressed XML file. I read that Hadoop supports gzip compressed files, but that they can only be processed by a single mapper because gzip is not splittable. Is there a way around this?

4 Answers
  •  猫巷女王i
    2020-12-09 10:47

    GZIP files cannot be split, due to a limitation of the codec, so a single mapper would have to read the whole file. 6.7 GB really isn't that big, though, so just decompress it on a single machine (it will take less than an hour) and copy the uncompressed XML up to HDFS. Then you can process the Wikipedia XML in Hadoop.
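
    If you would rather not keep an uncompressed copy on local disk, one option is to stream the decompression straight into HDFS with Hadoop's FileSystem API. Below is a minimal sketch of that idea; the class name and argument layout are made up for illustration, and it assumes `fs.defaultFS` in your Configuration already points at the cluster.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Decompress a local .gz file and write the plain XML directly into HDFS,
// so no intermediate uncompressed file is needed on the local machine.
public class GunzipToHdfs {
  public static void main(String[] args) throws Exception {
    // args[0]: local path to the gzipped dump (e.g. enwiki-pages-articles.xml.gz)
    // args[1]: HDFS destination path for the uncompressed XML
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    try (InputStream in = new GZIPInputStream(new FileInputStream(args[0]));
         OutputStream out = fs.create(new Path(args[1]))) {
      // Copy with a 64 KB buffer; streams are closed by try-with-resources.
      IOUtils.copyBytes(in, out, 64 * 1024, false);
    }
  }
}
```

    Alternatively, as the answer suggests, simply gunzip the file locally and copy the resulting XML to HDFS with the usual `hadoop fs -put`.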

    Cloud9 contains a WikipediaPageInputFormat class that you can use to read the XML in Hadoop.
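
    Below is a rough sketch of how a job might wire up WikipediaPageInputFormat once the uncompressed XML is in HDFS. The package name, key type, and the WikipediaPage accessors shown here are assumptions about Cloud9's API (it has historically shipped old- and new-API variants), so check the Cloud9 source for the exact signatures before using this.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Assumed Cloud9 classes -- verify the package and key/value types in the Cloud9 docs.
import edu.umd.cloud9.collection.wikipedia.WikipediaPage;
import edu.umd.cloud9.collection.wikipedia.WikipediaPageInputFormat;

public class WikipediaTitles {
  // Map-only job that emits the title of every article page.
  public static class TitleMapper
      extends Mapper<LongWritable, WikipediaPage, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, WikipediaPage page, Context context)
        throws IOException, InterruptedException {
      // isArticle()/getTitle() are assumed WikipediaPage accessors.
      if (page.isArticle()) {
        context.write(new Text(page.getTitle()), ONE);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "wikipedia titles");
    job.setJarByClass(WikipediaTitles.class);
    job.setInputFormatClass(WikipediaPageInputFormat.class);
    job.setMapperClass(TitleMapper.class);
    job.setNumReduceTasks(0);                      // map-only: write mapper output directly
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // uncompressed XML in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

    The job is map-only because each page can be handled independently; the input format takes care of splitting the (now uncompressed) XML into individual Wikipedia pages across mappers.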
