Use wget with Hadoop?

Submitted by 梦想的初衷 on 2019-12-24 04:19:11

Question


I have a dataset (~31GB, a zipped file with the .gz extension) hosted at a web location, and I want to run my Hadoop program on it. The program is a slight modification of the original WordCount example that ships with Hadoop. In my case, Hadoop is installed on a remote machine (to which I connect via ssh to run my jobs). The problem is that I can't transfer this large dataset to my home directory on the remote machine (due to a disk usage quota). So I searched for a way to use wget to fetch the dataset and pass it directly into HDFS (without saving it to my local account on the remote machine), but had no luck. Does such a way even exist? Any other suggestions to get this working?

I've already tried the Yahoo! VM that comes pre-configured with Hadoop, but it's too slow and, since the dataset is so large, it runs out of memory.


Answer 1:


Check out the answer here: putting a remote file into hadoop without copying it to local disk

You can pipe the data from wget directly into HDFS, as sketched below.
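For example, a minimal sketch of that pipe, assuming a placeholder URL and HDFS path (hadoop fs -put reads from stdin when the source is given as "-"):

    # Stream the download straight into HDFS without touching the local
    # disk quota; "-qO -" sends wget's output to stdout, and "-put -"
    # tells Hadoop to read from stdin. URL and path are placeholders.
    wget -qO - http://example.com/dataset.gz | hadoop fs -put - /user/me/dataset.gz

Since nothing is written to the local filesystem (beyond pipe buffers), this sidesteps the disk quota entirely.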

However, you will have a problem: gzip is not a splittable compression format, so Hadoop cannot divide a single .gz file among multiple mappers, and you won't get a distributed map/reduce over it.

I suggest you download the file, unzip it, and then either pipe the uncompressed data into HDFS or split it into multiple files and load those into HDFS; both variants are sketched below.
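A rough sketch of both variants, again with placeholder URLs, sizes, and HDFS paths:

    # Option A: stream-decompress straight into HDFS, still without a
    # local copy. Uncompressed text in HDFS is split by block, so
    # map/reduce can parallelize over it.
    wget -qO - http://example.com/dataset.gz | gunzip | hadoop fs -put - /user/me/dataset.txt

    # Option B: download, unzip, and split locally (needs enough local
    # scratch space for the full file), then upload the pieces.
    # "-C 1G" caps each chunk at 1GB without breaking lines, which
    # matters for line-oriented input like WordCount's.
    wget http://example.com/dataset.gz
    gunzip dataset.gz
    split -C 1G dataset chunk-
    hadoop fs -mkdir -p /user/me/dataset
    hadoop fs -put chunk-* /user/me/dataset/

Either way you end up with uncompressed, block-splittable input, so the job is no longer limited to a single mapper.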



Source: https://stackoverflow.com/questions/20256197/use-wget-with-hadoop
