Question
I have a dataset (~31 GB, a zipped file with the .gz extension) hosted at a web location, and I want to run my Hadoop program on it. The program is a slight modification of the original WordCount example that ships with Hadoop. In my case, Hadoop is installed on a remote machine (which I connect to via ssh and then run my jobs). The problem is that I can't transfer this large dataset to my home directory on the remote machine (due to a disk usage quota). So I looked for a way to use wget to fetch the dataset and pass it directly into HDFS (without saving it to my local account on the remote machine), but had no luck. Does such a way even exist? Any other suggestions to get this working?
I've already tried the Yahoo! VM, which comes pre-configured with Hadoop, but it's too slow and also runs out of memory because the dataset is so large.
Answer 1:
Check out the answer here: putting a remote file into hadoop without copying it to local disk
You can pipe the data from wget to hdfs.
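A minimal sketch of that pipe, with a placeholder URL and HDFS target path (neither comes from the question), might look like this:

    # Stream the download straight into HDFS; nothing is written to the local account.
    # The URL and the HDFS path below are placeholders.
    wget -qO- http://example.com/dataset.gz | hdfs dfs -put - /user/<you>/dataset.gz

Here `wget -qO-` writes the download to stdout, and `hdfs dfs -put` accepts `-` as its source, which makes it read from stdin, so the 31 GB never lands in the quota-limited home directory.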
However, you will have a problem: a gzipped file is not splittable, so the map phase will run in a single mapper rather than as a distributed map/reduce.
I suggest you download the file locally, unzip it, and then either pipe it in or split it into multiple files and load them into HDFS.
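A rough sketch of that second suggestion, again with placeholder names and assuming GNU split and enough local scratch space, might be:

    # Fetch, uncompress, and split locally, then load the pieces into HDFS
    # so the mappers can run in parallel. Names and sizes are placeholders.
    wget http://example.com/dataset.gz
    gunzip dataset.gz                          # leaves an uncompressed file named "dataset"
    split -C 1G dataset dataset.part.          # ~1 GB chunks, without breaking lines
    hdfs dfs -put dataset.part.* /user/<you>/input/

If the disk quota rules out even a temporary local copy, gunzip can also be placed inside the pipe from the sketch above, so only the uncompressed stream reaches HDFS.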
Source: https://stackoverflow.com/questions/20256197/use-wget-with-hadoop