Question
I have a dataset (~31 GB, a zipped file with the .gz extension) hosted at a web location, and I want to run my Hadoop program on it. The program is a slight modification of the original WordCount example that ships with Hadoop. In my case, Hadoop is installed on a remote machine (which I connect to via ssh and then run my jobs). The problem is that I can't transfer this large dataset to my home directory on the remote machine (due to a disk usage quota). So I looked for a way to use wget to fetch the dataset and pass it directly into HDFS (without saving it to my local account on the remote machine), but had no luck. Does such a way even exist? Any other suggestions to get this working?
I've already tried the Yahoo! VM, which comes pre-configured with Hadoop, but it's too slow and also runs out of memory because the dataset is so large.
Answer 1:
Check out the answer here: putting a remote file into hadoop without copying it to local disk
You can pipe the data from wget to hdfs.
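A minimal sketch of that pipe, with a placeholder URL and HDFS target path (neither comes from the question), might look like this:

    # Stream the download straight into HDFS; nothing is written to the local account.
    # The URL and the HDFS path below are placeholders.
    wget -qO- http://example.com/dataset.gz | hdfs dfs -put - /user/<you>/dataset.gz

Here `wget -qO-` writes the download to stdout, and `hdfs dfs -put` accepts `-` as its source, which makes it read from stdin, so the 31 GB never lands in the quota-limited home directory.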
However, you will have a problem: a gzipped file is not splittable, so the map phase will run in a single mapper rather than as a distributed map/reduce.
I suggest you download the file locally, unzip it, and then either pipe it in or split it into multiple files and load them into HDFS.
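A rough sketch of that second suggestion, again with placeholder names and assuming GNU split and enough local scratch space, might be:

    # Fetch, uncompress, and split locally, then load the pieces into HDFS
    # so the mappers can run in parallel. Names and sizes are placeholders.
    wget http://example.com/dataset.gz
    gunzip dataset.gz                          # leaves an uncompressed file named "dataset"
    split -C 1G dataset dataset.part.          # ~1 GB chunks, without breaking lines
    hdfs dfs -put dataset.part.* /user/<you>/input/

If the disk quota rules out even a temporary local copy, gunzip can also be placed inside the pipe from the sketch above, so only the uncompressed stream reaches HDFS.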
Source: https://stackoverflow.com/questions/20256197/use-wget-with-hadoop