I need to write data in to Hadoop (HDFS) from external sources like a windows box. Right now I have been copying the data onto the namenode and using HDFS\'s put command to
For the problem of loading the data I needed to put into HDFS, I choose to turn the problem around.
Instead of uploading the files to HDFS from the server where they resided, I wrote a Java Map/Reduce job where the mapper read the file from the file server (in this case via https), then write it directly to HDFS (via the Java API).
The list of files is read from the input. I then have an external script that populates a file with the list of files to fetch, uploads the file into HDFS (using hadoop dfs -put), then start the map/reduce job with a decent number of mappers.
This gives me excellent transfer performance, since multiple files are read/written at the same time.
Maybe not the answer you were looking for, but hopefully helpful anyway :-).