Transferring files from remote node to HDFS with Flume

Submitted by 点点圈 on 2019-12-10 02:43:39

Question


I have a bunch of binary files compressed into *.gz format. These are generated on a remote node and must be transferred to HDFS on one of the datacenter's servers.

I'm exploring the option of sending the files with Flume. I looked into doing this with a Spooling Directory configuration, but apparently that only works when the directory being watched is local to the node running the Flume agent.

Any suggestions on how to tackle this problem?


Answer 1:


There is no out-of-the-box solution for such a case, but you could try these workarounds:

  1. You could create your own source implementation for this purpose (using the Flume SDK). For example, this project appears to connect to a remote directory over SSH and use it as a source.
  2. You could create a scheduled script that periodically copies the remote files into a local spool directory, and then use that directory as a Spooling Directory source for the Flume agent (a config sketch follows this list).
  3. You could write another script that reads your remote data and writes it to its standard output, and use that script with an Exec source.
  4. You could run your Flume agent on the machine where the data is located (see Can Spool Dir of flume be in remote machine?).
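
As an illustration of option 2, here is a minimal sketch of an agent config that watches a local spool directory and writes to HDFS. The agent name, channel and sink names, the spool path and the namenode address are all hypothetical and would need to match your environment:

    # Minimal agent config for option 2 (all names and paths are illustrative).
    # A scheduled job (e.g. cron + scp/rsync) is assumed to drop completed *.gz
    # files into /var/flume/spool on the node running this agent.
    agent.sources  = spool-src
    agent.channels = mem-ch
    agent.sinks    = hdfs-sink

    # Spooling Directory source: only give it files that are fully written,
    # i.e. copy to a temporary name first and rename into the spool dir.
    agent.sources.spool-src.type     = spooldir
    agent.sources.spool-src.spoolDir = /var/flume/spool
    agent.sources.spool-src.channels = mem-ch
    # Note: the default deserializer is line-oriented; binary .gz payloads
    # usually need a whole-file (blob) style deserializer instead.

    agent.channels.mem-ch.type     = memory
    agent.channels.mem-ch.capacity = 10000

    # HDFS sink writing the raw bytes without re-encoding them as text.
    agent.sinks.hdfs-sink.type          = hdfs
    agent.sinks.hdfs-sink.channel       = mem-ch
    agent.sinks.hdfs-sink.hdfs.path     = hdfs://namenode:8020/data/incoming/%Y-%m-%d
    agent.sinks.hdfs-sink.hdfs.fileType = DataStream
    agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true

The same channel and sink sections would also work with option 3 by swapping the spooldir source for an exec source whose command runs your fetch script.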



Answer 2:


Why don't you run two different Flume agents, one on the remote machine and one on your data node? The agent on the remote machine can read from a spooling directory and send events to an Avro sink, and the agent on the datanode can receive them through an Avro source and write the data to HDFS.
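
A rough sketch of such a two-hop setup, assuming the remote agent can reach the datanode on port 4545; all agent names, hosts, ports and paths here are illustrative, not taken from the question:

    # --- Agent on the remote machine ---
    remote.sources  = spool-src
    remote.channels = file-ch
    remote.sinks    = avro-sink

    remote.sources.spool-src.type     = spooldir
    remote.sources.spool-src.spoolDir = /data/outgoing
    remote.sources.spool-src.channels = file-ch

    # File channel survives agent restarts, which helps over an unreliable link.
    remote.channels.file-ch.type = file

    remote.sinks.avro-sink.type     = avro
    remote.sinks.avro-sink.channel  = file-ch
    remote.sinks.avro-sink.hostname = datanode.example.com
    remote.sinks.avro-sink.port     = 4545

    # --- Agent on the datanode ---
    collector.sources  = avro-src
    collector.channels = file-ch
    collector.sinks    = hdfs-sink

    collector.sources.avro-src.type     = avro
    collector.sources.avro-src.bind     = 0.0.0.0
    collector.sources.avro-src.port     = 4545
    collector.sources.avro-src.channels = file-ch

    collector.channels.file-ch.type = file

    collector.sinks.hdfs-sink.type          = hdfs
    collector.sinks.hdfs-sink.channel       = file-ch
    collector.sinks.hdfs-sink.hdfs.path     = hdfs://namenode:8020/data/incoming
    collector.sinks.hdfs-sink.hdfs.fileType = DataStream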



Source: https://stackoverflow.com/questions/26168820/transferring-files-from-remote-node-to-hdfs-with-flume
