Question
What is the best approach to implementing a data-availability notification system for Hadoop, such that whenever new data arrives it creates a notification that a job control framework can use to start the jobs that depend on that data? The main concern is that a job should be triggered as soon as its data becomes available, instead of the job polling the NameNode for the availability of data.
Answer 1:
What I would do is use a producer/consumer model in which the two sides communicate through a queue, for example Amazon SQS.
The producer maintains a list of watched directories and runs hadoop fs -test -e /path/to/watched/dir every x seconds (where x should be a parameter). If the command exits with status 0 (which you can check with $?), the directory exists and the producer sends a message to the queue. The content of the message could be just the name of the directory that just appeared, or you could add some metadata and send a JSON object with additional fields.
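The producer side could be sketched roughly as follows. This is only an illustration under some assumptions: the function names (hdfs_path_exists, poll_once, run_producer), the detected_at field, and the seen set that keeps a directory from being reported twice are all made up for this example, and send stands for whatever call puts a message on your queue.

```python
import json
import subprocess
import time


def hdfs_path_exists(path):
    """Return True if `hadoop fs -test -e <path>` exits with status 0."""
    result = subprocess.run(["hadoop", "fs", "-test", "-e", path])
    return result.returncode == 0


def poll_once(watched, seen, exists=hdfs_path_exists):
    """Check each watched path once; return JSON messages for new arrivals."""
    messages = []
    for path in watched:
        if path not in seen and exists(path):
            seen.add(path)  # remember the path so it is reported only once
            # "detected_at" is a hypothetical extra metadata field
            messages.append(json.dumps({"path": path, "detected_at": time.time()}))
    return messages


def run_producer(watched, send, interval_seconds, exists=hdfs_path_exists):
    """Poll forever, handing each new message to the `send` callable."""
    seen = set()
    while True:
        for message in poll_once(watched, seen, exists):
            send(message)
        time.sleep(interval_seconds)  # the "x seconds" parameter
```

Keeping the existence check as a replaceable argument makes it possible to try the polling logic without a Hadoop installation, for example by passing a lambda that looks paths up in a set.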
On the other side, the consumer checks the queue every y seconds (where y should be a parameter), and as soon as a new message arrives it starts your job on that directory.
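A minimal sketch of the consumer loop might look like this. To keep it runnable anywhere, it uses Python's in-process queue.Queue as a stand-in for the real queue, which is an assumption of this example and not Amazon SQS itself; drain_queue, start_job and the "path" message field are hypothetical names.

```python
import json
import queue


def drain_queue(q, start_job, wait_seconds):
    """Handle messages until the queue stays empty for wait_seconds.

    Returns the number of messages handled in this pass.
    """
    handled = 0
    while True:
        try:
            # Block for up to wait_seconds waiting for the next message
            body = q.get(timeout=wait_seconds)
        except queue.Empty:
            return handled
        message = json.loads(body)
        start_job(message["path"])  # trigger the job for this directory
        handled += 1
```

Calling drain_queue in a loop gives the "every y seconds" behaviour. With a real SQS queue, the q.get call would be replaced by that service's receive call, and each message would typically be deleted after the job has been started.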
Source: https://stackoverflow.com/questions/14436748/event-notification-of-data-availability-in-hdfs