Middleware to build data-gathering and monitoring for a distributed system [closed]


Question


I am currently looking for a good middleware on which to build a monitoring and maintenance solution. We are tasked with the challenge of monitoring, gathering data from, and maintaining a distributed system consisting of up to 10,000 individual nodes.

The system is clustered into groups of 5-20 nodes. Each group produces data (as a team) by processing incoming sensor data. Each group has a dedicated node (blue boxes) acting as a facade/proxy for the group, exposing data and state from the group to the outside world. These clusters are geographically separated and may connect to the outside world over different networks (one may run over fiber, another over 3G/Satellite). It is likely we will experience both shorter (seconds/minutes) and longer (hours) outages. The data is persisted by each cluster locally.

This data needs to be collected (continuously and reliably) by external & centralized server(s) (green boxes) for further processing, analysis and viewing by various clients (orange boxes). Also, we need to monitor the state of all nodes through each group's proxy node. It is not required to monitor each node directly, even though it would be good if the middleware could support that (handle heartbeat/state messages from ~10,000 nodes). In case of proxy failure, other methods are available to pinpoint individual nodes.

Furthermore, we need to be able to interact with each node to tweak settings etc., but that seems more easily solved, since it is mostly handled manually per node when needed. Some batch tweaking may be needed, but all in all it looks like a standard RPC situation (Web Services or similar). Of course, if the middleware can handle this too, via some request/response mechanism, that would be a plus.

Requirements:

  • 1000+ nodes publishing/offering continuous data
  • Data needs to be reliably (in some way) and continuously gathered to one or more servers. This will likely be built on top of the middleware using some kind of explicit request/response to ask for lost data. If this could be handled automatically by the middleware this is of course a plus.
  • More than one server/subscriber needs to be able to be connected to the same data producer/publisher and receive the same data
  • Data rate is at most in the range of 10-20 messages per second per group
  • Message sizes range from maybe ~100 bytes to 4-5 kbytes
  • Nodes range from embedded constrained systems to normal COTS Linux/Windows boxes
  • Nodes generally use C/C++, servers and clients generally C++/C#
  • Nodes should (preferably) not need to install additional SW or servers, i.e. one dedicated broker or extra service per node is expensive
  • Security will be message-based, i.e. no transport security needed

We are looking for a solution that can handle the communication between primarily proxy nodes (blue) and servers (green) for the data publishing/polling/downloading and from clients (orange) to individual nodes (RPC style) for tweaking settings.

There seem to be a lot of discussions and recommendations for the reverse situation, distributing data from server(s) to many clients, but it has been harder to find information related to the situation described here. The general solution seems to be to use SNMP, Nagios, Ganglia, etc. to monitor and modify a large number of nodes, but the tricky part for us is the data gathering.

We have briefly looked at solutions like DDS, ZeroMQ, RabbitMQ (broker needed on all nodes?), SNMP, various monitoring tools, Web Services (JSON-RPC, REST/Protocol Buffers) etc.

So, do you have any recommendations for an easy-to-use, robust, stable, light, cross-platform, cross-language middleware (or other) solution that would fit the bill? As simple as possible but not simpler.


Answer 1:


Seems ZeroMQ will fit the bill easily, with no central infrastructure to manage. Since your monitoring servers are fixed, it's really quite a simple problem to solve. This section in the 0MQ Guide may help:

http://zguide.zeromq.org/page:all#Distributed-Logging-and-Monitoring
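
To make the shape of that concrete, here is a minimal sketch (assuming the plain libzmq C API, which is usable from both C and C++; the host name, port, command-line convention and payload format are placeholders): the fixed monitoring server binds a SUB socket, and the many proxy nodes connect PUB sockets to it, so no broker or extra service is needed on the nodes.

    // monitor_sketch.cpp -- build with: g++ monitor_sketch.cpp -lzmq
    #include <zmq.h>
    #include <chrono>
    #include <cstdio>
    #include <cstring>
    #include <thread>

    int main(int argc, char *argv[])
    {
        void *ctx = zmq_ctx_new();

        if (argc > 1 && std::strcmp(argv[1], "server") == 0) {
            // Monitoring server (green box): bind a SUB socket, subscribe to everything.
            void *sub = zmq_socket(ctx, ZMQ_SUB);
            zmq_bind(sub, "tcp://*:5556");
            zmq_setsockopt(sub, ZMQ_SUBSCRIBE, "", 0);

            char buf[8192];
            for (;;) {
                int n = zmq_recv(sub, buf, sizeof buf - 1, 0);
                if (n < 0)
                    break;                          // interrupted / context terminated
                if (n > (int)sizeof buf - 1)
                    n = (int)sizeof buf - 1;        // message was truncated to fit the buffer
                buf[n] = '\0';
                std::printf("received: %s\n", buf);
            }
            zmq_close(sub);
        } else {
            // Proxy node (blue box): connect a PUB socket to the well-known server and
            // publish the group's data; a second zmq_connect() would fan the same
            // stream out to an additional server.
            void *pub = zmq_socket(ctx, ZMQ_PUB);
            zmq_connect(pub, "tcp://monitoring-server.example:5556");   // placeholder host

            for (;;) {
                const char *sample = "group-42 state=OK value=21.3";    // placeholder payload
                zmq_send(pub, sample, std::strlen(sample), 0);
                std::this_thread::sleep_for(std::chrono::milliseconds(100));  // ~10 msg/s per group
            }
        }

        zmq_ctx_term(ctx);
        return 0;
    }

Note that plain PUB/SUB drops messages while a subscriber is disconnected, so the explicit request/response for lost data mentioned in the question would sit on a separate REQ/REP (or DEALER/ROUTER) socket pair alongside this stream.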

You mention "reliability", but could you specify the actual set of failures you want to recover from? If you are using TCP then the network is by definition "reliable" already.




Answer 2:


Disclosure: I am a long-time DDS specialist/enthusiast and I work for one of the DDS vendors.

Good DDS implementations will provide you with what you are looking for. Collection of data and monitoring of nodes is a traditional use-case for DDS and should be its sweet spot. Interacting with nodes and tweaking them is possible as well, for example by using so-called content filters to send data to a particular node. This assumes that you have a means to uniquely identify each node in the system, for example a string or integer ID.
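
As a rough illustration only (this uses the standardized ISO C++ DDS API; the NodeCommand type with its node_id field is a hypothetical IDL-generated type, and the umbrella header and exact constructors vary a little between DDS products), per-node addressing with a content filter could look like this:

    #include <dds/dds.hpp>          // ISO C++ DDS PSM umbrella header (vendor-dependent)
    #include "NodeCommand.hpp"      // hypothetical type generated from NodeCommand.idl
    #include <string>
    #include <vector>

    int main()
    {
        dds::domain::DomainParticipant participant(0);        // domain id 0
        dds::topic::Topic<NodeCommand> topic(participant, "NodeCommand");

        // Each node subscribes only to commands addressed to its own id.
        std::vector<std::string> params = { "node-0042" };     // this node's unique id
        dds::topic::ContentFilteredTopic<NodeCommand> my_commands(
            topic, "NodeCommand_node-0042",
            dds::topic::Filter("node_id = %0", params.begin(), params.end()));

        dds::sub::Subscriber subscriber(participant);
        dds::sub::DataReader<NodeCommand> reader(subscriber, my_commands);

        // Poll for commands (a listener or WaitSet would be used in practice).
        for (const auto& sample : reader.take()) {
            if (sample.info().valid()) {
                // apply sample.data() -- e.g. tweak a setting on this node
            }
        }
        return 0;
    }

A client that wants to tweak a particular node then simply writes a NodeCommand sample with node_id set to that node's ID; the middleware delivers it only to the matching reader.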

Because of the hierarchical nature of the system and its sheer (potential) size, you will probably have to introduce some routing mechanisms to forward data between clusters. Some DDS implementations can provide generic services for that. Bridging to other technologies, like DBMS or web-interfaces, is often supported as well.

Especially if you have multicast at your disposal, discovery of all participants in the system can be done automatically and will require minimal configuration. This is not required though.

To me, it looks like your system is complicated enough to require customization. I do not believe that any solution will "fit the bill easily", especially if your system needs to be fault-tolerant and robust. Most of all, you need to be aware of your requirements. A few words about DDS in the context of the ones you have mentioned:

1000+ nodes publishing/offering continuous data

This is a big number, but should be possible, especially since you have the option to take advantage of the data-partitioning features supported by DDS.
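
For instance (again a sketch against the ISO C++ DDS API; the per-cluster partition naming scheme is an assumption, not something the question prescribes), the standard PARTITION QoS can be used to keep each cluster's data flows separate and let a server choose what it matches:

    #include <dds/dds.hpp>   // ISO C++ DDS PSM (vendor-dependent header)
    #include <string>

    // Writer side on a proxy node: publish into a partition named after the cluster,
    // so a reader only matches the clusters it asks for.
    dds::pub::Publisher make_cluster_publisher(dds::domain::DomainParticipant& participant,
                                               const std::string& cluster_name)
    {
        dds::pub::qos::PublisherQos qos = participant.default_publisher_qos()
            << dds::core::policy::Partition(cluster_name);     // e.g. "cluster-17"
        return dds::pub::Publisher(participant, qos);
    }

    // Reader side on a central server: "cluster-*" matches every cluster; listing
    // specific partitions instead would shard the load across several servers.
    dds::sub::Subscriber make_all_clusters_subscriber(dds::domain::DomainParticipant& participant)
    {
        dds::sub::qos::SubscriberQos qos = participant.default_subscriber_qos()
            << dds::core::policy::Partition("cluster-*");
        return dds::sub::Subscriber(participant, qos);
    }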

Data needs to be reliably (in some way) and continuously gathered to one or more servers. This will likely be built on top of the middleware using some kind of explicit request/response to ask for lost data. If this could be handled automatically by the middleware this is of course a plus.

DDS supports a rich set of so-called Quality of Service (QoS) settings specifying how the infrastructure should treat the data it is distributing. These are name-value pairs set by the developer. Reliability and data availability are among the supported QoS policies. This should take care of your requirement automatically.
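
As an illustration (same caveats: a sketch against the ISO C++ DDS API, with a hypothetical GroupSample topic type; vendor headers and defaults differ), a server-side reader that asks the middleware itself to retransmit lost samples and to catch up a reconnecting reader might be configured roughly like this:

    #include <dds/dds.hpp>            // ISO C++ DDS PSM (vendor-dependent header)
    #include "GroupSample.hpp"        // hypothetical IDL-generated type for group data

    void make_group_data_reader(dds::sub::Subscriber& subscriber,
                                dds::topic::Topic<GroupSample>& topic)
    {
        // RELIABLE: lost samples are retransmitted by the middleware.
        // TRANSIENT_LOCAL: a (re)connecting reader still gets the samples the writer
        //                  has kept, which helps across the shorter outages.
        // KEEP_LAST(1000): how much history is retained per instance (tuning knob).
        dds::sub::qos::DataReaderQos qos = subscriber.default_datareader_qos()
            << dds::core::policy::Reliability::Reliable()
            << dds::core::policy::Durability::TransientLocal()
            << dds::core::policy::History::KeepLast(1000);

        dds::sub::DataReader<GroupSample> reader(subscriber, topic, qos);
        // ... attach a listener or WaitSet and process reader.take() as data arrives.
    }

The writers on the proxy nodes would have to offer at least the same RELIABILITY and DURABILITY for the QoS to match; for the hours-long outages mentioned in the question, a bounded writer history may not be enough, so durable storage (a persistence service, where the vendor offers one) or the explicit re-request mechanism would still be needed.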

More than one server/subscriber needs to be able to be connected to the same data producer/publisher and receive the same data

One-to-many or many-to-many distribution is a common use-case.

Data rate is at most in the range of 10-20 messages per second per group

Adding up to a total maximum of 20,000 messages per second is doable, especially if data-flows are partitioned.

Message sizes range from maybe ~100 bytes to 4-5 kbytes

As long as messages do not get excessively large, the number of messages is typically more limiting than the total amount of kbytes transported over the wire -- unless large messages are of very complicated structure.

Nodes range from embedded constrained systems to normal COTS Linux/Windows boxes

Some DDS implementations support a large range of OS/platform combinations, which can be mixed in a system.

Nodes generally use C/C++, servers and clients generally C++/C#

These are typically supported and can be mixed in a system.

Nodes should (preferably) not need to install additional SW or servers, i.e. one dedicated broker or extra service per node is expensive

Such options are available, but the need for extra services depends on the DDS implementation and the features you want to use.

Security will be message-based, i.e. no transport security needed

That certainly makes life easier.



Source: https://stackoverflow.com/questions/13483809/middleware-to-build-data-gathering-and-monitoring-for-a-distributed-system
