Log files in massively distributed systems

Question

I do a lot of work in the grid and HPC space, and one of the biggest challenges we have with a system distributed across hundreds (or in some cases thousands) of servers is analysing the log files.

Currently, log files are written locally to disk on each blade, but we could also publish logging information using, for example, a UDP appender and collect it centrally.

Given that the objective is to be able to identify problems in as close to real time as possible, what should we do?

Answer 1:

First, synchronize all clocks in the system using NTP.

Second, if you are collecting the logs in a single location (like the UDP appender you mention), make sure the logs have enough information to actually help. I would include at least the server that generated the log, the time it happened, and the message. If there is any sort of transaction ID or job ID concept, include that as well.
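As a rough illustration of those fields, here is a minimal sketch in plain Java that builds one tab-separated line (timestamp, host, job ID, message) and sends it as a UDP datagram. The class name `LogLine`, the field order, the sample host `blade-17`, and port 9999 are all assumptions made for the example, not anything prescribed by log4j or a particular appender.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;
import java.time.Instant;

final class LogLine {
    // Builds one self-describing line: timestamp, host, job id, message.
    static String format(Instant when, String host, String jobId, String message) {
        // Tabs separate the fields, so collapse tabs and newlines in the free-text message.
        String clean = message.replaceAll("[\\t\\r\\n]+", " ");
        return when + "\t" + host + "\t" + jobId + "\t" + clean;
    }

    // Sends the line as a single UDP datagram; delivery is not guaranteed.
    static void send(String line, InetAddress collector, int port) throws Exception {
        byte[] payload = line.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length, collector, port));
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0]: this server's name (hypothetical default), args[1]: collector host (optional).
        String host = args.length > 0 ? args[0] : "blade-17";
        String line = format(Instant.now(), host, "job-42", "worker started");
        System.out.println(line);
        if (args.length > 1) {
            send(line, InetAddress.getByName(args[1]), 9999); // 9999 is an assumed port
        }
    }
}
```

Because the timestamp comes from each sender's own clock, lines from different servers only sort correctly on the collector if the clocks are synchronized, which is why NTP comes first.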

Since you mentioned a UDP Appender, I am guessing you are using log4j (or one of its siblings). Log4j has an MDC class that allows extra information to be passed along through a processing thread. It can help collect some of the extra information and pass it along.
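To show the idea behind MDC without depending on log4j, here is a stdlib-only sketch of a per-thread context map. `LogContext` is a hypothetical stand-in, not log4j's implementation. In log4j 1.x the corresponding calls would be `MDC.put("jobId", id)` and `MDC.remove("jobId")`, with `%X{jobId}` in the layout pattern to print the value.

```java
import java.util.HashMap;
import java.util.Map;

final class LogContext {
    // One map per thread, so concurrent jobs never see each other's values.
    private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(HashMap::new);

    static void put(String key, String value) {
        CONTEXT.get().put(key, value);
    }

    static void remove(String key) {
        CONTEXT.get().remove(key);
    }

    // Prefixes the message with the job id carried by the current thread, or "-" if none.
    static String decorate(String message) {
        String jobId = CONTEXT.get().getOrDefault("jobId", "-");
        return "[job=" + jobId + "] " + message;
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable job = () -> {
            // The thread name doubles as the job id in this demo.
            put("jobId", Thread.currentThread().getName());
            try {
                System.out.println(decorate("processing step"));
            } finally {
                remove("jobId"); // pooled threads are reused, so always clean up
            }
        };
        Thread a = new Thread(job, "job-1");
        Thread b = new Thread(job, "job-2");
        a.start();
        b.start();
        a.join();
        b.join();
    }
}
```

The point of the pattern is that code deep in the call stack can log without being handed the job ID explicitly; whatever was set at the start of the job on that thread is attached to every message.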

Answer 2:

Are you using Apache? If so, you could have a look at mod_log_spread, though your infrastructure may be too big to make it maintainable. The other option is to look at "broadcasting" or "multicasting" your log messages and having dedicated logging servers subscribe to those feeds and collate them.
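A minimal sketch of the multicast option, using only the Java standard library, might look like the following. The group address `239.1.2.3`, port 4446, and the class name `MulticastLog` are assumptions chosen for illustration; whether multicast is actually routed between your blades depends on the network configuration.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.nio.charset.StandardCharsets;

final class MulticastLog {
    // Hypothetical group and port; 239.0.0.0/8 is the administratively scoped range.
    static final String GROUP = "239.1.2.3";
    static final int PORT = 4446;

    // Publisher side: each application server sends its log line to the group.
    static void publish(String line) throws Exception {
        byte[] payload = line.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getByName(GROUP), PORT));
        }
    }

    // Subscriber side: a logging server joins the group and prints what it receives.
    static void subscribe() throws Exception {
        InetSocketAddress group = new InetSocketAddress(InetAddress.getByName(GROUP), PORT);
        try (MulticastSocket socket = new MulticastSocket(PORT)) {
            socket.joinGroup(group, null); // null: use the default network interface
            byte[] buffer = new byte[8192];
            while (true) {
                DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
                socket.receive(packet);
                System.out.println(new String(packet.getData(), 0, packet.getLength(),
                        StandardCharsets.UTF_8));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0 && args[0].equals("subscribe")) {
            subscribe();
        } else if (args.length > 1 && args[0].equals("publish")) {
            publish(args[1]);
        } else {
            System.out.println("usage: MulticastLog subscribe | publish <line>");
        }
    }
}
```

Several logging servers can run `subscribe` at once and each receives every message, so collation can be split or duplicated across them. Like any UDP transport, this can drop messages under load.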

Source: https://stackoverflow.com/questions/35292/log-files-in-massively-distributed-systems

Tags

distributed-computing

hpc