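A recent JoelOnSoftware Q&A thread raised an age-old question: what to log and how to log it. The usual trace/error/warning/info approach is not very useful in a large distributed system. You need to log everything in order to solve the problems you run into.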

Make logging efficient from the start so you aren't afraid to use it. Create a dead-simple log library that makes logging trivial for developers. Document it. Provide example code. Check for it during code reviews.

Log to a separate task and let that task push out log data when it can. Use a preallocated buffer pool for log messages so that memory allocation is just a pop and a push. Log integer values in very time-sensitive code. For less time-sensitive code, sprintf'ing into a preallocated buffer is usually quite fast. When it isn't, you can use reference-counted data structures and do the formatting in the logging thread.

Triggering a log message should take exactly one table lookup, so the performance hit is minimal. Don't do any formatting before it has been determined that the log message is needed. This removes a constant overhead from each log message. Allow fancy stream-based formatting so developers feel free to dump all the data they wish in any format they wish.

In an ISR context, do not take locks or you'll introduce unbounded, variable latency into the system. Format data directly into fixed-size buffers in the log message, so the cost of each message is fixed and known up front. Make the log message directly queueable to the log task so that queuing doesn't require more memory allocations. Memory allocation is a primary source of arbitrary latency and of deadlock, because allocators take locks. Avoid memory allocation in the log path.

Give the logging thread a lower priority so it won't starve the main application thread. Store log messages in a circular queue to limit resource usage. Write log messages to disk in big sequential blocks for efficiency.

Every object in your system should be dumpable to a log message. This makes logging trivial for developers. Tie your logging system into your monitoring system so that all the logging data from every process on every host winds its way to your centralized monitoring system. At the same time you can send all your SLA-related metrics and other stats.
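The hot-path advice above (one table lookup, no formatting until the message is known to be needed, a preallocated pool of message records, and a separate logging task) can be sketched roughly as follows. This is only an illustration in Python, not a reference implementation: the names `AsyncLogger`, `LogRecord`, `LEVELS` and the level constants are all made up for this example. Python also cannot avoid allocation or set thread priority from the standard library, and the sketch takes a lock on the producer side, which is exactly what the text warns against inside an ISR. A real embedded implementation would want a lock-free queue there.

```python
import threading
from collections import deque

# Hypothetical severity levels and per-subsystem threshold table.
DEBUG, INFO, WARN, ERROR = 10, 20, 30, 40
LEVELS = {"net": DEBUG, "disk": WARN}


class LogRecord:
    """Reusable message slot; holds raw arguments, not formatted text."""
    __slots__ = ("subsystem", "level", "fmt", "args")

    def __init__(self):
        self.subsystem = self.level = self.fmt = self.args = None


class AsyncLogger:
    def __init__(self, sink, pool_size=4):
        # All records are created once, up front; logging only pops and pushes.
        self._pool = deque(LogRecord() for _ in range(pool_size))
        self._pending = deque()
        self._cond = threading.Condition()
        self._sink = sink
        self._running = True
        self.dropped = 0
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def log(self, subsystem, level, fmt, *args):
        # One table lookup decides whether the message is needed at all.
        if level < LEVELS.get(subsystem, ERROR):
            return False
        with self._cond:
            if not self._pool:
                self.dropped += 1  # pool exhausted: count the drop, never block
                return False
            rec = self._pool.pop()
            rec.subsystem, rec.level, rec.fmt, rec.args = subsystem, level, fmt, args
            self._pending.append(rec)  # the record itself is what gets queued
            self._cond.notify()
        return True

    def _drain(self):
        while True:
            with self._cond:
                while self._running and not self._pending:
                    self._cond.wait()
                if not self._pending:
                    return  # closed and fully drained
                rec = self._pending.popleft()
            # Formatting happens here, in the logging thread, not in the caller.
            self._sink("[%s] " % rec.subsystem + rec.fmt % rec.args)
            with self._cond:
                rec.args = None
                self._pool.append(rec)  # return the slot to the pool

    def close(self):
        with self._cond:
            self._running = False
            self._cond.notify()
        self._thread.join()


def demo():
    lines = []
    logger = AsyncLogger(lines.append)
    logger.log("net", INFO, "connected to %s:%d", "10.0.0.1", 80)  # enabled
    logger.log("disk", INFO, "flushed %d blocks", 12)              # filtered out
    logger.close()
    return lines
```

The caller only pays for a dictionary lookup when a message is disabled, and for a pop, four assignments and a push when it is enabled. Bounding the pool plays the same role as the circular queue in the text: memory use is capped, and overload shows up as a drop count instead of as latency.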
All of this logging and metrics data can be collected in the background so it doesn't impact performance. Add metadata throughout the request handling process that makes it easy to diagnose problems and to alert on potential future problems. Map software components to subsystems that are individually controllable; cross-application trace levels aren't useful. Add a command port to each process that makes it easy to set program behaviors at run-time and to view important statistics and logging information. Log information like task switch counts and times, queue depths and high and low watermarks, free memory, drop counts, mutex wait times, CPU usage, disk and network I/O, and anything else that may give a full picture of how your software is behaving in the real world.
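As a rough illustration of individually controllable subsystems, a command port, and queue statistics with watermarks and drop counts, here is a small Python sketch. Everything in it is an assumption made for the example: the class names `QueueStats` and `ControlPort`, the `set-level` and `stats` commands, and the output format are all hypothetical. The socket listener that would feed lines to `handle` is left out so the sketch stays self-contained.

```python
LEVEL_NAMES = ("debug", "info", "warn", "error")  # hypothetical level names


class QueueStats:
    """Depth, high/low watermarks and drop count for one bounded queue."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.depth = 0
        self.high_water = 0
        self.low_water = 0
        self.drops = 0

    def on_enqueue(self):
        if self.depth >= self.capacity:
            self.drops += 1  # queue full: record the drop
            return False
        self.depth += 1
        self.high_water = max(self.high_water, self.depth)
        return True

    def on_dequeue(self):
        if self.depth > 0:
            self.depth -= 1
            self.low_water = min(self.low_water, self.depth)

    def reset_watermarks(self):
        # Start a new measurement window from the current depth.
        self.high_water = self.low_water = self.depth


class ControlPort:
    """Parses one text command per line, as a socket listener might feed it."""

    def __init__(self, levels, stats):
        self.levels = levels  # subsystem name -> level name
        self.stats = stats    # queue name -> QueueStats

    def handle(self, line):
        parts = line.split()
        if len(parts) == 3 and parts[0] == "set-level":
            _, subsystem, level = parts
            if level not in LEVEL_NAMES:
                return "error: unknown level %s" % level
            self.levels[subsystem] = level  # affects only this subsystem
            return "ok"
        if parts == ["stats"]:
            return "\n".join(
                "%s depth=%d high=%d low=%d drops=%d"
                % (name, s.depth, s.high_water, s.low_water, s.drops)
                for name, s in sorted(self.stats.items())
            )
        return "error: unknown command"
```

Keeping the level table per subsystem means an operator can turn up detail on, say, the network code of one process without drowning in output from everything else, which is the point of avoiding a single cross-application trace level.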



