What to log
Marcel Koert
Innovative Platform Engineer | DevOps Engineer | Site Reliability Engineer | IT Educator | Founder of Melomar-IT
Quick overview
Every application that you write should have good logging. But what is good logging? Let’s start with a few no-brainers. Your application's uptime is always more important than your logging. If the data that you want to write is too important to lose, then it should not be in a log but in a database of some sort. Log data is transient.
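One way to keep logging from getting in the way of the application is to hand records to a queue and let a background thread do the actual writing. The sketch below uses Python's standard `logging.handlers.QueueHandler` and `QueueListener`; the function name `build_nonblocking_logger`, the `ListHandler` class and the logger name are hypothetical, chosen only for illustration.

```python
import logging
import logging.handlers
import queue


class ListHandler(logging.Handler):
    """Illustrative target handler that keeps messages in memory."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def build_nonblocking_logger(name, target_handler):
    """Return (logger, listener). The logger only enqueues records;
    the listener thread passes them on to target_handler."""
    log_queue = queue.Queue(-1)  # unbounded; a bounded queue is also an option
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    listener = logging.handlers.QueueListener(log_queue, target_handler)
    listener.start()
    return logger, listener


captured = ListHandler()
log, listener = build_nonblocking_logger("queue_demo", captured)
log.info("order %s accepted", 42)  # returns as soon as the record is queued
listener.stop()                    # flushes the queue and stops the thread
```

In a real service the target would be a file or network handler rather than an in-memory list, and `listener.stop()` would be called at shutdown.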
We can define three types of logging:
- Application log: all application failures, starts and stops, with their reasons (see Log levels).
- Audit log: all functional actions done on behalf of consumers, recording who, when and what.
- Access log: an HTTP-based log of which IP connected to which endpoint, at what time, and which status code was returned.
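As a minimal sketch of keeping the three log types apart, the example below gives each type its own named logger and its own output stream, so each can later get its own retention and access rules. The logger names, the helper `make_stream_logger` and the in-memory streams are assumptions for illustration; real deployments would typically write to separate files or log pipelines.

```python
import io
import logging


def make_stream_logger(name, stream):
    """Create a logger that writes only to its own stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep each log type out of the other streams
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


# In-memory streams stand in for three separate log destinations.
app_out, audit_out, access_out = io.StringIO(), io.StringIO(), io.StringIO()
app_log = make_stream_logger("demo.app", app_out)
audit_log = make_stream_logger("demo.audit", audit_out)
access_log = make_stream_logger("demo.access", access_out)

app_log.info("application started")
audit_log.info("who=alice what=update_profile")
access_log.info("10.0.0.5 GET /orders 200")
```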
Log types
Audit
Why audit logging? Inevitably, someone asks why event data should be logged on a given system. Essentially there are four categories of reasons:
- Accountability – Log data can identify what accounts are associated with certain events. This information then can be used to highlight where training and/or disciplinary actions are needed.
- Reconstruction – Log data can be reviewed chronologically to determine what was happening both before and during an event. For this to happen, the accuracy and coordination of system clocks are critical. To accurately trace activity, clocks need to be regularly synchronized to a central source to ensure that the date/time stamps are in sync.
- Intrusion Detection – Unusual or unauthorized events can be detected through the review of log data, assuming that the correct data is being logged and reviewed. The definition of what constitutes unusual activity varies, but can include failed login attempts, login attempts outside of designated schedules, locked accounts, port sweeps, network activity levels, memory utilization, key file/data access, etc.
- Problem Detection – In the same way that log data can be used to identify security events, it can be used to identify problems that need to be addressed, for example by investigating the causes of failed jobs, resource utilization, trends and so on.
What to Log? Essentially, for each system monitored and likely event condition there must be enough data logged for determinations to be made. At a minimum, you need to be able to answer the standard who, what and when questions.
- Was an authentication request a success or a failure?
- Was an authorization request a success or a failure?
Both of these questions can be asked for an individual or, in more general terms, as how many requests failed.
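The who, what and when questions can be sketched as a small structured record. The field names, the helper functions `audit_record` and `failures_per_user`, and the sample users below are hypothetical; timestamps are taken in UTC on the assumption that all systems share a synchronized clock source.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def audit_record(who, what, success, when=None):
    """Build one audit entry that answers who, what and when."""
    when = when or datetime.now(timezone.utc)
    return {
        "when": when.isoformat(),
        "who": who,
        "what": what,
        "outcome": "success" if success else "failure",
    }


def failures_per_user(records, what):
    """Count failed events of one kind, per user."""
    return Counter(
        r["who"] for r in records
        if r["what"] == what and r["outcome"] == "failure"
    )


records = [
    audit_record("alice", "authentication", True),
    audit_record("bob", "authentication", False),
    audit_record("bob", "authentication", False),
    audit_record("bob", "authorization", False),
]
print(json.dumps(records[0]))                 # one line per event
auth_failures = failures_per_user(records, "authentication")
print(auth_failures["bob"], sum(auth_failures.values()))
```

The same records answer both the individual question (how often did one account fail) and the general one (how many requests failed in total).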
Retention period of audit logging. What is a normal time to keep audit logs? I think 90 days should be enough for problems to be noticed and researched, but every industry has its own needs, so it could be a lot longer.
Application
Which events to log
The level and content of security monitoring, alerting and reporting needs to be set during the requirements and design stage of projects, and should be proportionate to the information security risks. This can then be used to define what should be logged. There is no one size fits all solution, and a blind checklist approach can lead to unnecessary "alarm fog" that means real problems go undetected. Where possible, always log:
- Input validation failures e.g. protocol violations, unacceptable encodings, invalid parameter names and values
- Output validation failures e.g. database record set mismatch, invalid data encoding
- Session management failures e.g. cookie session identification value modification
- Application errors and system events e.g. syntax and runtime errors, connectivity problems, performance issues, third party service error messages, file system errors, file upload virus detection, configuration changes
- Application and related systems start-ups and shut-downs, and logging initialization (starting, stopping or pausing)
- Use of higher-risk functionality e.g. network connections, addition or deletion of users, changes to privileges, assigning users to tokens, adding or deleting tokens, use of systems administrative privileges, access by application administrators, all actions by users with administrative privileges, access to payment cardholder data, use of data encrypting keys, key changes, creation and deletion of system-level objects, data import and export including screen-based reports, submission of user-generated content - especially file uploads
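As an illustration of the first item in the list above, the sketch below logs input validation failures for a query string. The allowed parameter names, the function `validate_query`, the logger name and the choice of WARNING as the level are all assumptions made for the example; values are truncated and written with `repr` so that odd input cannot break the log line.

```python
import logging

logger = logging.getLogger("validation_demo")

ALLOWED_PARAMS = {"page", "size"}  # hypothetical parameters of one endpoint


def validate_query(params, client_ip):
    """Return a list of validation failures; each one is also logged."""
    failures = []
    for name, value in params.items():
        if name not in ALLOWED_PARAMS:
            failures.append(f"invalid parameter name {name!r}")
        elif not value.isdigit():
            # Truncate the value so a huge payload cannot flood the log.
            failures.append(f"invalid value for {name!r}: {value[:40]!r}")
    for failure in failures:
        logger.warning("input validation failure ip=%s %s", client_ip, failure)
    return failures


ok = validate_query({"page": "1", "size": "20"}, "10.0.0.5")
bad = validate_query({"page": "abc", "evil": "1"}, "10.0.0.5")
```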
Optionally consider if the following events can be logged and whether it is desirable information:
- Sequencing failure
- Excessive use
- Data changes
- Fraud and other criminal activities
- Suspicious, unacceptable or unexpected behaviour
- Modifications to configuration
- Application code file and/or memory changes
Log levels
INFO
- All important information that we need for normal operations.
- All interesting global information on performance and trends. This logging should be minimal.
Typical information that is interesting:
- Application start-up and shutdown information
- User logging on, user logging off
- Pages (request URIs) being accessed
- Performance of service calls, indicating if the result was retrieved from a cache.
WARN
All information about things that are going wrong but do not need intervention from humans. Something is not as it should be, but everything still works. For example, when fallback content is sent back.
ERROR
All information about things that are going wrong that do need human intervention. One or several sessions (users) are impacted. For example, a service call timed out or a page cannot be found.
DEBUG
All actions done by humans when using the application. Logging that should help an administrator to determine the cause of an error. All logging should be understandable and relevant for administrators.
TRACE
All actions done by the application. All other logging, which should be understandable and relevant for developers. Examples are method entries and exits, results returned from services and databases. This is the only level at which stack traces are allowed. A stack trace at any other level is a program error.
FATAL
Everyone is affected, the entire application is not working. For example, application properties are not present.
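The distinction between WARN, ERROR and FATAL above comes down to two questions: does a human need to act, and is everyone affected? A minimal sketch of that decision in Python follows. Note that Python's standard `logging` module has no TRACE level and treats `FATAL` as an alias of `CRITICAL`, so the TRACE value of 5 and the helper `problem_level` are assumptions added for this example.

```python
import logging

TRACE = 5  # assumption: a custom level below DEBUG (10)
logging.addLevelName(TRACE, "TRACE")


def problem_level(needs_human, everyone_affected):
    """Pick a level for something that went wrong."""
    if everyone_affected:
        return logging.CRITICAL  # FATAL: the whole application is down
    if needs_human:
        return logging.ERROR     # some sessions impacted, someone must act
    return logging.WARNING       # wrong, but everything still works


logger = logging.getLogger("levels_demo")
logger.log(problem_level(False, False), "fallback content sent for /home")
logger.log(problem_level(True, False), "service call timed out")
logger.log(problem_level(True, True), "application properties are not present")
logger.log(TRACE, "entering method load_page")  # hidden unless TRACE is enabled
```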
Retention
What is a normal time to keep application logs? 90 days online and 1 year offline for all technical logs, but every industry has its own needs, so it could be a lot longer.
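The 90 days online and 1 year offline rule can be sketched as a small decision function. The names `retention_action`, `ONLINE_DAYS` and `OFFLINE_DAYS` are hypothetical, and the sketch assumes the offline year starts after the online period ends; if your policy counts the year from the log date, adjust the second threshold.

```python
from datetime import date, timedelta

ONLINE_DAYS = 90    # directly searchable
OFFLINE_DAYS = 365  # archived; assumed to start after the online period


def retention_action(log_date, today):
    """Decide what to do with a log file written on log_date."""
    age = (today - log_date).days
    if age <= ONLINE_DAYS:
        return "keep-online"
    if age <= ONLINE_DAYS + OFFLINE_DAYS:
        return "archive-offline"
    return "delete"


today = date(2024, 1, 1)
recent = retention_action(today - timedelta(days=30), today)
older = retention_action(today - timedelta(days=200), today)
expired = retention_action(today - timedelta(days=500), today)
```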
Access log
An access log is a list of all the requests for individual files or endpoints that have been made to an API. Access logs can offer a great deal of information about the incoming requests to your API. If you need to analyse these logs in large amounts, it may be beneficial to use a log analysis tool that can “crunch the numbers” for you much faster.
Example:
127.0.0.1 - peter [9/Feb/2017:10:34:12 -0700] "GET /sample-image.png HTTP/2" 200 1479
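To show what information such a line carries, here is a minimal sketch that parses the example above with a regular expression. It assumes the common layout of IP, identity, user, timestamp, request, status code and size; the pattern, field names and `parse_access_line` are illustrative, and real servers may be configured with other formats.

```python
import re
from datetime import datetime

LINE_RE = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_access_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    # %b expects English month abbreviations under the default C locale.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


line = ('127.0.0.1 - peter [9/Feb/2017:10:34:12 -0700] '
        '"GET /sample-image.png HTTP/2" 200 1479')
entry = parse_access_line(line)
print(entry["ip"], entry["path"], entry["status"], entry["size"])
```

For large volumes a dedicated log analysis tool remains the better choice; a parser like this is mainly useful for quick checks and tests.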
Retention
What is a normal time to keep access logs? 90 days online and 1 year offline for all technical logs, but every industry has its own needs, so it could be a lot longer.
Conclusion
Whatever you do, think about it in the design phase; do not make logging an after-the-fact exercise. Logging is too important not to think about and not to push for. As a Dev or Ops engineer you need to know what your application is doing, so that when the time comes and it does something you did not expect, you can look in the logs and say “that is why it went wrong”, not “hmm, I do not see anything”.