What to log

Quick overview.

Every application you write should have good logging. But what is good logging? Let’s start with a few no-brainers. Your application's uptime is always more important than your logging. If the data you want to write is more important than that, it does not belong in a log but in a database of some sort. Log data is transient.

We can define three types of logging:

  • Application logs: all application failures, starts and stops, with their reasons (see Log levels).
  • Audit logs: all functional actions done on behalf of consumers, recording who, when and what.
  • Access logs: HTTP-based logs recording which IP connected to which endpoint, at what time, and which status code was returned.
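
One way to keep the three log types apart is to give each its own logger and format. The following is a minimal sketch using Python's standard logging module; the logger names (myapp.application, myapp.audit, myapp.access) and the message fields are assumptions made for illustration.

```python
import logging
import sys


def build_logger(name: str, fmt: str) -> logging.Logger:
    """Create a logger with its own stream handler and format."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the three streams separate
    if not logger.handlers:  # avoid adding a second handler on re-import
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(logging.Formatter(fmt))
        logger.addHandler(handler)
    return logger


# Hypothetical logger names, one per log type.
app_log = build_logger("myapp.application", "%(asctime)s %(levelname)s %(message)s")
audit_log = build_logger("myapp.audit", "%(asctime)s AUDIT %(message)s")
access_log = build_logger("myapp.access", "%(asctime)s ACCESS %(message)s")

app_log.info("application started: reason=deploy")
audit_log.info("who=alice what=update_profile")
access_log.info("ip=127.0.0.1 endpoint=/health status=200")
```

In a real deployment each logger would probably write to its own file or log stream, so that each type can have its own retention period.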

Log types

Audit

Why audit logging? Inevitably, someone asks why event data should be logged on a given system. Essentially there are four categories of reasons:

  • Accountability – Log data can identify what accounts are associated with certain events. This information then can be used to highlight where training and/or disciplinary actions are needed.
  • Reconstruction – Log data can be reviewed chronologically to determine what was happening both before and during an event. For this to happen, the accuracy and coordination of system clocks are critical. To accurately trace activity, clocks need to be regularly synchronized to a central source to ensure that the date/time stamps are in sync.
  • Intrusion Detection – Unusual or unauthorized events can be detected through the review of log data, assuming that the correct data is being logged and reviewed. The definition of what constitutes unusual activity varies, but can include failed login attempts, login attempts outside of designated schedules, locked accounts, port sweeps, network activity levels, memory utilization, key file/data access, etc.
  • Problem Detection – In the same way that log data can be used to identify security events, it can be used to identify problems that need to be addressed, for example by investigating the causal factors of failed jobs, resource utilization, trends and so on.

What to log? Essentially, for each monitored system and each likely event condition, enough data must be logged for determinations to be made. At a minimum, you need to be able to answer the standard who, what and when questions, for example:

  • Was an authentication request a success or a failure?
  • Was an authorization request a success or a failure?

Both questions can be asked for an individual account or, more generally, as a count of how many requests failed.
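
As an illustration of the who, what and when questions, here is a small sketch of an audit record in Python. The AuditEvent class, its field names, and the helper functions are hypothetical and not a standard format. The sketch covers both the per-account view and the aggregate failure count.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    who: str      # account that performed the action
    what: str     # e.g. "authentication" or "authorization"
    outcome: str  # "success" or "failure"
    when: str     # ISO 8601 timestamp in UTC


def audit_event(who: str, what: str, success: bool) -> AuditEvent:
    """Build an audit record stamped with the current UTC time."""
    return AuditEvent(
        who=who,
        what=what,
        outcome="success" if success else "failure",
        when=datetime.now(timezone.utc).isoformat(),
    )


def count_failures(events, what: str) -> int:
    """Answer the general question: how many requests of this kind failed?"""
    return sum(1 for e in events if e.what == what and e.outcome == "failure")


events = [
    audit_event("alice", "authentication", True),
    audit_event("bob", "authentication", False),
    audit_event("bob", "authorization", False),
]

# One JSON object per line is easy to search per account later on.
for event in events:
    print(json.dumps(asdict(event)))
```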

Retention period of audit logs. What is a normal time to keep audit logs? I think 90 days should be enough for problems to be noticed and researched, but every industry has its own needs, so it could be a lot longer.

Application

Which events to log

The level and content of security monitoring, alerting and reporting needs to be set during the requirements and design stage of projects, and should be proportionate to the information security risks. This can then be used to define what should be logged. There is no one size fits all solution, and a blind checklist approach can lead to unnecessary "alarm fog" that means real problems go undetected. Where possible, always log:

  • Input validation failures e.g. protocol violations, unacceptable encodings, invalid parameter names and values
  • Output validation failures e.g. database record set mismatch, invalid data encoding
  • Session management failures e.g. cookie session identification value modification
  • Application errors and system events e.g. syntax and runtime errors, connectivity problems, performance issues, third party service error messages, file system errors, file upload virus detection, configuration changes
  • Application and related systems start-ups and shut-downs, and logging initialization (starting, stopping or pausing)
  • Use of higher-risk functionality e.g. network connections, addition or deletion of users, changes to privileges, assigning users to tokens, adding or deleting tokens, use of systems administrative privileges, access by application administrators, all actions by users with administrative privileges, access to payment cardholder data, use of data encrypting keys, key changes, creation and deletion of system-level objects, data import and export including screen-based reports, submission of user-generated content - especially file uploads
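
To make the first item in the list above concrete, the sketch below logs input validation failures for request parameters. The logger name, the ALLOWED_PARAMS set, and the rule that values must be numeric are assumptions chosen for the example; a real application would apply its own validation rules.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("myapp.security")  # hypothetical logger name

ALLOWED_PARAMS = {"page", "size"}  # hypothetical set of accepted parameter names


def validate_params(params: dict, client_ip: str) -> bool:
    """Return True if every parameter is acceptable; log each failure."""
    ok = True
    for name, value in params.items():
        if name not in ALLOWED_PARAMS:
            log.warning(
                "input validation failure: invalid parameter name=%r ip=%s",
                name, client_ip,
            )
            ok = False
        elif not str(value).isdigit():
            log.warning(
                "input validation failure: invalid value for parameter name=%r ip=%s",
                name, client_ip,
            )
            ok = False
    return ok


validate_params({"page": "1", "size": "20"}, "127.0.0.1")   # passes, nothing logged
validate_params({"page": "abc", "admin": "1"}, "127.0.0.1")  # logs two failures
```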

Optionally, consider whether the following events can be logged and whether the information is desirable:

  • Sequencing failure
  • Excessive use
  • Data changes
  • Fraud and other criminal activities
  • Suspicious, unacceptable or unexpected behaviour
  • Modifications to configuration
  • Application code file and/or memory changes

Log levels

INFO

  • All important information that we need for normal operations.
  • All interesting global information on performance and trends. This logging should be minimal.

Typical information that is interesting:

  • Application start-up and shutdown information
  • User logging on, user logging off
  • Pages (request URIs) being accessed
  • Performance of service calls, indicating whether the result was retrieved from a cache

WARN 

All information about things that are going wrong but do not need human intervention. Something is not as it should be, but everything still works. For example, fallback content is sent back.

ERROR 

All information about things that are going wrong that do need human intervention. One or several sessions (users) are impacted. For example, a service call timed out or a page cannot be found.

DEBUG 

All actions done by humans when using the application. Logging that should help an administrator determine the cause of an error. All logging should be understandable and relevant for administrators.

TRACE 

All actions done by the application. All other logging, which should be understandable and relevant for developers. Examples are method entries and exits, results returned from services and databases. This is the only level at which stack traces are allowed. A stack trace at any other level is a program error.

FATAL

Everyone is affected; the entire application is not working. For example, application properties are not present.
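
The levels above map fairly directly onto most logging libraries. As a rough sketch in Python: the standard logging module has no TRACE level, so the example registers one with the value 5 (an assumption, picked only to sit below DEBUG), and FATAL is an alias for CRITICAL there. The logger name and the messages are made up for illustration.

```python
import logging

TRACE = 5  # assumption: a custom level below DEBUG (10)
logging.addLevelName(TRACE, "TRACE")

logging.basicConfig(format="%(levelname)-8s %(message)s")
log = logging.getLogger("myapp")  # hypothetical logger name
log.setLevel(TRACE)  # let everything through for the demonstration

# TRACE: actions done by the application, aimed at developers.
log.log(TRACE, "enter load_page(uri=%s)", "/home")
# DEBUG: actions done by humans, aimed at administrators.
log.debug("user alice submitted form 'profile'")
# INFO: what we need for normal operations.
log.info("application started")
# WARN: something is off, but no human intervention is needed.
log.warning("content service unavailable, fallback content sent")
# ERROR: one or several sessions are impacted, a human needs to look.
log.error("service call timed out after %d ms", 3000)
# FATAL: the entire application is not working.
log.critical("application properties not found, shutting down")
```

In production the threshold would normally be set to INFO or WARN, with DEBUG and TRACE switched on only while investigating a problem.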

Retention

What is a normal time to keep application logs? 90 days online and 1 year offline for all technical logs, but every industry has its own needs, so it could be a lot longer.
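
The 90 days of online retention can be approximated with daily log rotation. The sketch below uses TimedRotatingFileHandler from Python's standard library; the temporary directory, file name, and logger name are placeholders, and the one-year offline archive would have to be handled by a separate process that is not shown here.

```python
import logging
import logging.handlers
import os
import tempfile

ONLINE_RETENTION_DAYS = 90  # online retention period suggested in the text

log_dir = tempfile.mkdtemp()  # stand-in for a real log directory
log_path = os.path.join(log_dir, "application.log")

# Rotate at midnight (UTC) and keep at most 90 rotated files;
# older files are deleted by the handler when it rotates.
handler = logging.handlers.TimedRotatingFileHandler(
    log_path,
    when="midnight",
    backupCount=ONLINE_RETENTION_DAYS,
    utc=True,
)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

log = logging.getLogger("myapp.application")  # hypothetical logger name
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(handler)

log.info("application started")
```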

Access log

An access log is a list of all the requests for individual files or endpoints that have been made to an API. Access logs can offer a great deal of information about the incoming requests to your API. If you need to analyse these logs in large amounts, it may be beneficial to use a log analysis tool that can “crunch the numbers” for you much faster.

Example:

127.0.0.1 - peter [9/Feb/2017:10:34:12 -0700] "GET /sample-image.png HTTP/2" 200 1479
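
For small volumes, a line in this format can also be picked apart with a few lines of code. The following Python sketch parses the example line above; the regular expression and the field names are assumptions based on that single example and would need adjusting for other access log formats.

```python
import re
from datetime import datetime

# Pattern for the example line; field names are chosen for illustration.
LINE_RE = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_access_line(line: str) -> dict:
    """Split one access log line into its fields."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognised access log line: {line!r}")
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A "-" in the size column means no body was sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    entry["when"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


line = '127.0.0.1 - peter [9/Feb/2017:10:34:12 -0700] "GET /sample-image.png HTTP/2" 200 1479'
entry = parse_access_line(line)
print(entry["ip"], entry["user"], entry["method"], entry["path"], entry["status"], entry["size"])
```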

Retention

What is a normal time to keep access logs? The same as for application logs: 90 days online and 1 year offline for all technical logs, but every industry has its own needs, so it could be a lot longer.

Conclusion

Whatever you do, think about logging in the design phase; do not make it an after-the-fact exercise. Logging is too important not to think about, and not to push for. As a Dev or Ops engineer you need to know what your application is doing, so that when the time comes and it does something you did not expect, you can look in the logs and say “that is why it went wrong” rather than “hmm, I do not see anything”.
