Logging & Its Best Practices for DevOps

Logging is a critical aspect of DevOps for monitoring, troubleshooting, and maintaining system health.

But first, let's understand what exactly logging is.

Logging:

Logging is the process of recording events, messages, or other information about the operation of a system or application.

This information, often called logs, helps developers, system administrators, and other stakeholders understand what the system is doing, diagnose problems, and monitor the system's performance and behavior.


Now, let's learn about the types of logs, log levels, log formats, and popular logging libraries.

Types of Logs

Event Logs

Record significant occurrences in the system.

[2024-06-03 08:00:00] INFO: System boot initiated.
[2024-06-03 08:00:10] INFO: System boot completed successfully.
[2024-06-03 20:00:00] INFO: System shutdown initiated by user admin.
[2024-06-03 20:00:05] INFO: All services stopped successfully.
[2024-06-03 20:00:10] INFO: System shutdown completed.        

Error Logs

Capture errors and exceptions that occur during runtime.

[2024-06-03 12:00:00] ERROR: Application crash detected: NullPointerException at com.example.MyApp.main(MyApp.java:42).
[2024-06-03 12:05:00] ERROR: Unable to connect to the database. SQLException: Connection refused.
[2024-06-03 12:10:00] ERROR: FileNotFoundException: Config file '/etc/myapp/config.json' not found.
[2024-06-03 13:00:00] ERROR: Disk space critically low on /dev/sda1. Only 5MB available.        

Transaction Logs

Track transactions or business operations.

[2024-06-03 09:00:00] INFO: Transaction ID: 1234567890, User: john.doe, Type: Purchase, Amount: $100.00, Status: Success, Payment Method: Credit Card, Timestamp: 2024-06-03 09:00:00
[2024-06-03 09:05:00] ERROR: Transaction ID: 1234567891, User: jane.doe, Type: Withdrawal, Amount: $200.00, Status: Failed, Reason: Insufficient Funds, Timestamp: 2024-06-03 09:05:00
[2024-06-03 09:10:00] INFO: Transaction ID: 1234567892, User: john.doe, Type: Refund, Amount: $50.00, Status: Success, Original Transaction ID: 1234567888, Timestamp: 2024-06-03 09:10:00
[2024-06-03 10:00:00] INFO: Order ID: 9876543210, User: alice.smith, Items: [{"item_id": "1234", "quantity": 2}, {"item_id": "5678", "quantity": 1}], Total Amount: $150.00, Status: Placed, Timestamp: 2024-06-03 10:00:00
[2024-06-03 12:00:00] INFO: Order ID: 9876543210, User: alice.smith, Status: Delivered, Delivery Date: 2024-06-03, Timestamp: 2024-06-03 12:00:00        

Audit Logs

Keep track of access and changes to data for security and compliance.

[2024-06-03 09:00:00] INFO: User john.doe logged in. IP: 192.168.1.100, Session ID: abc123def456
[2024-06-03 10:00:00] INFO: User jane.doe changed password. IP: 192.168.1.101
[2024-06-03 11:00:00] INFO: Admin admin1 created user account for alice.smith. Role: User, IP: 192.168.1.102
[2024-06-03 12:00:00] INFO: Admin admin2 changed role for user bob.jones to Admin. IP: 192.168.1.103
[2024-06-03 13:00:00] INFO: Admin admin1 disabled user account for carol.white. IP: 192.168.1.104
[2024-06-03 14:00:00] INFO: User john.doe accessed file /documents/report.pdf. IP: 192.168.1.100        

Log Levels

Trace

Fine-grained informational events, typically only valuable during development.

[2024-06-03 10:00:00] TRACE: Entering method getUserProfile(). UserId: 12345
[2024-06-03 10:00:00] TRACE: Cache lookup for key 'user:12345' returned miss; falling back to database.
[2024-06-03 10:00:01] TRACE: Exiting method getUserProfile(). Duration: 45ms        

Debug

Detailed information used to diagnose issues.

[2024-06-03 10:00:00] DEBUG: Starting process to handle user profile request. UserId: 12345
[2024-06-03 10:00:08] DEBUG: Response payload assembled. UserId: 12345, Payload: { "userName": "John Doe", "email": "[email protected]", "orders": [...] }        

Info

Informational messages that highlight the progress of the application.

[2024-06-03 10:00:00] INFO: Received request for user profile. UserId: 12345        

Warn

Indicate potential problems or non-critical issues.

[2024-06-03 10:00:01] WARN: Deprecated API endpoint accessed. Endpoint: /v1/user/profile, UserId: 12345
[2024-06-03 10:00:03] WARN: High memory usage detected. CurrentUsage: 85%, Threshold: 80%
[2024-06-03 10:00:04] WARN: Fallback to secondary data source due to primary source failure. DataSource: UserCache, UserId: 12345        

Error

Capture error events that might still allow the application to continue running.

[2024-06-03 10:00:00] ERROR: Database connection failure. Database: UserDB, Reason: Connection timed out, UserId: 12345
[2024-06-03 10:00:01] ERROR: Failed to authenticate user. UserId: 12345, Reason: Invalid credentials, IP: 192.168.1.100
[2024-06-03 10:00:02] ERROR: Unable to fetch user details. UserId: 12345, Service: UserService, Reason: Service unavailable
[2024-06-03 10:00:03] ERROR: Exception thrown during order retrieval. UserId: 12345, Exception: NullPointerException at OrderService.getOrder(OrderService.java:45)
[2024-06-03 10:00:04] ERROR: Payment processing failed. TransactionId: 987654321, UserId: 12345, Reason: Insufficient funds
[2024-06-03 10:00:05] ERROR: Data integrity violation. Entity: UserProfile, UserId: 12345, Reason: Duplicate entry for key 'email'
[2024-06-03 10:00:06] ERROR: Unauthorized access attempt. UserId: 12345, Endpoint: /admin/settings, IP: 192.168.1.101
[2024-06-03 10:00:07] ERROR: File upload failed. UserId: 12345, Filename: profile.jpg, Reason: File size exceeds limit
[2024-06-03 10:00:08] ERROR: System out of memory. Action: Saving user session, UserId: 12345, AvailableMemory: 50MB
[2024-06-03 10:00:09] ERROR: Critical configuration missing. ConfigKey: smtp_server, UserId: 12345, Service: EmailService        

Fatal

Severe error events that lead to the termination of the application.

[2024-06-03 10:00:00] FATAL: Critical system failure. Reason: Out of memory. Application terminating.
[2024-06-03 10:00:01] FATAL: Unhandled exception in main application thread. Exception: java.lang.OutOfMemoryError: Java heap space
[2024-06-03 10:00:02] FATAL: Database corruption detected. Database: UserDB, Reason: Inconsistent state in critical tables.
[2024-06-03 10:00:03] FATAL: Failed to initialize core services. Service: AuthService, Reason: Configuration file missing.
[2024-06-03 10:00:04] FATAL: Security breach detected. Immediate shutdown initiated to protect data integrity.
[2024-06-03 10:00:05] FATAL: Hardware failure detected. Component: Disk 1, Server: ProdServer01, Action: System halt.
[2024-06-03 10:00:06] FATAL: Irrecoverable error in transaction processing. Transaction ID: 987654321, Reason: Null pointer dereference.
[2024-06-03 10:00:07] FATAL: System integrity compromised. Reason: Unauthorized modification of core files detected.
[2024-06-03 10:00:08] FATAL: Critical configuration missing. ConfigKey: database_url, Service: MainApp, Action: Shutting down application.
[2024-06-03 10:00:09] FATAL: Catastrophic failure. Reason: Kernel panic, Server: ProdServer01, Action: Immediate reboot.        
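The levels above form a severity hierarchy: a logger configured at a given threshold emits only messages at that level or above. A small sketch with Python's `logging` module (the logger name "payments" is illustrative):

```python
import logging

# Sketch: only messages at or above the configured threshold are emitted.
logging.basicConfig(level=logging.WARNING)  # WARNING and above pass through
log = logging.getLogger("payments")         # example logger name

log.debug("Cache lookup for user 12345")     # suppressed: below WARNING
log.info("Request received")                 # suppressed: below WARNING
log.warning("High memory usage detected")    # emitted
log.error("Payment processing failed")       # emitted
```

This is why production systems typically run at INFO or WARN: the verbose TRACE/DEBUG detail stays available for development without flooding production storage.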

Log Formats

Plain Text

Simple and human-readable, but lacks structure.

[2024-06-03 10:00:00] INFO: User john.doe logged in. IP: 192.168.1.100
[2024-06-03 10:00:01] WARN: Slow response from Auth Service. UserId: 12345, ResponseTime: 1200ms
[2024-06-03 10:00:02] ERROR: Failed to authenticate user. UserId: 12345, Reason: Invalid credentials, IP: 192.168.1.100
[2024-06-03 10:00:03] FATAL: Critical system failure. Reason: Out of memory. Application terminating.        

Structured Logs

Use JSON, XML, or similar formats to ensure logs are machine-readable and easily searchable.

JSON Example:

[
  {
    "timestamp": "2024-06-03T10:00:02Z",
    "level": "ERROR",
    "message": "Failed to authenticate user",
    "userId": 12345,
    "reason": "Invalid credentials",
    "ip": "192.168.1.100"
  },
  {
    "timestamp": "2024-06-03T10:00:03Z",
    "level": "FATAL",
    "message": "Critical system failure",
    "reason": "Out of memory",
    "action": "Application terminating"
  }
]        
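One way to produce JSON entries like the ones above is a custom `logging.Formatter` that renders each record as a JSON object. This is a sketch, not a standard API: the `JsonFormatter` class and the `context` extra field are illustrative names.

```python
import json
import logging

# Sketch of structured logging: a custom Formatter that renders each
# log record as one JSON object, keeping logs machine-readable.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge any fields passed via logging's `extra` argument.
        if hasattr(record, "context"):
            entry.update(record.context)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("structured")  # example logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Failed to authenticate user",
    extra={"context": {"userId": 12345, "reason": "Invalid credentials"}},
)
```

In practice many teams use a ready-made library (e.g. python-json-logger) rather than hand-rolling a formatter, but the principle is the same.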

XML Example:

<logs>
  <log>
    <timestamp>2024-06-03T10:00:02Z</timestamp>
    <level>ERROR</level>
    <message>Failed to authenticate user</message>
    <userId>12345</userId>
    <reason>Invalid credentials</reason>
    <ip>192.168.1.100</ip>
  </log>
  <log>
    <timestamp>2024-06-03T10:00:03Z</timestamp>
    <level>FATAL</level>
    <message>Critical system failure</message>
    <reason>Out of memory</reason>
    <action>Application terminating</action>
  </log>
</logs>        

Binary Logs

These are more efficient but harder to interpret without specific tools.

00000000  5b 32 30 32 34 2d 30 36  2d 30 33 54 31 30 3a 30  |[2024-06-03T10:0|
00000010  30 3a 30 30 5a 5d 20 49  4e 46 4f 3a 20 55 73 65  |0:00Z] INFO: Use|
00000020  72 20 6a 6f 68 6e 2e 64  6f 65 20 6c 6f 67 67 65  |r john.doe logge|
00000030  64 20 69 6e 2e 20 49 50  3a 20 31 39 32 2e 31 36  |d in. IP: 192.16|
000000c0  20 45 52 52 4f 52 3a 20  46 61 69 6c 65 64 20 74  | ERROR: Failed t|
00000120  30 30 0a 5b 32 30 32 34  2d 30 36 2d 30 33 54 31  |00.[2024-06-03T1|
00000130  30 3a 30 30 3a 30 33 5a  5d 20 46 41 54 41 4c 3a  |0:00:03Z] FATAL:|
00000140  20 43 72 69 74 69 63 61  6c 20 73 79 73 74 65 6d  | Critical system|
00000150  20 66 61 69 6c 75 72 65  2e 20 52 65 61 73 6f 6e  | failure. Reason|
00000160  3a 20 4f 75 74 20 6f 66  20 6d 65 6d 6f 72 79 2e  |: Out of memory.|
00000170  20 41 70 70 6c 69 63 61  74 69 6f 6e 20 74 65 72  | Application ter|        

Logging Frameworks and Libraries

  • Java: Log4j, SLF4J, Logback
  • Python: Logging module, Loguru
  • JavaScript/Node.js: Winston, Bunyan
  • Go: log (standard library), Logrus, Zap

Here are some logging best practices with DevOps in mind:

  1. Centralized Logging: Use centralized logging solutions like ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, Fluentd, or Cloud-native services (AWS CloudWatch, Azure Monitor) to aggregate logs from all systems.
  2. Structured Logging: Implement structured logging to ensure logs are in a consistent and queryable format, such as JSON.
  3. Log Levels: Use appropriate log levels (e.g., DEBUG, INFO, WARN, ERROR, FATAL) to classify the severity and importance of log messages.
  4. Correlation IDs: Use correlation IDs to trace and track requests across distributed systems and microservices, making it easier to follow a request’s journey.
  5. Log Rotation and Retention: Implement log rotation to archive old logs and set retention policies to manage log storage efficiently and comply with regulatory requirements.
  6. Security: Ensure that sensitive information (e.g., passwords, API keys) is not logged. Use encryption and access controls to protect log data.
  7. Real-time Monitoring and Alerts: Set up real-time monitoring and alerting to detect and respond to issues promptly. Use tools like Prometheus, Grafana, or Datadog for this purpose.
  8. Contextual Information: Include contextual information (e.g., user ID, transaction ID) in logs to make debugging easier and more effective.
  9. Performance Impact: Ensure that logging does not significantly impact application performance. Use asynchronous logging where possible.
  10. Compliance and Auditing: Maintain logs that meet compliance and auditing requirements, ensuring traceability and accountability.
  11. Automated Log Analysis: Implement automated log analysis tools to detect anomalies, patterns, and potential issues proactively.
  12. Documentation and Standards: Document logging standards and practices, and ensure that all team members follow them consistently.
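Point 4 (correlation IDs) can be sketched in Python with `contextvars` and a `logging.Filter` that stamps every record with the current request's ID. The names here (`correlation_id`, `CorrelationFilter`, `handle_request`) are illustrative, not a standard API:

```python
import contextvars
import logging
import uuid

# Holds the correlation ID for the current request context.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    def filter(self, record):
        # Enrich every record with the current correlation ID;
        # never drops records (always returns True).
        record.correlation_id = correlation_id.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "[%(asctime)s] %(levelname)s [cid=%(correlation_id)s]: %(message)s"))
handler.addFilter(CorrelationFilter())
logger = logging.getLogger("svc")  # example logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request():
    # Set a fresh ID at the edge of the system (e.g. in middleware),
    # or reuse one received from an upstream service's headers.
    correlation_id.set(str(uuid.uuid4()))
    logger.info("Request received")
    logger.info("Request completed")

handle_request()
```

Every line logged while handling that request carries the same `cid=...` value, so it can be grepped or queried across services in the centralized logging system.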


WARNING: Excessive logging can exhaust disk space, overload your database or other log store, and significantly increase costs. Store only what is required, and automatically clean up logs that are no longer needed after a set period.

Pro tip: Use Time to Live (TTL) settings where available to expire old logs automatically.


Bonus section (since you have read till now)

Container Logging: In Kubernetes environments, container logging is crucial. Use logging agents like Fluentd, Fluent Bit, or Filebeat deployed as DaemonSets to collect logs from each node. Ensure logs from all containers are aggregated, tagged with metadata (e.g., pod name, namespace), and forwarded to a centralized logging system for analysis. Implement log rotation and retention policies at the container level to prevent log files from consuming excessive disk space.


Ready to take your logging to the next level? Start implementing these best practices today to enhance your monitoring and troubleshooting capabilities.

Share your experiences and insights in the comments below!

If you found this article helpful, don't forget to like and share it with your network.


Let's drive better DevOps practices together!

Cheers,

Sandip Das
