Observability isn't just a buzzword; it's the art of understanding the internal workings of a system from its external outputs, like metrics or logs. In the era of microservices, our systems have paradoxically become more complex and challenging.
In this piece, I'll share simple, yet effective hacks from my experience that can enhance observability within your organization or team.
1. Incentivize alert acknowledgment:
- Creating a culture of responsiveness: Establish a norm where each alert is promptly acknowledged. Automatically create a ticket for every alert. This allows the on-call engineer to review each associated ticket at the end of a sprint, helping prioritize actions for the next cycle.
- Encouraging adoption: Integrate this process into your team's performance metrics to foster adherence.
- Practical impact: Imagine your team maintains a microservice with a 95% uptime goal. This translates to no more than 18 P0 alerts per year (365 X 1-0.95 ~ 18), assuming a P0 alert signifies customer downtime. With a diligent approach to acknowledging and acting upon alerts, you'll gradually notice a reduction in trivial alerts, enhancing focus on crucial issues.
- Long-term benefits: An efficient alert system reduces noise and fine-tunes your response mechanisms.
2. Maintain an Updated Run Book:
- The underrated tool: Often seen as tedious, keeping a runbook updated is invaluable. It empowers prompt and effective actions during critical incidents.
- Incorporate into handovers: Make it a routine to discuss alert triggers and the adequacy of the run book during on-call handovers.
3. Design intuitive dashboards:
- Avoid information overload: In a crisis, every minute counts. Avoid cluttering dashboards with excessive graphs and charts.
- Focus on key signals: Feature the four golden signals (latency, errors, traffic, saturation) prominently.
- Logical grouping: Organize your dashboard into sections like upstream info, downstream info, database insights, autoscaling, etc. This speeds up dashboard loading and aids in focusing on relevant data.
- Add contextual layers: Incorporating the release_id into each chart can simplify analysis.
- Deep dive links: Include direct links to logs from the dashboard for more thorough investigations.
4. Consistent logging across microservice:
- Uniformity for efficiency: A standard logging format can significantly streamline debugging processes, eliminating the need to adapt to different log formats.
- Common schema for ingestion: A common logging format allows for a common schema to be used for ingestion and helps create the right index for the search query, resulting in a faster response to results.
I hope my experiences shed some light on making observability less of a puzzle. It's not just about the tools and techniques; it’s about creating a culture that values attention to detail and readiness to adapt.
Stay curious & keep experimenting.