The Fastest & Simplest Way To Debug Complex Systems
Unpredictable, intermittent problems are the bane of software development. Debugging them can feel like searching for the right needle in a haystack of needles, and the situation has grown considerably worse as technology stacks have evolved. With a classic monolithic LAMP stack, debugging was manageable and often focused on the products of a single vendor. Since then, software architecture has exploded with new tools, platforms, databases, languages, and frameworks: microservices, containers, ephemeral instances, schedulers, serverless models, functions as a service, third-party hosted databases, polyglot persistence, and platforms connected to other platforms with a variety of glue such as APIs and intermediary services. Today’s infrastructure is exponentially more complex and interdependent than yesterday’s. In such an environment, it is almost impossible to understand what is happening, especially as you grow.
“We need insight into infrastructure to build product.” — Customer
This is precisely the problem that Honeycomb solves. Honeycomb is an event-driven debugging platform that provides real-time, interactive introspection of your data at massive scale. It helps you answer unpredictable new questions and is the go-to place to learn about problems in your architecture.
“Honeycomb gives us the ability to understand every aspect of our system in a data-driven way and allows us to evaluate our system from basically any angle” — Customer
The Promise of Real-Time Observability
In control theory, observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals. As systems outpace our ability to predict what will happen, real-time observability is critical to building predictable and manageable systems.
Honeycomb achieves this by focusing on event-driven instrumentation rather than relying solely on metrics, logging, or exception tracking. Each unit of work produces an event object that holds all the data attributes to be recorded, and these events can then be analyzed to quickly locate and identify problems. There is no performance penalty, even with tens of thousands of attributes. This exploratory debugging lets you sort through vast quantities of event data rapidly while preserving the ability to look at the raw data.
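To make the idea concrete, here is a minimal sketch of event-driven instrumentation in Python. This is not Honeycomb’s SDK: the `emit_event` helper, the field names, and the values are all hypothetical, standing in for whatever transport actually ships events to a backend. The point is that one wide, structured event per unit of work can carry arbitrarily many attributes, including high-cardinality ones.

```python
import json
import time

def emit_event(event: dict) -> str:
    """Serialize a wide, structured event as one JSON line.

    Illustrative only: a real pipeline would send this to an
    observability backend instead of returning a string.
    """
    return json.dumps(event, sort_keys=True)

# One event per unit of work, carrying as many attributes as are useful:
event = {
    "timestamp": time.time(),
    "service": "checkout",
    "endpoint": "/cart/confirm",
    "duration_ms": 187.4,
    "status_code": 200,
    "user_id": "u-4821",        # high-cardinality fields are fine
    "build_id": "2024-06-01.3",
    "db_rows_scanned": 1204,
}
line = emit_event(event)
```

Because every attribute is a named field on the event, later questions ("show me slow requests for this one user on this one build") become simple filters rather than new instrumentation.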
“Any company that has complex data will benefit from Honeycomb” — Customer
“As engineers, we really need the ability to explore data instead of looking at someone else’s exploration” — Customer
Honeycomb Begins where Monitoring Ends
Today, engineers have several tools to monitor a company’s applications and infrastructure, including APM products, metrics platforms, and log management solutions. These are fantastic products that help development teams manage their systems. Unfortunately, it is difficult to use them to keep up with the increasing scale and complexity of modern environments, which introduces unpredictability.
Log aggregation is where debugging starts, since that is where the raw data sits. But log aggregation involves a lot of string processing (not getting any faster), regexes (not getting more maintainable), and the need to predictively index on anything you might want to search on (or you’re straight back to distributed grep). This gets worse with scale and complexity.
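The contrast can be sketched in a few lines of Python. The log line, the regex, and the field names below are all hypothetical; the point is that extracting a fact from unstructured text requires a brittle pattern that breaks whenever the format changes, while a structured event carries the same fact as a keyed attribute.

```python
import re

# Unstructured log line: every new field means a new, brittle pattern.
log_line = "2024-06-01T12:00:00Z GET /cart/confirm 200 187ms user=u-4821"
pattern = re.compile(r"(\S+) (\S+) (\S+) (\d+) (\d+)ms user=(\S+)")
match = pattern.match(log_line)
duration_ms = int(match.group(5))  # positional groups, easy to get wrong

# Structured event: the same fact is simply a named attribute.
event = {
    "method": "GET",
    "path": "/cart/confirm",
    "status": 200,
    "duration_ms": 187,
    "user_id": "u-4821",
}
```

With structured events there is nothing to parse at query time, so there is also nothing to pre-index: any field can be filtered or grouped on directly.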
“You can do this with log management tools, but they are hard to run and use. You need to do a lot to get them to work.” — Customer
Monitoring solutions run automated checks to verify that the system is behaving as expected. They are very useful in identifying that there is a problem and generating an alert quickly. While this is critical, it is not enough. The dev team also needs to find out what caused the problem, which is becoming increasingly hard as environments get more complex.
Metrics platforms rely on data points to track what is happening in a system. These data streams have to be predefined and usually fall into buckets such as activity, usage, or performance. Again, they are vital in providing a high-level overview of the system, but the team has to predefine what to track. As systems get more complex and cardinality increases, the number of metric dashboards being monitored can become unmanageable. Furthermore, to deal with scale and storage costs, metrics are usually bucketed into rollups over intervals, which sacrifices precious detail about individual events and the context needed for root cause analysis.
“Averages are Terrible. The signals they generate are actively misleading” — Customer
Customers do not want an average answer. They want the unique, precise answer to their specific problem. Honeycomb works with existing log aggregation, metrics, and monitoring solutions to provide true observability in modern systems.
“Metrics tools have issues with cardinality. They work great for host-level monitoring. But with query sets with high degrees of cardinality, they fall apart” — Customer
Team
The Honeycomb team has lived through the challenges of building and managing massively complex systems at Parse and Facebook. They have been through that rocket-ship growth phase into the uncharted territory of serving tens of millions of users, have managed chaos repeatedly, and know how to bridge the gap with tooling and techniques.
At Storm, we are thrilled to join them on their journey to realize the promise of real-time observability and stop the endless cycles of downtime, fragility, and distractions plaguing dev teams.
“After spending 10 years at Google I thought I had a good handle on the different techniques to see what is going on with services. But Honeycomb changes the game by supporting fast exploration over high cardinality data. It has super powers over traditional monitoring.” — Customer