Synthetic Transactions
At Google, we call them probers; at Microsoft, they are called runners; more generically, they are synthetic transaction health checks – components that perform behavioural testing¹ on a service from the outside. They are a vital part of modern service design, and I would never ship a service without one – but not for the reason most people think.
First, let's talk about what makes a good synthetic transaction. Imagine you have developed a website for a bank that lets customers check their balance, deposit checks, and transfer money. A complete test suite for a service like this will test all boundary conditions, exercise the service at scale, and more. A good synthetic transaction, by contrast, runs through a single critical user journey end-to-end. It should be simple, reliable, and fast. In this case, we might write code to make a deposit, check the balance, and withdraw the money. Performing these operations reliably over and over again is actually a lot harder than it sounds². Synthetic transaction frameworks are typically stateless and can run in parallel, so developers need to account for this in their code and handle failures. For example, if the service crashes and restarts after the funds are deposited, the next iteration will need to detect this and reset the account before proceeding.
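The bank journey above could be sketched roughly as follows. This is a minimal illustration, not Google's prober framework: `FakeBank` is a hypothetical in-memory stand-in for the real service, and the sketch assumes each prober instance owns its own dedicated test account (for example, one per region) so that parallel runs never collide. Note the recovery step at the start of each iteration, which handles the crash-after-deposit case.

```python
from dataclasses import dataclass, field


class ProbeFailure(Exception):
    """Raised when a step of the synthetic transaction misbehaves."""


@dataclass
class FakeBank:
    """Hypothetical in-memory stand-in for the bank's public API (amounts in cents)."""
    balances: dict[str, int] = field(default_factory=dict)

    def balance(self, account: str) -> int:
        return self.balances.get(account, 0)

    def deposit(self, account: str, cents: int) -> None:
        self.balances[account] = self.balance(account) + cents

    def withdraw(self, account: str, cents: int) -> None:
        if self.balance(account) < cents:
            raise ValueError("insufficient funds")
        self.balances[account] = self.balance(account) - cents


def run_probe(bank: FakeBank, account: str, amount: int = 100) -> None:
    """One iteration of the deposit -> check balance -> withdraw journey.

    Assumes `account` is a dedicated probe account whose resting balance is 0.
    """
    # Recovery: if a previous iteration crashed after depositing, the account
    # still holds funds. Reset it to the zero baseline before starting.
    leftover = bank.balance(account)
    if leftover > 0:
        bank.withdraw(account, leftover)

    bank.deposit(account, amount)
    if bank.balance(account) != amount:
        raise ProbeFailure(f"expected balance {amount}, got {bank.balance(account)}")

    bank.withdraw(account, amount)
    if bank.balance(account) != 0:
        raise ProbeFailure("balance did not return to zero after withdrawal")


if __name__ == "__main__":
    bank = FakeBank()
    bank.deposit("probe-us-east", 100)  # simulate funds left behind by a crash
    run_probe(bank, "probe-us-east")
    print(bank.balance("probe-us-east"))  # prints 0
```

Resetting to a known baseline at the start of every run, rather than cleaning up only at the end, is what keeps the probe reliable: a crash at any point leaves state that the next iteration can recognize and repair.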
Once developed, the synthetic transactions should be deployed to production in one or more regions (using separate infrastructure from your primary service) and set to run periodically. Monitoring rules must then be configured, and alerts created to notify the on-call engineer of any failures. Integrating synthetic transaction monitors into a service's automated rollout system is also generally a good idea.
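In practice the scheduling and alerting described above come from your monitoring platform, but the logic is simple enough to sketch. The hypothetical `run_periodically` and `ConsecutiveFailureAlert` below assume a threshold of three consecutive failures before paging (an arbitrary value chosen so a single transient blip does not wake anyone up), and the `page` callback stands in for whatever notification system you use. The `iterations` parameter exists only to make the sketch testable; a real runner would loop indefinitely.

```python
import time
from typing import Callable


class ConsecutiveFailureAlert:
    """Hypothetical alert rule: fire once after `threshold` failures in a row."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, success: bool) -> bool:
        """Record one probe result; return True if on-call should be paged now."""
        if success:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Fire exactly when the streak reaches the threshold, not on every
        # subsequent failure, so one outage produces one page.
        return self.consecutive_failures == self.threshold


def run_periodically(
    probe: Callable[[], None],
    rule: ConsecutiveFailureAlert,
    page: Callable[[str], None],
    iterations: int,
    interval_s: float = 60.0,
) -> None:
    """Run `probe` `iterations` times, `interval_s` seconds apart."""
    for i in range(iterations):
        error = None
        try:
            probe()
        except Exception as exc:  # any exception counts as a failed run
            error = exc
        if rule.record(error is None):
            page(f"synthetic transaction failed {rule.threshold} times in a row; "
                 f"last error: {error!r}")
        if i + 1 < iterations:
            time.sleep(interval_s)
```

The same "N failures in a row" check is a natural gate for an automated rollout: if the probe starts failing shortly after a deployment, halt the rollout.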
Done, right? Not even close. While synthetic transactions are a vital part of a service's monitoring story, they should not be thought of as the primary way we monitor our services for several key reasons:
For all these reasons, I advise teams not to rely on synthetic transactions as the primary alerting solution for their service. Instead, I prefer to think of synthetic transactions as "traffic generators" (basically a way of ensuring there is at least some load on the service) and use 'inside' alerts at the microservice level to trigger on-call notifications. In our bank website example, this would mean creating monitors and alerts on the individual services and back-end APIs (e.g., the success rate of the deposit API, the latency of the check deposit API, etc.). When done right, these targeted alerts route problems quickly to the correct team and speed up troubleshooting. Synthetic transactions can be counted on to ensure there is always some load on the system, so that new or under-used services still have enough traffic to trigger inside alerts. Alerts based on the synthetic transactions should still be configured, but they should be thought of as a 'last line of defense,' and repair items should be created to improve the microservice-level alerts whenever a synthetic transaction alert fires.
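Here is a minimal sketch of such 'inside' alerts, assuming you can query a window of per-request records for each back-end API. The API names, owning teams, and thresholds are hypothetical, and the 99th-percentile latency uses the simple nearest-rank method. The `min_requests` guard shows why the synthetic traffic matters: an API that receives too few requests in a window cannot be judged, so the rule stays silent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    api: str           # e.g. "deposit", "check_deposit" (hypothetical names)
    ok: bool
    latency_ms: float


@dataclass(frozen=True)
class ApiSlo:
    owner: str                # team that gets paged for this API
    min_success_rate: float   # e.g. 0.99
    max_p99_ms: float
    min_requests: int = 20    # below this the window is too sparse to judge


def p99(latencies: list[float]) -> float:
    """Nearest-rank 99th percentile: the ceil(0.99 * n)-th smallest value."""
    ordered = sorted(latencies)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(99n / 100)
    return ordered[rank - 1]


def evaluate(window: list[Request], slos: dict[str, ApiSlo]) -> list[str]:
    """Return one alert message per violated rule, tagged with the owning team."""
    alerts = []
    for api, slo in slos.items():
        reqs = [r for r in window if r.api == api]
        if not reqs or len(reqs) < slo.min_requests:
            continue  # not enough traffic: the gap synthetic load is meant to fill
        success = sum(r.ok for r in reqs) / len(reqs)
        if success < slo.min_success_rate:
            alerts.append(f"[{slo.owner}] {api} success rate "
                          f"{success:.1%} < {slo.min_success_rate:.1%}")
        latency = p99([r.latency_ms for r in reqs])
        if latency > slo.max_p99_ms:
            alerts.append(f"[{slo.owner}] {api} p99 latency "
                          f"{latency:.0f} ms > {slo.max_p99_ms:.0f} ms")
    return alerts
```

Because every rule carries an `owner`, a failing deposit API pages the team that owns it directly, instead of paging whoever owns the end-to-end probe and leaving them to work out which box on the whiteboard is broken.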
My "great" contribution to the service my team is getting ready to launch (other than drawing boxes and arrows on a whiteboard and finding engineers to own each of the boxes) was to write the initial set of prober health checks for the service. I had a lot of fun, and it was an excellent way to delve into the details that are so easy to gloss over as an architect. Among its engineers, Google has a reputation for having a tool for everything (actually, there are usually three tools for everything: the deprecated one, the "not yet supported, but will be really cool when it is done" one, and a third, developed by a brilliant engineer who got frustrated and wrote her own after trying the other two). So, true to form, Google has a ready-made system for developing and running synthetic transactions. As a result, creating my prober mostly entailed stitching together building blocks. In a few days, I was able to build and deploy a component that automatically simulates a critical user journey from 12 undisclosed locations around the world.
In summary, probers/runners/synthetic transactions are critical to your service monitoring story. They must be simple, fast, and, most importantly, reliable. And while you should have monitors and alerts configured on the synthetic transactions themselves, they need to be thought of as a final line of defense for your service.
Be Happy!
Like this post? Please consider sharing, checking out my other articles, and subscribing to my weekly Flegg's Follies newsletter for more articles on software engineering and careers in tech.
Footnotes:
Please note that the opinions stated here are my own, not those of my company.