Synthetic Transactions

At Google, we call them probers; at Microsoft, they are called runners; more generically, they are synthetic transaction health checks – components that perform behavioural testing[1] on a service from the outside. They are a vital part of modern service design, and I would never ship a service without one – but not for the reason most people think.

First, let's talk about what makes a good synthetic transaction. Imagine you have developed a website for a bank which includes functionality to check your balance, deposit checks, and transfer money. A complete test suite for a service like this will test all boundary conditions, exercise the service at scale, and more. But a good synthetic transaction will run through a simple critical user journey end-to-end. It should be simple, reliable, and fast. In this case, we might write code to make a deposit, check the balance, and withdraw the money. Performing these operations reliably over and over again is actually a lot harder than it sounds[2]. Synthetic transaction frameworks are typically stateless and can run in parallel. Developers need to account for this in their code and deal with failures. For example, if the service crashes and restarts after depositing the funds, the next iteration will need to detect this and reset the account before proceeding.
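To make the deposit, check, withdraw journey concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: `FakeBank` is an in-memory stand-in for the real service, and `run_probe` and the account name are hypothetical. It shows the reset step described above (cleaning up after a previous run that died mid-journey) and assumes each parallel prober instance gets its own dedicated test account.

```python
from dataclasses import dataclass, field


@dataclass
class FakeBank:
    """In-memory stand-in for the bank service (hypothetical, illustration only)."""
    balances: dict = field(default_factory=dict)

    def balance(self, account: str) -> int:
        return self.balances.get(account, 0)

    def deposit(self, account: str, cents: int) -> None:
        self.balances[account] = self.balance(account) + cents

    def withdraw(self, account: str, cents: int) -> None:
        if cents > self.balance(account):
            raise ValueError("insufficient funds")
        self.balances[account] = self.balance(account) - cents


def run_probe(bank, account: str, cents: int = 100) -> bool:
    """Run one deposit -> check balance -> withdraw journey; True means healthy."""
    try:
        # A previous run may have crashed after depositing, so reset first.
        leftover = bank.balance(account)
        if leftover:
            bank.withdraw(account, leftover)

        bank.deposit(account, cents)
        if bank.balance(account) != cents:
            return False
        bank.withdraw(account, cents)
        return bank.balance(account) == 0
    except Exception:
        # Any unexpected error is a failed probe, not a crashed prober.
        return False
```

Calling `run_probe(FakeBank(), "probe-account-1")` returns `True`. A real prober would swap `FakeBank` for a client that talks to the production service from the outside.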

Once developed, the synthetic transactions should be deployed to production in one or more regions (using separate infrastructure from your primary service) and set to run periodically. Monitoring rules must then be configured, and alerts created to notify the on-call engineer of any failures. Integrating synthetic transaction monitors into a service's automated rollout system is also generally a good idea.

Done, right? Not even close. While synthetic transactions are a vital part of a service's monitoring story, they should not be thought of as the primary way we monitor our services for several key reasons:

  • Synthetic transactions don't run nearly often enough to find issues before your customer is impacted. Most systems default to running health checks every 5 to 60 minutes and generally don't alert until at least two consecutive runs have failed. You are likely to have accrued a lot of angry customers during this time – and lost vital time debugging the issue.
  • It is not immediately clear who should be called into the incident for large services when a synthetic transaction fails. By their definition (or at least the one I gave), synthetic transactions should be running end-to-end critical user journeys that span multiple microservices (each of which may have its own on-call rotation). Thus, you usually need to call the entire team to troubleshoot the issue when it fails. More often than not, the first half of the call is spent trying to debug the synthetic transaction itself and not identifying the actual impact on the service. This wastes valuable time and results in a lot of angry developers.
  • Simulators (a crucial part of synthetic transactions) are great – but they are no substitute for the real thing. The real world is messy – and failures are often caused by something you haven't anticipated (a new iOS release, a change in how a browser encodes tokens, etc.). More than once, I have had an engineer on a live-site call tell me the service is "just fine" because the synthetic transactions are passing – even when I show them a live repro on my device. The reverse is also true – simulators have bugs unrelated to the product and inevitably produce false positives, which reduces team trust.
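The timing in the first bullet is easy to work out. The sketch below is a simplified model rather than a description of any particular monitoring system: it assumes runs start exactly on schedule, finish instantly, and fail on every run during the outage. The function name and parameters are hypothetical.

```python
def alert_delay_minutes(interval_min: float, consecutive_failures: int = 2):
    """Return (best_case, worst_case) minutes from outage start to alert.

    Simplified model: runs start every `interval_min` minutes, finish
    instantly, and every run during the outage fails.
    """
    if interval_min <= 0 or consecutive_failures < 1:
        raise ValueError("need a positive interval and at least one failure")
    # Best case: the outage starts just before a scheduled run.
    best = (consecutive_failures - 1) * interval_min
    # Worst case: the outage starts just after a run has passed.
    worst = consecutive_failures * interval_min
    return best, worst


for interval in (5, 60):
    best, worst = alert_delay_minutes(interval)
    print(f"every {interval} min: alert after {best} to {worst} min")
```

Under these assumptions, a 5-minute schedule alerts somewhere between 5 and 10 minutes into the outage, and a 60-minute schedule between 60 and 120 minutes. Real run durations and paging delays only add to that.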

For all these reasons, I advise teams not to rely on synthetic transactions as the primary alerting solution for their service. Instead, I prefer to think of synthetic transactions as "traffic generators" (basically a way of ensuring there is at least some load on the service) and use 'inside' alerts at the microservice level to trigger on-call notifications. In our bank website example, this would mean creating monitors and alerts on the individual services and back-end APIs (e.g., the success rate of the deposit API, the latency of the check deposit API, etc.). When done right, these targeted alerts help ensure fast routing to the correct team and speed up troubleshooting. Synthetic transactions can be counted on to ensure there is always some load on the system so that new or under-used services still have traffic to trigger inside alerts. Alerts based on the synthetic transactions should still be configured, but they should be thought of as a 'last line of defense,' and repair items should be created to improve alerts at the microservice level anytime they are triggered.
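As a rough illustration of an 'inside' alert, the Python sketch below tracks the success rate and 95th-percentile latency of a single API over its most recent calls. The class name, thresholds, and window size are all assumptions chosen for the example, not values from any real monitoring product. Note the `min_samples` guard: with too little traffic the monitor cannot judge anything, which is exactly the gap that synthetic transactions acting as traffic generators fill.

```python
import math
from collections import deque


class ApiMonitor:
    """Tracks recent calls to one API and reports threshold breaches (hypothetical)."""

    def __init__(self, name, window=100, min_success_rate=0.99,
                 max_p95_ms=500.0, min_samples=20):
        self.name = name
        self.calls = deque(maxlen=window)  # (succeeded, latency_ms) pairs
        self.min_success_rate = min_success_rate
        self.max_p95_ms = max_p95_ms
        self.min_samples = min_samples

    def record(self, succeeded: bool, latency_ms: float) -> None:
        self.calls.append((succeeded, latency_ms))

    def success_rate(self) -> float:
        return sum(ok for ok, _ in self.calls) / len(self.calls)

    def p95_latency_ms(self) -> float:
        latencies = sorted(ms for _, ms in self.calls)
        # Nearest-rank percentile: smallest value with >= 95% of samples at or below it.
        rank = max(1, math.ceil(0.95 * len(latencies)))
        return latencies[rank - 1]

    def alerts(self) -> list:
        if len(self.calls) < self.min_samples:
            return []  # too little traffic to judge either way
        found = []
        if self.success_rate() < self.min_success_rate:
            found.append(f"{self.name}: success rate below {self.min_success_rate:.0%}")
        if self.p95_latency_ms() > self.max_p95_ms:
            found.append(f"{self.name}: p95 latency above {self.max_p95_ms:.0f} ms")
        return found
```

Because each monitor is named after one API, a breach points at the team that owns that API rather than at the whole end-to-end journey.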

My "great" contribution to the service my team is getting ready to launch (other than drawing boxes and arrows on a whiteboard and finding engineers to own each of the boxes) was to write the initial set of prober health checks for the service. I had a lot of fun, and it was an excellent way to delve into the details that are so easy to gloss over as an architect. Among its engineers, Google has a reputation for having a tool for everything (actually, there are usually three tools for everything: the deprecated one, the "not yet supported, but will be really cool when it is done" one, and a third, developed by a brilliant engineer who got frustrated and wrote her own after trying the other two). So, true to form, Google has a ready-made system for developing and running synthetic transactions. As a result, creating my prober mostly entailed stitching together building blocks. In a few days, I was able to build and deploy a component to automatically simulate a critical user journey from 12 undisclosed locations around the world.

In summary, probers/runners/synthetic transactions are critical to your service monitoring story. They must be simple, fast, and, most importantly, reliable. And while you should have monitors and alerts configured on the synthetic transactions themselves, they need to be thought of as a final line of defense for your service.

Be Happy!

Like this post? Please consider sharing, checking out my other articles, and subscribing to my weekly Flegg's Follies newsletter for more articles on software engineering and careers in tech.

Footnotes:

  1. Traditionally, the industry uses the term black-box testing to refer to behaviour testing services from the outside. The term comes from electrical engineering and refers to not being able to see the components inside the outer casing (as opposed to white-box testing, where the tester has complete visibility of the inner workings of the service). There is still some debate about whether this should be considered a non-inclusive term (unlike the terms whitelist and blacklist, which are problematic because the colour is used as a shorthand for "good" and "bad" and perpetuates concepts that have been used to oppress people of colour). That said, I choose to avoid the term in my writing – why use tech jargon that requires a paragraph of explanation when you can use plain English?
  2. Unfortunately, far too often, synthetic transaction development is an afterthought – and managers tend to put the most junior engineer on the team on the project without much support. This choice usually comes back to haunt them when they get woken up at 2 am for a false alert caused by a buggy prober.


Please note that the opinions stated here are my own, not those of my company.

Kathleen Wilson

Senior Program Manager - Microsoft Data & AI

2 years ago

This is the only way to manage and monitor services, from an outside in perspective.

Thanks for the brilliant write-up. Can synthetic transactions also be called functional tests, or is there some difference?

I presented this idea to Gates in 2001 and got my ass kicked up and down the conference room for an hour. He was wrong so I just kept trying to explain it in different ways. People in the room said, "you were like a goddamn weeble - he knocked you down and you just kept getting back up!" (as in 'Weebles wobble but they don't fall down': https://youtu.be/dFzhjnjXc2o?t=24) Apparently this was the stupidest fucking idea he had ever heard. He was so pissed that when he shouted "F***", spit flew across the table and landed on my glasses. At the end of the meeting, I ran to the men's room and stood over a toilet for 5-10 minutes because I thought I was going to puke. It was a day.
