Synthetic Transactions

Brett Flegg

Drawing boxes and arrows on whiteboards and writing documents nobody reads

Published: August 1, 2022

At Google, we call them probers; at Microsoft, they are called runners; more generically, they are synthetic transaction health checks – components that perform behavioural testing[1] on a service from the outside. They are a vital part of modern service design, and I would never ship a service without one – but not for the reason most people think.

First, let's talk about what makes a good synthetic transaction. Imagine you have developed a website for a bank which includes functionality to check your balance, deposit checks, and transfer money. A complete test suite for a service like this will test all boundary conditions, exercise the service at scale, and more. A good synthetic transaction, by contrast, runs through a single critical user journey end-to-end. It should be simple, reliable, and fast. In this case, we might write code to make a deposit, check the balance, and withdraw the money. Performing these operations reliably over and over again is actually a lot harder than it sounds[2]. Synthetic transaction frameworks are typically stateless and can run in parallel. Developers need to account for this in their code and deal with failures (for example, if the service crashes and restarts after depositing the funds, the next iteration will need to detect this and reset the account before proceeding).
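
The deposit, check, withdraw journey described above can be sketched roughly as follows. This is a minimal illustration, not any real framework's API: `FakeBank` is a hypothetical in-memory stand-in for the bank's service, and `run_probe` and the account name are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FakeBank:
    """In-memory stand-in for the bank's API (hypothetical, for illustration)."""
    balances: dict = field(default_factory=dict)

    def balance(self, account: str) -> int:
        return self.balances.get(account, 0)

    def deposit(self, account: str, cents: int) -> None:
        self.balances[account] = self.balance(account) + cents

    def withdraw(self, account: str, cents: int) -> None:
        if cents > self.balance(account):
            raise ValueError("insufficient funds")
        self.balances[account] = self.balance(account) - cents


def run_probe(bank, account: str, cents: int = 100) -> bool:
    """Run one deposit -> check balance -> withdraw journey; True on success."""
    try:
        # A previous iteration may have crashed after depositing, so drain
        # any leftover funds first instead of assuming a clean account.
        leftover = bank.balance(account)
        if leftover:
            bank.withdraw(account, leftover)

        bank.deposit(account, cents)
        if bank.balance(account) != cents:
            return False
        bank.withdraw(account, cents)
        return bank.balance(account) == 0
    except Exception:
        # Any unexpected error counts as a failed journey.
        return False
```

Because the framework may run several copies in parallel, a sketch like this only works if each prober instance gets its own dedicated test account; two instances sharing an account would trip over each other's balance checks.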

Once developed, the synthetic transactions should be deployed to production in one or more regions (using separate infrastructure from your primary service) and set to run periodically. Monitoring rules must then be configured, and alerts created to notify the on-call engineer of any failures. Integrating synthetic transaction monitors into a service's automated rollout system is also generally a good idea.
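
As one way to picture the rollout integration, here is a small sketch of a gate that an automated rollout could consult before advancing to its next stage. The function name and the thresholds (`min_runs`, `max_failures`) are assumptions chosen for illustration, not settings from any particular deployment system.

```python
def rollout_may_proceed(recent_results, min_runs=3, max_failures=0):
    """Decide whether an automated rollout may advance to its next stage.

    recent_results: probe outcomes (True = pass) recorded since the current
    stage was deployed, oldest first.
    """
    if len(recent_results) < min_runs:
        return False  # not enough evidence yet; keep waiting
    failures = sum(1 for ok in recent_results if not ok)
    return failures <= max_failures
```

Requiring a minimum number of runs matters here: a rollout that advances before the prober has had a chance to run against the new build learns nothing from it.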

Done, right? Not even close. While synthetic transactions are a vital part of a service's monitoring story, they should not be thought of as the primary way we monitor our services for several key reasons:

Synthetic transactions don't run nearly often enough to find issues before your customer is impacted. Most systems default to running health checks every 5 to 60 minutes and generally don't alert until at least two consecutive runs have failed. You are likely to have accrued a lot of angry customers during this time – and lost vital time debugging the issue.
It is not immediately clear who should be called into the incident for large services when a synthetic transaction fails. By their definition (or at least the one I gave), synthetic transactions should be running end-to-end critical user journeys that span multiple microservices (each of which may have its own on-call rotation). Thus, you usually need to call the entire team to troubleshoot the issue when one fails. More often than not, the first half of the call is spent debugging the synthetic transaction itself rather than identifying the actual impact on the service. This wastes valuable time and results in a lot of angry developers.
Simulators (a crucial part of synthetic transactions) are great – but they are no substitute for the real thing. The real world is messy – and failures are often caused by something you haven't anticipated (a new iOS release, a change in how a browser encodes tokens, etc.). More than once, I have had an engineer on a live-site call tell me the service is "just fine" because the synthetic transactions are passing – even when I show them a live repro on my device. The reverse is also true – simulators have bugs unrelated to the product and inevitably produce false positives, which erodes the team's trust in them.
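
To put rough numbers on the first point, here is a back-of-the-envelope sketch of how long an outage can run before a prober-based alert fires. It assumes probes run on a fixed schedule, finish instantly, and fail on every run once the outage starts; the function name is made up for the example.

```python
def alert_delay_minutes(interval_min, consecutive_failures=2):
    """Return (best, worst) minutes from outage start to the alert firing."""
    # Best case: the outage starts just before a scheduled run.
    best = (consecutive_failures - 1) * interval_min
    # Worst case: the outage starts just after a passing run.
    worst = consecutive_failures * interval_min
    return best, worst


for interval in (5, 60):
    best, worst = alert_delay_minutes(interval)
    print(f"every {interval} min: alert after {best} to {worst} min")
```

With the two-consecutive-failures rule, a 5-minute schedule alerts somewhere between 5 and 10 minutes into the outage, and a 60-minute schedule between 60 and 120 minutes.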

For all these reasons, I advise teams not to rely on synthetic transactions as the primary alerting solution for their service. Instead, I prefer to think of synthetic transactions as "traffic generators" (basically a way of ensuring there is at least some load on the service) and use 'inside' alerts at the microservice level to trigger on-call notifications. In our bank website example, this would mean creating monitors and alerts on the individual services and back-end APIs (e.g., the success rate of the deposit API, the latency of the check deposit API, etc.). When done right, these targeted alerts help ensure fast routing to the correct team and speed up troubleshooting. Synthetic transactions can be counted on to ensure there is always some load on the system so that new or under-used services still have traffic to trigger inside alerts. Alerts based on the synthetic transactions should still be configured, but they should be thought of as a 'last line of defense,' and repair items should be created to improve alerts at the microservice level anytime they are triggered.
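
An 'inside' alert of the kind described above might look something like this sketch: a sliding-window success-rate monitor for a single API that names the owning team when it fires. The class, the thresholds, and the team name are all hypothetical; a real service would normally express this in its monitoring system's own rule language rather than in application code.

```python
from collections import deque


class SuccessRateAlert:
    """Sliding-window success-rate monitor for a single API."""

    def __init__(self, api, owner, window=100, min_rate=0.99, min_samples=20):
        self.api = api
        self.owner = owner
        self.min_rate = min_rate
        self.min_samples = min_samples
        self.outcomes = deque(maxlen=window)  # keeps only the newest results

    def record(self, ok: bool) -> None:
        self.outcomes.append(ok)

    def success_rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 1.0

    def check(self):
        """Return a page for the owning team, or None if healthy."""
        if len(self.outcomes) < self.min_samples:
            return None  # too little traffic to judge
        rate = self.success_rate()
        if rate < self.min_rate:
            return f"page {self.owner}: {self.api} success rate {rate:.1%}"
        return None
```

The `min_samples` guard is where the "traffic generator" role shows up: without some steady load from the synthetic transactions, a new or rarely used API never collects enough samples for the alert to evaluate at all.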

My "great" contribution to the service my team is getting ready to launch (other than drawing boxes and arrows on a whiteboard and finding engineers to own each of the boxes) was to write the initial set of prober health checks for the service. I had a lot of fun, and it was an excellent way to delve into the details that are so easy to gloss over as an architect. Among its engineers, Google has a reputation for having a tool for everything (actually, there are usually three tools for everything: the deprecated one, the "not yet supported, but will be really cool when it is done" one, and a third, developed by a brilliant engineer who got frustrated and wrote her own after trying the other two). So, true to form, Google has a ready-made system for developing and running synthetic transactions. As a result, creating my prober mostly entailed stitching together building blocks. In a few days, I was able to build and deploy a component to automatically simulate a critical user journey from 12 undisclosed locations around the world.

In summary, probers/runners/synthetic transactions are critical to your service monitoring story. They must be simple, fast, and, most importantly, reliable. And while you should have monitors and alerts configured on the synthetic transactions themselves, they need to be thought of as a final line of defense for your service.

Be Happy!

Like this post? Please consider sharing, checking out my other articles, and subscribing to my weekly Flegg's Follies newsletter for more articles on software engineering and careers in tech.

Footnotes:

[1] Traditionally, the industry uses the term black-box testing to refer to behaviour testing services from the outside. The term comes from electrical engineering and refers to not being able to see the components inside the outer casing (as opposed to white-box testing, where the tester has complete visibility of the inner workings of the service). There is still some debate about whether this should be considered a non-inclusive term (unlike the terms whitelist and blacklist, which are problematic because the colour is used as a shorthand for "good" and "bad" and perpetuates concepts that have been used to oppress people of colour). That said, I choose to avoid the term in my writing – why use tech jargon that requires a paragraph of explanation when you can use plain English?
[2] Unfortunately, far too often, synthetic transaction development is an afterthought – and managers tend to put the most junior engineer on the team on the project without much support. This choice usually comes back to haunt them when they get woken up at 2 am by a false alert caused by a buggy prober.

Please note that the opinions stated here are my own, not those of my company.

Kathleen Wilson

Senior Program Manager - Microsoft Data & AI

2 years ago

This is the only way to manage and monitor services, from an outside in perspective.

Kapil Jain

2 years ago

Thanks for the brilliant write up. Can the Synthetic Transactions be called as functional tests also? Or there is some difference?

Jeffrey Snover

2 years ago

I presented this idea to Gates in 2001 and got my ass kicked up and down the conference room for an hour. He was wrong so I just kept trying to explain it in different ways. People in the room said, "you were like a goddamn weeble - he knocked you down and you just kept getting back up!" (was in 'weebles wooble but they don't fall down': https://youtu.be/dFzhjnjXc2o?t=24) Apparently this was the stupidest fucking idea he had ever heard. He was so pissed that when he shouted "F***", spit flew across the table and landed on my glasses. At the end of the meeting, I ran to the men's room and stood over a toilet for 5-10 minutes because I thought I was going to puke. It was a day.
