Two Things Automated Tests Cannot Do That They Really Can

There is much discussion regarding automated tests, their capabilities, their limitations, and their differences from tests executed by humans. The comparison to human testing has grown so pointed that a particular distinction has evolved. Leaving behind decades of referring to automated procedures that measure behavior and compare it against an expected outcome as "automated tests," the increasingly popular nomenclature refers to such automated procedures as "checks." When the same thing is done by a human being, it is referred to as a "test." To keep the distinction from seeming arbitrary, a key qualitative difference is often cited as necessary for something to be a test: a test is performed to "learn," where learning in this context requires a consciousness to integrate the new information.

This is not a pointless distinction. It forces one to consider many differences that fall into place once we consider "learning" as a necessary component of testing. It is also a rather robust definition when one considers the many reasons why we perform tests. In almost all cases, the information must pass to a human being, where a decision is made. Hair-splitting of this nature can be useful, and in this case, I believe it is. But I also believe it can obscure correct understanding.

For the rest of this article, I am going to refer to automated checks sometimes as automated tests. This is not out of dismissal of the term "check" (I like the term and reach for it when I find it useful to make a point), but because it reads more smoothly when I use one term to refer to both automated and non-automated means of execution.

One of the things I see lurking behind the "test" versus "check" distinction is an anxiety that often repeats itself with any human endeavor where prior human activity is replaced with something else. Experts replaced with unskilled labor when processes become rote. Workers native to a region and labor force replaced with immigrants. Internal, engineer-driven testing processes replaced with in-production quality measurements. And, in the case of what this article is about, human testing activity replaced by automation. In all these cases, a point is made regarding capabilities or attributes of the former which cannot be satisfied by the latter. Sometimes these cases stand on solid ground. Sometimes they may be true, but the scope is narrow. Sometimes the case has little to stand on or ignores other factors which counter the objection. Behind a lot of the argument, alongside legitimate motivations, is basic anxiety over being displaced or made obsolete. It is an understandable anxiety, and one should not dismiss it with typical capitalist or technologically biased morality. But the key issue is that this anxiety sometimes motivates us to put forward arguments or observations that are not true.

There are two popular arguments about testing which I believe fall into this category:

- automated tests cannot discover issues or bugs not already anticipated
- automated tests cannot do exploratory testing

Neither of these is entirely true. There is truth in the statements, but they demand a much closer analysis to separate fact from fiction.

Refuting "Automated tests cannot discover issues or bugs not already anticipated."

My primary job for the last several years has been on the team that builds and manages the automation system for Microsoft Office. Microsoft Office has one of the largest code bases in the world (I believe that for a single product, it may be the largest). The automation system has a bank of tens of thousands of machines that execute something approaching 10 million automated tests every day. The system has been in use for just over 20 years as of the writing of this article. Approximately 2,000-3,000 engineers submit jobs to the system on a regular basis. Numerous automated tools trigger automation jobs based on a variety of code pipeline related events. Thousands of failures are reported every day. I am in a position to examine these failures regularly and observe large scale trends.

At or very near the end of a typical test, there is almost always a line that reads something like this:

   this.Log.Assert.AreEqual (expectedValue, actualValue, "Check that the thing I got matches what I wanted.");

The nature of the check varies - AreEqual, AreNotEqual, IsTrue, IsFalse. It is a common test result reporting pattern that is familiar to many testing systems (the entire xUnit family of harnesses has this, the test libraries that ship with Visual Studio work this way...). As any tester would recognize, this line is the "check," the comparison of an actual measurement against an expectation from an oracle. This is the "anticipated failure" that the test is apparently only capable of reporting.
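As a minimal illustration, here is what such an explicit check looks like in the MSTest-style Assert API that ships with Visual Studio. The ShoppingCart class and the scenario are invented for this sketch; only the assertion pattern is the point.

    using System.Collections.Generic;
    using Microsoft.VisualStudio.TestTools.UnitTesting;

    // Hypothetical product class, invented purely for this sketch.
    public class ShoppingCart
    {
        private readonly List<string> items = new List<string>();
        public int ItemCount { get { return items.Count; } }
        public void AddItem(string sku) { items.Add(sku); }
    }

    [TestClass]
    public class ShoppingCartTests
    {
        [TestMethod]
        public void AddItem_IncrementsCartCount()
        {
            var cart = new ShoppingCart();

            cart.AddItem("SKU-12345");

            // The explicit, anticipated check - the oracle the test author encoded.
            Assert.AreEqual(1, cart.ItemCount, "Check that the thing I got matches what I wanted.");
        }
    }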

When one samples the failure results from the test automation, most failures are NOT coming from the test assertions. Yes, the checks do fail often. But they are outnumbered by a large margin by other failures that were not encoded in the steps or checked with a test assertion. These failures come as exceptions thrown by underlying code, product assertions that abort the product (in debug runs), crashes in product code, freezes and hangs that stop the product from responding, and failure of the product to display interface elements that were necessary to continue stepwise execution. Add to this list other background checks for memory leaks, UI elements left open after completion of a test run, and artifacts in the product logs where the product self-reports an error state. None of these things were anticipated by the author of the test. None of them were in the oracle (the explicit oracle - I will get to implicit in a bit) that served as the check. And they dominate the test failures.
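To make the mechanism concrete, here is a minimal sketch (invented names, not the Office system's actual harness) of how a harness reports failures the test author never encoded: it wraps the test body and treats any unhandled exception or hang as a reportable failure in its own right.

    using System;
    using System.Threading.Tasks;

    public static class HarnessSketch
    {
        // Runs one test body and reports failures the author never wrote a check for.
        public static string Run(Action testBody, TimeSpan timeout)
        {
            var work = Task.Run(testBody);
            try
            {
                if (!work.Wait(timeout))
                {
                    // The product (or the test) stopped responding - a hang nobody anticipated.
                    return "FAIL: test did not complete within " + timeout;
                }
                return "PASS";
            }
            catch (AggregateException ex)
            {
                // Any unhandled exception from product or test code surfaces here,
                // whether or not the test contained an assertion for that condition.
                return "FAIL: unanticipated exception: " + ex.InnerException;
            }
        }
    }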

The truth is, automated tests report unanticipated failures not just a lot, but most of the time. So much of the time, in fact, that product teams are overwhelmed analyzing the failures to decide which ones merit fixing, and which ones do not.

This fact, indisputable in my experience, forces us to re-examine the assertion "Automated tests cannot discover issues or bugs not already anticipated." It is not true as expressed.

One key issue is that there is a difference between "not anticipated" and "not capable of observing or measuring the failure." Anticipated means the possibility was expected beforehand, and we typically deal with important anticipated failures by putting in the explicit check. That leaves us with a huge number of unanticipated failures, and now the question is "am I capable of measuring/observing them?" When a tester writes an automated check to be sure clicking "submit" adds a purchase request to a shopping cart, they are not anticipating that, while that is happening, a background thread on the server forces a cache refresh that causes the server to yield an HTTP 500 error and put up an exception page. Will that happen? I don't know - I just made that up out of an infinite number of imaginary things that might go wrong. It is ridiculous to assert the automated test anticipates this condition. And yet detecting this condition is typically trivial. Even if the test code is not reading the HTTP return values (sloppy code - but I think we have all seen it), the fact that the next thing on the UI is not going to be whatever the test needs to do next to keep going is likely to make the whole procedure abort, probably with an exception thrown by whatever libraries are allowing UI manipulation. That is not "anticipation"; that is detecting an inability to keep going, an implicit consequence of a failure that caught the test procedure by surprise.
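A sketch of what that looks like in test code. The IUiDriver interface and the element names are hypothetical stand-ins for whatever UI-driving library the test happens to use; note that nothing in the code anticipates a server error.

    using System;
    using Microsoft.VisualStudio.TestTools.UnitTesting;

    // Hypothetical stand-ins for whatever UI-driving library the test uses.
    public interface IUiElement { string Text { get; } }

    public interface IUiDriver
    {
        void Click(string elementId);
        IUiElement WaitForElement(string elementId, TimeSpan timeout); // throws if the element never appears
    }

    public class SubmitToCartTest
    {
        public void SubmitAddsItemToCart(IUiDriver ui)
        {
            ui.Click("AddToCartButton");
            ui.Click("SubmitButton");

            // The only anticipated check is the assertion below: the cart badge shows one item.
            // If the server instead returns an HTTP 500 and renders an error page, the lookup
            // below never finds the element, the UI library throws, the test aborts, and the
            // harness reports a failure the author never encoded - implicit detection, not anticipation.
            var badge = ui.WaitForElement("CartItemCount", TimeSpan.FromSeconds(10));
            Assert.AreEqual("1", badge.Text, "Cart should contain the submitted item.");
        }
    }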

There is a tendency to say these sorts of failures demand that a human come in and know how to implement a check. That is true for some classes of failure detection. Memory leak detection is very intentional, and very difficult to implement without introducing lots of false failures. Similarly for non-dismissed UI at the end of a test (I welcome arguments saying maybe we shouldn't care - but that is a separate discussion). Likewise, scanning product logs for errors. In addition to the intentional decision to even examine the product log, there is very non-accidental work required to read whatever format it uses and to correctly identify and report errors. Yes, there are a lot of classes of failure detection that are anticipated in general and that execute alongside test code not written to look for those failures specifically.
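As a small illustration of that intentional work, here is a minimal sketch of a log scan. It assumes, hypothetically, a plain-text product log where error states are marked with recognizable prefixes; the marker strings and method names are invented, and a real product log would need format-specific parsing.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    public static class LogScanSketch
    {
        // Intentional, author-written background check: flag any line where the product
        // self-reported an error, regardless of what the test steps were doing.
        public static IReadOnlyList<string> FindErrors(string logPath)
        {
            // Assumed log conventions for this sketch.
            string[] markers = { "ERROR", "ASSERT FAILED", "UNHANDLED EXCEPTION" };

            return File.ReadLines(logPath)
                       .Where(line => markers.Any(m => line.IndexOf(m, StringComparison.OrdinalIgnoreCase) >= 0))
                       .ToList();
        }
    }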

But a huge number of failures are reported, and bugs discovered, by more accidental detection. None of the code in the stack anticipates the early aborts, the exceptions, the hangs, the crashes, or the completely unanticipated UI states that throw all execution out of whack. Instead, the test harness is generally robust enough to say "hey, this test just bailed, and this is the last information we had when it happened."

So, what are we really saying when we say a test cannot report unanticipated bugs or failures?

I believe we are saying "humans are capable of observing things that software is not. Software is only able to make observations and report results within its capabilities."

For example, I have recently been working on some tests where the automation system kept reporting that the test machines had stopped functioning and were being returned to the system. When that happened, all the test logs for the test run were lost, because the automation system was no longer able to talk to the components that move test logs to the central archive. It turned out there was a bug in one of the test tools I was using that mistook the actual test harness for code that shouldn't be running and was terminating the process. It was a matter of configuration to stop that problem, but it took me a while to realize what was going on.

This was an error - in this case in the automation libraries themselves - that the system was not capable of reporting, or at least not reporting completely. It had no way of saying back to me "Something just terminated the test harness." It was an unanticipated failure, but I won't give the system credit on this one - the missing information was just too much. The test system had no way of self-reflecting, of generalizing its thoughts. To think in human terms: if I were to ask you "how is your spleen feeling today?" most people would have absolutely no answer. Except in cases of extreme injury, we are not consciously aware of our spleens. This problem was, to the automation, kind of like our relationship to our spleens.

This is a case where the human ability to report the failure exceeds the software's. My perceptions generalize. I can apply a broader awareness of how the system works. Was it a timeout? Was there interference with the communication channels? Was CPU utilization keeping the machine inaccessible during heartbeat requests? I can imagine failure modes either familiar or brand new. I can ask the software to change the way it executes, in ways that the computer would likely never construct because they are derived from my ability to generalize knowledge and from the things I can observe that the software cannot.

I used this example because it bridges the two sides. It is a case where the failure happened ON the automated test, but the reporting required exceeded the capabilities of the automated test, and a human had to engage to figure out what the failure was.

To me this is the critical distinction. Not only are humans capable of making observations and measurements that the software may not be programmed to do, but humans are also capable of on-the-fly invention of new observation methodologies. Our perception capabilities are so broad, our internalization of information so generalized, that there is no software system coming soon that will eclipse us.

But that is a long way from "cannot discover issues or bugs not already anticipated." The real difference is in the capabilities. Indeed humans, whom we fully credit with being able to notice the unanticipated, are incapable of numerous observations. A human being is incapable of knowing whether their blood sugar is too high (a great many people get the big surprise diagnosis of Type II diabetes when their doctor tells them their blood glucose is over 200 and approaching lethal levels - it happened to me). Most human beings cannot accurately distinguish and identify musical intervals, and only about 10% of human beings can accurately identify pitch (we learned when he was 8 years old that my son can do this - it freaked us all out). A human being has a very narrow perception of color compared to a butterfly. Orangutans have larger and much faster working memory than human beings. For all our stunning and amazing capabilities for learning and observation, our "blind spots" are enormous, and are only aided by technology. Yet, despite such incapability, we still grant ourselves the ability to recognize the unanticipated. I put forward the assertion that the same is true for automated tests, and that the real issue is capability to observe, not an inability to deal with the unanticipated.

Countering "Automated tests cannot do exploratory testing."

A simple, naive version of this assertion is that automated tests are deterministic in nature. Steps are pre-defined, happen in a specific or narrowly limited order, and the validation is a check stated ahead of time. This assertion is easily and trivially countered with examples of automated tests where steps are non-deterministic. Such tools are often called "monkey tests," named after the old story of an infinite number of monkeys with an infinite number of typewriters eventually typing out the entire works of William Shakespeare (by the same thinking, they will also type out everything ever written, including Fifty Shades of Grey and Atlas Shrugged, both in my opinion ample justification for shutting down the whole experiment). A counter to that rebuttal is to say that such automated tools are not tests if their behavior is not meaningful, or if they do not report results that are meaningful. This is likewise refutable by talking about weighted actions, re-training the system toward more desirable behaviors, and generalized oracles that focus more on "bad" than on "I expected exactly this state."

There is something true about asserting automated tests cannot do exploratory testing. The less deterministic the automated test behaviors, the less capable that test is of anticipating the correct outcome. The state machine becomes too large to reasonably manage. It in fact becomes too large and complicated for a human to anticipate, and while the computer could easily manage the size and complexity, it lacks the capability to know how to define the expectation - which only the human can do. There are tools for this kind of oracle definition (based on a "model-based testing" trend that Harry Robinson rather unintentionally launched years ago) which work fine on small state machines but are rendered nearly unusable when the model complexity exceeds the human capacity to imagine it. But humans are still better at noticing whether the resulting state is right or wrong after complex sequences are plucked from the ether. Rather than map out all the possibilities at the beginning, they look at the end state and "think about it" - they apply a hunch that is informed by everything from what they know about the product, about software, how they feel about their pet fish, and whether their favorite television program is going to be cancelled. Humans are good at discerning "correctness" at times when trying to get a computer to do it is untenable.

But there is also a counter to the objection, and it is like the counter to unanticipated failures. It is relatively trivial to give an automated system the ability to detect known categories of bad. Some of these are more intentionally defined (e.g., "at no time should the charge be approved when the credit card expiration date is earlier than today"), but some of these are implicit (e.g., "during the run, CPU utilization should stay above 10% and below 90%, and response latency on all requests should never exceed 1000 milliseconds," or "all HTTP responses should be in the 2xx or 3xx range").
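A minimal sketch of how such implicit oracles might look, assuming the harness collected CPU samples, request latencies, and HTTP status codes during the run. The thresholds simply mirror the examples above; the method and parameter names are invented.

    using System.Collections.Generic;
    using System.Linq;

    public static class ImplicitOracles
    {
        // Generalized "badness" checks evaluated after any run, no matter what steps executed.
        public static IEnumerable<string> Evaluate(
            IReadOnlyList<double> cpuPercentSamples,
            IReadOnlyList<double> requestLatenciesMs,
            IReadOnlyList<int> httpStatusCodes)
        {
            if (cpuPercentSamples.Any(c => c < 10.0 || c > 90.0))
                yield return "CPU utilization left the expected 10%-90% band during the run.";

            if (requestLatenciesMs.Any(l => l > 1000.0))
                yield return "At least one request exceeded 1000 ms response latency.";

            if (httpStatusCodes.Any(s => s < 200 || s >= 400))
                yield return "At least one HTTP response fell outside the 2xx/3xx range.";
        }
    }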

Given the capability of a computer to execute non-deterministic steps (a trivial thing to implement in its simple form), to weight that non-determinism toward desirable patterns (non-trivial, and demanding a lot of human thinking to craft the weights), to change behavior over time based on comparison to goals (this is real and working now, but it does demand considerable learning model design and considerable computational horsepower to do the training), and to report anomalous behavior, the end result is a form of exploration. At this point, the word "exploration" gets squishy. I am going to split it into two parts: a) doing something that had not been done before, whose outcome and possibly even the act itself is previously unknown, and b) making observations about those outcomes.

For example, a deterministic automated test might be:

1. click the "format" menu
2. check the "Underline" control
3. click "OK"
4. validate that the current selection is formatted underline

A human might do this:

- Click the "format" menu... huh... dismiss that - how does that look? Click it again.
- Hmmm, italic, underline, double underline - double underline? That is weird, I don't think my printer has a built-in capability for double underline - are any of these fonts printer fonts? I wonder what it is going to look like if I apply double underline to one of these printer fonts.

A non-deterministic automated test might do this:

- click the "format" menu
- dismiss the dialog
- click the "format" menu
- select a font (that happens to be a printer font)
- check "double underline"
- click OK
- click Print
- in the background, check if anything particularly nasty happened... but, what's a printer? Oh wait, I cannot even ask that question, because I am incapable of generalized contemplation.
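A minimal sketch of the machinery behind such a sequence: a weighted random walk over a small set of actions, with a catch-all around each step so that anything "particularly nasty" (an exception, a missing UI element) gets reported even though no specific check anticipated it. The types and method names are invented for illustration; the actual UI driving is supplied by the caller.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    public sealed class WeightedAction
    {
        public string Name;
        public double Weight;
        public Action Perform;
    }

    public static class MonkeySketch
    {
        // Runs a weighted random walk over a set of UI actions and collects anything
        // that blew up along the way.
        public static List<string> Run(IList<WeightedAction> actions, int steps, int seed)
        {
            var rng = new Random(seed);
            var failures = new List<string>();
            double totalWeight = actions.Sum(a => a.Weight);

            for (int i = 0; i < steps; i++)
            {
                // Weighted random selection of the next step - the non-deterministic part.
                double roll = rng.NextDouble() * totalWeight;
                WeightedAction chosen = actions[actions.Count - 1];
                foreach (var a in actions)
                {
                    if (roll < a.Weight) { chosen = a; break; }
                    roll -= a.Weight;
                }

                try
                {
                    chosen.Perform();
                }
                catch (Exception ex)
                {
                    // Generalized oracle: report "something bad happened," not "state X was expected."
                    failures.Add($"Step {i} ({chosen.Name}) failed: {ex.Message}");
                }
            }

            return failures;
        }
    }

A caller would register actions such as "open the format dialog," "pick a printer font," or "print," give the more interesting ones higher weights, and let the walk run for as many steps as the budget allows.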

My example here does a bit of a disservice to exploratory testing. It makes it seem entirely ad hoc, when in fact exploratory testing is often very much planned and very much based on seeking certain outcomes. I was seeking a simple example that still fits the parameters, so please appreciate that exploratory testing is a sophisticated practice and imagine that the human preceded this activity with an intent in mind.

What is the difference between the human and the automated test in the above example? They both happened to execute the same steps. They had different reasons. The human was drawing from prior experience and knowledge (such as being the owner of a very, very old printer...), and was motivated to see what would happen. The automated test was motivated by an algorithm that selects the next action based on some sort of probability.

Will the two report the same failures? Certainly not. If the printout looks "wrong," the human will notice (a common thing when printing printer fonts with non-printer attributes is getting the spacing wrong and printing over the character, or too far away from it, etc.), and at least OUR automated test, which has no built-in capacity to examine physical printouts (FWIW - lots of people have implemented these - the one in our example just doesn't), will not report that failure. But what about "anything particularly nasty happened"? In the case of our human, they may have a list of things they always check, but they may decide not to, or forget to, or do it and miss something. The automated test will slavishly check whatever it checks and report anything that it has been told looks "wrong."

In terms of the value that exploratory testing brings to the table, I would assert that the automated test and the human tester each bring something unique. Exploratory testing executes previously unanticipated steps - the final decision of what to do is sometimes made at the point of execution - may invoke previously unanticipated states, and then reports what was found. It is in the execution that the test becomes exploratory.

I anticipate the common addendum of "learning" to the definition of exploratory testing, because it is generally added as a necessary part of testing. Accepting that as a counter to calling the activity a "test" without it, I would assert that it is not a counter to the activity being exploratory. If there are two tests - both done by humans - and one is a rote series of steps while the other's steps are not pre-defined, the first fails the "exploratory" definition while the latter may fit it (more sophisticated analysis is warranted to truly define it as such). It is the difference in the way the steps are executed, and thus in how the resulting state is dealt with, that makes the test exploratory.

The difference goes back to capabilities. The human is capable of different observations while exploring. The human has different motivations for changing the exploratory behavior. The human has internalized the purpose and context of their behavior in a way that is very different from the automated test, and that gives them the ability to report failures the automated test will not. But both the human and the automated test can drive the product to do something perhaps unanticipated, even to bias that behavior to more desirable patterns, and to report measurements of the system state.

The human will learn something. The automated system will not (I consider "learning" as used in machine learning to be a separate term, even if it is spelled the same way and sounds the same). But that does not change the fact that automated tests can step away from deterministic test patterns and yield valuable bugs.

Conclusion

And that is the point of this article. When we say "automated tests cannot discover issues or bugs not already anticipated" and "automated tests cannot do exploratory testing," we refuse to utilize tools that are powerful and useful. We need to make these statements with far more precision and accuracy, because there remain critical differences between human capability and automated system capability, and making the mistake of believing an automated system cannot do something it can - something we rely on ourselves to do - will surely displace us. What we need to understand is the way our capabilities are critical to the proper use of automated systems, or are critical complements to what the automated systems truly cannot do. We need to appreciate that no automated system will be used the right way, for the right problems, without human expertise applied. Just like the industrial-revolution-era loom weavers of England tossing their shoes into the automated looms, we should recognize where our skills and talents truly distinguish us rather than deny what is coming.

P.S. Yet Another Thing A Human Can Do That the Automated Test Will Not Without Adding Capabilities

As I type this, I am hitting a weird behavior. As I edit the document, changing awkward wording and nonsense statements, Word does a really fast scroll of the document up to the top, and then back down to where I have placed my cursor. The cursor location is intact. Typing happens correctly. The document state is okay. But the rapid scroll and scroll back is visually jarring and confusing. As far as I can tell, nothing is wrong with my document and my editing is not really disrupted. It bothers me mildly, but it might really bother or perhaps harm someone with certain kinds of visual impairments or seizure disorders. My hunch is that our typical automated tests would never notice such a thing. So long as the automation API reports and behaves such that the test can maintain and check document state, the automated test would probably never report this bug.

Someday, an automated test system will exist that tracks visual state during test execution. Transitions in that state will be compared to transitions in prior "good" versions, and the fast scrolling will be reported as anomalous behavior. Very likely nobody will anticipate such a weird visual bug. More likely they anticipated occluded controls, missing buttons, wrong-sized text. But it is not impossible to imagine a visual anomaly detection system that can point out such a bug to us regardless. Some day. When that day comes, I hope I am employed in a way that does not rely solely on my ability to report such bugs, because the computer will outrun me more than a thousand-fold. I hope, instead, that I know how to bring my prior expertise to use that system more effectively.
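Purely as a toy illustration of the kind of comparison such a system might make - assuming, hypothetically, that the harness recorded the document's vertical scroll offset on every frame of both a known-good baseline run and the current run - the anomaly check could be as simple as this:

    using System;
    using System.Collections.Generic;

    public static class ScrollAnomalySketch
    {
        // Flags the run if any single-frame scroll jump is far larger than anything
        // seen in the known-good baseline recording.
        public static bool IsAnomalous(IReadOnlyList<int> baselineOffsets, IReadOnlyList<int> currentOffsets)
        {
            double baselineMaxJump = MaxFrameToFrameJump(baselineOffsets);
            double currentMaxJump = MaxFrameToFrameJump(currentOffsets);

            // "Far larger" is a judgment call; 5x the baseline is an arbitrary threshold for this sketch.
            return currentMaxJump > baselineMaxJump * 5;
        }

        private static double MaxFrameToFrameJump(IReadOnlyList<int> offsets)
        {
            double max = 0;
            for (int i = 1; i < offsets.Count; i++)
                max = Math.Max(max, Math.Abs(offsets[i] - offsets[i - 1]));
            return max;
        }
    }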

Joe Phan

Quality Awareness at Actuality One Limited

3y

What a great article, I wish many testers can reach this.


I appreciate your writing this article, Wayne Roseberry. I'd like to ask a couple of questions. Let's start with this one: what are the things that humans and machinery do such that we could call them "the same thing"?

Ajay Khandelwal

Managing Director (Product and Engineering)| Vision, Strategy and Execution

3y

Thanks for the great article, Wayne. What's your thought on the autonomous testing approach - the nuance of automated vs autonomous? I think of computerised/automated tests, where test cases and fail/pass assertions are defined and coded by an SDE, vs autonomous, being an unsupervised test. Very similar to the reinforcement learning model driving an autonomous vehicle.

SOUMEN S.

Author, Technical Leader & Manager @ Tech Companies | Software Development Methodologies

3y

Wayne: the automation suite is created to serve an altogether different purpose, namely Release Velocity. Automation suite is not there for learning -- there is no alternative to exploratory testing. Testing is an epistemic activity -- automation does not belong in this class of activity. It is deplorable that managers equate automation with testing since their sole concern is Velocity (frequently at the expense of Quality). The truth of the matter is that in most agile teams, testers are not testing anymore. Manual testing has lost its virtue, thanks to development practices and cultures such as agile and DevOps, which have created a divide in the QA space - those who can code and those who can't. You'd often hear things like, "I'm a 100% automation engineer", or "80% automation 20% manual", or even worse, "I hate manual testing". Shocking! In DevOps, we are led to believe that everything should be automated. There is no place for manual intervention, e.g. manual testing. Nowadays, most testers in an agile team struggle to keep up with the "Test Automation" demand. There is pressure to automate every story in the sprint, and there is not enough time for thorough exploratory testing.

Andrejs Doronins

SDET and course instructor on Java, Test Automation and Code Quality | I help SDETs skill up

3y

Thank you, a true breath of fresh air! "Behind a lot of the argument, alongside legitimate motivations, is basic anxiety over being displaced or made obsolete" - yes, this has been my impression from reading posts by QA people for a while now.

