AI and the End of Integration Testing
Every time there is a fundamental change in how software is built or delivered, there is a corresponding change in the toolchain and common practices for building it. The desktop era had waterfall and manual testing; the cloud era had CI/CD, automated testing, and a greater dependence on telemetry. AI and LLMs will be no different.
We are used to being able to perform both unit and integration tests on software before we release it. Both involve defining a fixed environment and domain that we can test inside of, looking for the behaviors we want. Integration testing, particularly for large-scale distributed services, is notoriously hard because you have to simulate a large environment and many interactions very accurately to get a meaningful and complete result. Programmers still do a lot of "smoke testing" and manual "checking", because fully integration testing anything is harder than we like to admit. As an industry, we rely on telemetry and user reporting to catch what we miss instead.
AI, particularly more independent agents, is going to finish breaking our ability to do integration testing in advance, and it is going to usher in a different kind of development pattern, or at least an evolution of the current one.
Why is integration testing going to become impossible? Largely because the "aperture" of what is possible for a program (or agent) is going to become essentially infinitely wide. We are already seeing some early attempts at this. The list of things an agent can interact with, and the actions it might take, is essentially infinite - as large as natural language, plus all of the APIs it can reach, plus the full complexity of the real world. Even if you could set up a full replication of "the real world" as a sandbox to test in, the combinatorial complexity is far too high to test even a representative sample.
What does this mean for testing? How do we build and operate safe software in a world where we can't test as much as we want? I don't know - no one fully does - but I think this will drive us to be more telemetry-focused, and to build "self-checking" systems that cost more in compute but do more checking and correction at runtime instead of at test time.
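To make that shift concrete, here is a minimal sketch of what runtime self-checking might look like, assuming hypothetical `propose_action`, `check_action`, and `revise_action` stand-ins for an agent, a checker model, and a repair step (none of these are real APIs): every step the agent proposes is verified, and corrected or escalated, before it takes effect.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    """Result of a runtime check on a proposed agent action."""
    ok: bool
    reason: str = ""

def run_with_runtime_checks(task, propose_action, check_action, revise_action,
                            max_attempts=3):
    """Run one agent step, verifying it at runtime instead of at test time."""
    action = propose_action(task)
    for attempt in range(1, max_attempts + 1):
        verdict = check_action(task, action)   # e.g. a classifier or judge model
        if verdict.ok:
            return action                      # safe to execute / emit
        # Record the failed check as telemetry, then ask for a corrected action.
        print(f"runtime check failed ({verdict.reason}); revising, attempt {attempt}")
        action = revise_action(task, action, verdict.reason)
    raise RuntimeError("action never passed runtime checks; escalate to a human")
```

The extra compute lives in `check_action`: that is exactly the cost moving from test time to runtime.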
This is hard to do - we aren't talking about measuring things like latency or crashes, but about measuring more "semantic" properties like "safe", "helpful", "nice", or "making progress". These will likely require their own classifiers and inference to do well, which is of course a hard security problem at scale. We will have to learn to use a lot more compute to monitor and manage agents at scale - single prompts can mostly be tested in isolation today, but more complex agents won't be testable in the same way.
We call this idea "semantic telemetry" because of that need to measure semantic properties in real time. It's a challenge! There's no absolute measure of, say, "helpful". There can only be examples and rubrics and, hopefully, a stable relative measure on some fixed scale. It may be that we will, as an industry, produce "common behavior rubrics" and start to do things like certify that an agent adheres to them - hard to say.
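As an illustration only (the rubric text, the 0-4 scale, and the `judge` callable below are assumptions, not a proposed standard), a semantic-telemetry event might look something like this: score an interaction against a rubric on a fixed relative scale, spend the extra inference to do it, and emit the result alongside ordinary telemetry.

```python
import json
import time

# Hypothetical rubric: anchors a relative "helpful" score, not an absolute measure.
HELPFULNESS_RUBRIC = {
    "property": "helpful",
    "scale": [0, 4],
    "levels": {
        0: "ignores or contradicts the user's request",
        2: "partially addresses the request, with gaps or digressions",
        4: "fully addresses the request with actionable, correct detail",
    },
}

def emit_semantic_telemetry(transcript, judge, rubric=HELPFULNESS_RUBRIC):
    """Score one interaction against a rubric and emit it as a telemetry event."""
    score = judge(rubric, transcript)          # extra inference, paid at runtime
    event = {
        "ts": time.time(),
        "property": rubric["property"],
        "score": score,
        "scale": rubric["scale"],
    }
    print(json.dumps(event))                   # stand-in for a real telemetry sink
    return event
```

The point is the shape of the measurement: relative, rubric-anchored, and collected continuously rather than asserted once in a test suite.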
There are other testing challenges coming too - regression testing is another that seems apparent, and it is similarly complex and similarly in the semantic realm. Because so much of what these agents do is open-ended, anything that depends heavily on fixed behaviors of the base model will likely be too brittle, but there will still be properties like verbosity, or maybe 'basic intelligence', that more complex programs depend on. How do we specify and test for these? Is there a real-time component where we can predict which model will work correctly and schedule an inference accordingly ("semantic scheduling" and optimization)?
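One way that scheduling question could be sketched, purely as an assumption about how it might work: the program declares minimum scores for the semantic properties it is brittle to, and the scheduler routes the request to a model whose continuously measured scores meet them. The model names and the score table below are invented for illustration; in practice the numbers would come from ongoing semantic-telemetry measurements like the rubric scoring above.

```python
# Hypothetical rolling averages of rubric scores per model, on a 0-4 scale.
MEASURED_SCORES = {
    "model-a": {"helpful": 3.6, "verbosity": 2.1, "basic_reasoning": 3.8},
    "model-b": {"helpful": 3.2, "verbosity": 1.2, "basic_reasoning": 3.1},
}

def schedule_model(requirements, scores=MEASURED_SCORES):
    """Return the first model whose measured semantic scores meet every minimum."""
    for model, measured in scores.items():
        if all(measured.get(prop, 0) >= minimum for prop, minimum in requirements.items()):
            return model
    raise LookupError("no available model currently satisfies the semantic requirements")

# Example: a program that is brittle to verbosity drift and needs baseline reasoning.
print(schedule_model({"verbosity": 2.0, "basic_reasoning": 3.5}))  # -> "model-a"
```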
There are likely even more problems that will emerge as larger teams begin to build more and more complex programs and agents that use LLM base models and other AI models as programming objects. The era of semantic engineering is beginning, and we will have to find the new development patterns that work for it.