When You Know The Markers You Can Bypass The Rules: Why Behavioural Machine Learning Reigns Supreme
Identifying Web Traffic with Behavioural Machine Learning
At some fundamental level, no-one really understands machine learning. Which is a good thing. In fact, if you could manually write the model you had just trained, it’s a reliable indicator you are not doing machine learning correctly.
If we consider that a machine can ‘think’ in 1,000 dimensions, while most of us struggle with more than three, we start to see the staggering potential of using machine learning to solve complex (bot) problems.
At Netacea we focus on behavioural machine learning to understand the actual behaviour of visitors as they move through a website.
The machine learning monitors all the site visits to a particular path and analyses them in context, relative to every other visitor to your enterprise estate. Building this relational matrix to show the clustering of comparative behavioural types is key to our product vision. We examine thousands of potential signals and the way they compare to each other to produce a true multi-dimensional data model.
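To make the clustering idea concrete, here is a minimal sketch - not Netacea’s actual pipeline; the feature names, example values and DBSCAN parameters are all illustrative assumptions. Each visitor is reduced to a small behavioural feature vector and clustered, so that traffic whose behaviour sits far from any dense cluster stands out for closer inspection.

```python
# Illustrative sketch only: cluster visitors by simple behavioural features.
# The feature names, values and parameters are assumptions, not a real model.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# One row per visitor: requests/minute, distinct paths, mean seconds between
# requests, error rate. A real system would use thousands of such signals.
visitors = np.array([
    [12,  8, 4.9, 0.02],   # typical human browsing
    [10,  7, 5.3, 0.01],
    [11,  9, 4.5, 0.03],
    [300, 2, 0.2, 0.40],   # high-rate, narrow-path, error-heavy traffic
])

scaled = StandardScaler().fit_transform(visitors)
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(scaled)

for row, label in zip(visitors, labels):
    status = "outlier" if label == -1 else f"cluster {label}"
    print(f"{row} -> {status}")
```

In production the feature space runs to thousands of dimensions and the models are considerably more sophisticated, but the principle of comparing each visitor against every other visitor is the same.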
Contrast this with a traditional security fingerprint. Fingerprinting is far from new as a means of collecting evidence of identity. Chinese records from the Qin Dynasty (221-206 BC) include details about using fingerprints as evidence during burglary investigations.
The Limitations of Digital Fingerprints
The digital version - analysing particular attributes of a PC, keyboard or mouse input using JavaScript - also has a long pedigree. The JavaScript needs to be inserted into every page, across both web and mobile, and will inevitably cause some latency and deployment issues.
JavaScript can simply be disabled at the browser, and although only around 1% of users typically run their browsers with JavaScript disabled, many more are selective about which sites they allow JavaScript to run on. JavaScript is also subject to being spoofed and reverse engineered, which is one reason why fingerprint-based technologies boast an ever-larger array of fingerprint types and collection methods.
Adding script obfuscation and increasingly sophisticated back-end authentication to prevent clients from spoofing adds to the JavaScript payload. Increasing the fingerprinting types and the authentication methods undoubtedly does improve detection rates, but inevitably means adding more load as the scripts become increasingly sophisticated and complex.
At Netacea we focus on the core behavioural intent of the attacks, which can’t be faked. We build up a detailed picture of the core behaviour of the visitors (both human and bot), and then use additional fingerprinting only when needed, to verify our labelled assumptions. This is a fundamentally different approach which gives us great architectural benefits, but relies on the data model to detect the anomalous behaviour we are looking for.
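As a rough sketch of how “behaviour first, fingerprint only to verify” might be wired together - the score ranges and field names below are hypothetical and not Netacea’s implementation - behaviour is scored first, and fingerprint evidence is consulted only for borderline cases.

```python
# Hypothetical two-stage flow: behavioural score first, fingerprint evidence
# only for the grey zone. Thresholds and fields are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Visitor:
    visitor_id: str
    behaviour_score: float  # 0.0 = clearly human, 1.0 = clearly automated
    fingerprint_flagged: bool = False

def classify(visitor: Visitor) -> str:
    if visitor.behaviour_score >= 0.9:   # behaviour alone is conclusive
        return "bot"
    if visitor.behaviour_score <= 0.3:   # behaviour alone is conclusive
        return "human"
    # Grey zone: fall back to fingerprint evidence to verify the label.
    return "bot" if visitor.fingerprint_flagged else "human"

print(classify(Visitor("v1", 0.95)))                            # bot
print(classify(Visitor("v2", 0.55, fingerprint_flagged=True)))  # bot
print(classify(Visitor("v3", 0.15)))                            # human
```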
Known Rules Are There To Be Broken
So what does it mean when we say we discover the core behavioural intent of the attacks? How does it work?
Let’s take an analogy: Lance Armstrong’s favourite retort during his seven-year reign over the Tour de France was that he had never failed a drug test. Armstrong simply exploited the weaknesses in the doping system. His team knew the rulebook and had never tested positive for any of the known banned substances. As the rulebook changed, Armstrong had to change and adapt his methods to stay one step ahead of the regulations.
This is similar to the current state of IP-based digital provenance and fingerprinting services. They look for a known set of bad actors, be they IP addresses, user agents or impersonators; once you have the rulebook, it becomes much easier to avoid the key tracers for a known set of flags.
Things became tougher for Armstrong when WADA introduced the biological passport, which monitored the rider’s history over time and then looked for anomalies. This was an important step, as it started to test riders randomly over a long period, not just over the course of the race. Armstrong used steroids and EPO out of season to increase muscle mass and aid recovery, but would then stop in time to avoid detection during race testing.
The biological passport meant the use of performance-enhancing drugs was much more difficult to hide, and Armstrong had to adopt two new strategies. One simple approach was to make himself very difficult to monitor, hiding away in remote locations during the year to prevent the testers from getting results that could identify the anomalous behaviour. The passport is really looking for significant changes in the markers - large changes that can’t be explained naturally, or as the result of a night out on the booze.
The more sophisticated approach was his use of micro-doping with EPO - a naturally occurring hormone produced by the kidneys - to enrich his red cell count. Artificially boosting the number of red blood cells significantly increases the body’s ability to deliver oxygen to the tissues that need it most. Micro-doping with small amounts of EPO was proven to increase performance and improve recovery time, yet was undetectable because the biological passport effectively only showed large variances in the markers.
The biological passport was looking for the large standard-deviation increase in markers that typically shows up when you inject large quantities of EPO. Micro-doping didn’t produce such an increase. Naturally occurring EPO varies from person to person, and WADA had to be reasonably generous with the thresholds to allow for this natural variance and avoid false positives, which can destroy confidence in the entire system, not to mention destroying a career overnight.
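To make the standard-deviation reasoning concrete, here is a small worked example in which every number is invented for illustration: a generous fixed z-score threshold catches a large spike in a marker but lets a micro-dose-sized change pass as natural variance.

```python
# Invented numbers, purely to illustrate why a generous fixed threshold misses
# small incremental changes while catching large spikes.
import statistics

baseline = [44.8, 45.3, 44.5, 45.9, 45.1, 44.7, 45.4, 45.0]  # marker readings
mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)

def z_score(reading: float) -> float:
    return (reading - mean) / stdev

THRESHOLD = 3.0  # generous, to avoid false positives from natural variance

for label, reading in [("large dose", 49.5), ("micro-dose", 46.0)]:
    z = z_score(reading)
    verdict = "flagged" if abs(z) > THRESHOLD else "within normal variance"
    print(f"{label}: z = {z:.1f} -> {verdict}")
```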
What finally convinced many in the professional cycling community that Armstrong was guilty of EPO doping, long before his actual confession, was the comparative behavioural data: he had raced against, and beaten, riders who later admitted to doping.
We know doping provides a considerable performance gain over the competition. To beat a proven doper without taking drugs yourself means your performance needs to be 2 or 3 standard deviations higher than the rest of the peloton - all professional athletes with roughly the same training, the same technology and weight of bike, and the same relentless dedication to winning. The comparative statistic didn’t lie. Outperforming a drugs cheat without taking drugs takes you into the realm of a superman, or a cheat.
Re-writing The Rulebook
Our behavioural engine performs the same core task to identify the bot cheats. We look at the behaviour of all the visitors and then look for the identifying clusters of behaviour across the thousands of dimensions of data that we capture - including, if necessary, any specific fingerprinting markers.
This approach is far more difficult than fingerprint detection alone, but it gives you the exact comparative data you need to identify, catch and permanently mitigate the outliers. Best of all, it shifts the heavy processing to our cloud-based services and doesn’t require yet another reverse proxy in the architecture, with the risk that creates.
The machine learning dynamically establishes what normal behaviour looks like, but critically does this over time, and by path or location within the website.
This allows us to build our models very accurately, in the context of the actual behaviour. In the case of a Single Page Application (SPA), where the logs aren’t path-specific, we still identify the behavioural markers from the multiple API calls made over time within the same page.
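As a deliberately naive sketch of that per-path comparison - the request data and threshold below are invented, not our production schema - requests are grouped by path within a single time window and each visitor is compared against their peers on the same path. For an SPA, the “path” simply becomes the API endpoint the page calls.

```python
# Sketch only: per-path peer comparison within one time window.
# Data and the 3x threshold are illustrative assumptions.
from collections import defaultdict
from statistics import mean

# (visitor_id, path) request pairs from one time window.
requests = [
    ("v1", "/api/search"), ("v2", "/api/search"), ("v3", "/api/search"),
    ("v1", "/api/basket"), ("v2", "/api/basket"),
    ("v4", "/api/login"), ("v5", "/api/login"),
    ("v9", "/api/login"), ("v9", "/api/login"), ("v9", "/api/login"),
    ("v9", "/api/login"), ("v9", "/api/login"), ("v9", "/api/login"),
]

# Count requests per visitor, grouped by path (or API endpoint for an SPA).
counts = defaultdict(lambda: defaultdict(int))
for visitor, path in requests:
    counts[path][visitor] += 1

# Flag visitors whose rate on a path is far above everyone else's on that path.
for path, per_visitor in counts.items():
    for visitor, n in per_visitor.items():
        others = [c for v, c in per_visitor.items() if v != visitor]
        if others and n > 3 * mean(others):
            print(f"{visitor} on {path}: {n} requests vs peer norm {mean(others):.1f}")
```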
Evading capture by behavioural detection methods means the bot cheats pretending to be human must try to appear as human-like as possible. They must also avoid behaving in ways that lead to an obvious discovery of intent. What they can’t do is recreate the data model they would need in order to know how to break the rules.
The rulebook doesn’t exist anymore.