The Requirements for a Contextual Big Data Behavioral Analytics Engine
Steve King, CISM, CISSP
Cybersecurity Marketing and Education Leader | CISM, Direct-to-Human Marketing, CyberTheory
Big data is the current rage in cybersecurity. We all want to stop malware before it strikes in the form of a breach. We all want smart analytics software that can reduce tens of millions of data points into a meaningful spoonful of knowledge we can act upon, detecting and then deterring the bad guys before they capture our crown jewels. Ideally, we want software to do all of that for us so we don't need scores of security analysts poring over minute bits of data, pedaling fast but going nowhere.
We hear that Big Data can do these things and yet we have seen very little progress so far.
There has been the general perception that throwing lots of network data at a big data engine is the path to identifying and then stopping malicious behavior. There are two significant problems with this theory:
1. A big data analytics tool is only as good as the content from the data sources that feed it, and
2. Analysis without context fails to establish threat relevance and is useless for defense, detection and remediation.
Typical data sources such as log files, NetFlow and baselines are missing all of the key indicators of malicious behaviors, and instead depict activity that appears to typical data analytics engines as benign traffic.
As malware continues to evolve, and as insiders operate largely in stealth mode with a growing understanding of how these data analytics engines are constructed, fewer and fewer of the recognized indicative data elements show up in these logs, flows and baselines.
In addition, today’s coordinated attacks are multi-stage and multi-vector. Because traditional big data analytics examines discrete events out of context, it misses the subtle patterns and sequences of related behaviors that cyber-criminals now use consistently across the global threat landscape to assemble an effective Attack-in-Depth invasion model. Attack-in-Depth is a summary version of the once-popular cyber kill-chain model: it works by delivering payloads, persisting on endpoints, taking hold across the network, and exfiltrating or destroying information assets.
To successfully combat these Attack-in-Depth threats we must shift our approach to contextual data analytics.
An effective contextual analytics engine must be fed the otherwise hidden indicators of malicious behaviors, indicators that are only detected with the right type of analysis.
These analytic engines must use designed algorithms that are constructed to detect both structured and unstructured malicious behaviors within the context of a specific threat envelope. That threat envelope must be informed by patterns of behavior occurring outside the network and across a spectrum of threat landscape external to the operation. And, these engines must be able to operate on this data in real time to identify and isolate an infection after a network has been invaded and before the assets can be breached.
At their core, analytics engines typically follow one of four primary reasoning methodologies:
Deductive Reasoning – Deductive reasoning is based in the theory of deductive inference, which draws specific conclusions from general rules: if A = B and B = C, then A = C, regardless of what A or B contain. Deductive reasoning tracks from a general rule to a specific conclusion; if the original assertions are true, then the conclusion must be true. A fundamental weakness of deductive reasoning is that it is often tautological (e.g., malware always contains malicious code) and unaffected by contextual inputs. For example: to earn a master’s degree, a student must have 32 credits; Tim has 40 credits, so Tim will earn a master’s degree. Except when he decides not to.
In security analytics, A only equals B most of the time, and sometimes it can equal D, so A cannot always equal C. Using deductive reasoning as a basis for detection analytics is therefore a flawed way to try to predict the future: you are theoretically guaranteed to be wrong at least once.
In general, common signature-based systems such as IDS/IPS and endpoint security are deductive in nature.
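A minimal sketch of this deductive, signature-based approach may make the weakness concrete. The signatures and payloads below are invented for illustration; the point is that the rigid rule "contains a known signature, therefore malicious" misses anything the rule set does not anticipate.

```python
# Deductive rule: if a payload contains a known signature, conclude malicious.
# Signatures here are illustrative stand-ins, not real indicators.
KNOWN_SIGNATURES = {b"\x4d\x5a\x90\x00evil", b"cmd.exe /c powershell -enc"}

def is_malicious(payload: bytes) -> bool:
    """Apply the general rule to a specific observation."""
    return any(sig in payload for sig in KNOWN_SIGNATURES)

# A payload matching the rule is caught...
print(is_malicious(b"prefix cmd.exe /c powershell -enc AAAA"))   # True
# ...but a trivially mutated variant (extra spaces) evades the rule entirely,
# which is the "wrong at least once" failure described above.
print(is_malicious(b"prefix cmd.exe  /c  powershell  -enc AAAA"))  # False
```

Nothing outside the enumerated general rules is ever flagged, regardless of context.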
Inductive Reasoning – Inductive reasoning is the opposite of deductive reasoning. Inductive reasoning makes broad generalizations from specific observations. In inductive inference, we go from the specific to the general. We make many observations, discern a pattern, make a generalization, and infer an explanation or a theory.
Where analytics engines are based on inductive reasoning, the resulting analytics resemble probability theory. Even if all of the premises are true in a statement, inductive reasoning allows for the conclusion to be false. Here’s an example: "Harold is a grandfather. Harold is bald. Therefore, all grandfathers are bald." The conclusion does not follow logically from the statements.
This is a better approach than deductive reasoning for projecting the future, but it is obviously imperfect and can produce even more widely varying results.
Advanced IDS/IPS systems use inductive reasoning heuristics to identify malicious behaviors. A heuristic is a rule that provides a shortcut to solving difficult problems and is used when an observer has limited time and/or information to make a decision. Contemporary IDS/IPS systems frequently use such heuristics to generalize the probability of malicious behaviors from limited input (e.g., known signatures). Inductive reasoning heuristics lead you to a good decision most of the time, but most of the time is not good enough for advanced threat defense.
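The inductive pattern can be sketched in a few lines: generalize a probability from a small set of labeled observations, then apply that generalization to new traffic. The sample data and feature names are assumptions made up for illustration.

```python
# Inductive generalization: from specific labeled observations to a
# probability estimate, applied as a heuristic to future observations.
observations = [
    ("port_445_scan", True), ("port_445_scan", True),
    ("port_445_scan", False),
    ("dns_lookup", False), ("dns_lookup", False), ("dns_lookup", False),
]

def malicious_rate(feature: str) -> float:
    """Generalize: fraction of past observations of this feature that
    turned out to be malicious."""
    seen = [label for f, label in observations if f == feature]
    return sum(seen) / len(seen) if seen else 0.0

# The heuristic is right "most of the time" on the sample...
print(malicious_rate("port_445_scan"))  # ~0.67, so flag as likely malicious
# ...but even with all premises true, the conclusion can be false: the
# next port-445 scan may be a legitimate vulnerability audit.
```

As with the Harold-the-bald-grandfather example, the premises can all be true while the generalized conclusion fails.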
Bayesian or Recursive Bayesian Estimation (RBE) Reasoning – This analytic approach is anomaly-oriented and is used in security systems to provide a less tactical view of what’s happened over an extended timeframe (e.g. 30-60 days). Bayesian reasoning is a branch of logic applied to decision making and inferential statistics that deals with probability inference: using the knowledge of prior events to predict future events.
In statistics, “standard deviation” is a measure that is used to quantify the amount of variation or dispersion of a set of data values. A standard deviation close to 0 indicates that the data points tend to be very close to the mean value of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values.
In most Bayesian-based security analytics, when a result is 3 standard deviations from normal, the system declares it an “anomaly.” The goal of Bayesian reasoning is to identify a “normal” pattern of behavior by observing subtle fluctuations in activity within the enterprise infrastructure over a period of time, establishing a corpus of “prior events.” The result is a baseline, used as a subsequent “benchmark” against which all network activity and/or behaviors will be measured in the future.
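The baseline-and-deviation mechanic described above can be sketched in a few lines. The traffic figures are invented for illustration; the logic is simply "learn mean and standard deviation from a window of prior events, flag anything beyond 3 sigma."

```python
import statistics

# A window of "prior events," e.g. MB/hour of outbound traffic from a host.
baseline = [100, 102, 98, 101, 99, 103, 97, 100]
mean = statistics.mean(baseline)    # 100.0
stdev = statistics.stdev(baseline)  # 2.0

def is_anomaly(value: float, sigmas: float = 3.0) -> bool:
    """Declare an anomaly when a value deviates more than `sigmas`
    standard deviations from the learned baseline mean."""
    return abs(value - mean) > sigmas * stdev

print(is_anomaly(104))  # False: within normal variation
print(is_anomaly(160))  # True: flagged as a 3-sigma anomaly
# The flaw: if 160 MB/hour of exfiltration had been occurring during the
# baseline window, it would have been absorbed into "normal."
```

The closing comment anticipates the problems discussed next: the baseline is only as trustworthy as the window it was learned from.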
Unfortunately, this baselining concept is flawed and can lead to extraordinary outcomes, none of which result in properly identified threats.
There are three significant problems with this approach:
1. If the network and/or the systems being baselined are already infected before the baseline is created, then the baseline establishes a false premise,
2. If an insider is already active on a network, then that insider’s actions will appear nominal and become part of the “normal” baseline, and
3. Today’s network infrastructure and user behavior are increasingly dynamic, variable and diverse, involving many different devices, protocols, access methods and entry points, essentially making a baseline assessment impossible without a network lockdown.
Analytics engines that use baselining as their premise for Bayesian Reasoning are prone to extreme volumes of false positives, are cumbersome and difficult to tune and administer, require lots of human attention and frequently miss malicious invasions. In short, they don’t work very well.
Abductive Reasoning – Abductive reasoning is a form of logical inference that proceeds from an observation to a hypothesis that accounts for the observation, seeking the simplest and most likely explanation. In abductive reasoning, unlike in deductive or inductive reasoning, the premises do not guarantee the conclusion. This approach is much better suited to the real world of malicious network attacks.
Abductive reasoning typically begins with an incomplete set of observations and proceeds to the likeliest possible explanation for the set. Abductive reasoning yields the kind of daily decision-making that does its best with the information at hand, which often is incomplete.
A medical diagnosis is an application of abductive reasoning: given this set of symptoms, what is the diagnosis that would best explain most of them? Likewise in our jurisprudence systems, when jurors hear evidence in a criminal case, they must consider whether the prosecution or the defense has the best explanation to cover all the points of evidence. While there may be no certainty about their verdict, since there may exist additional evidence that was not admitted in the case, they make their best guess based on what they know.
While inductive reasoning requires that the evidence that might shed light on the subject be fairly complete, whether positive or negative, abductive reasoning is characterized by an incomplete set of observations, either in the evidence, or in the explanation, or both, yet leading to the likeliest possible conclusion.
A patient may be unconscious or fail to report every symptom, for example, resulting in incomplete evidence, or a doctor may arrive at a diagnosis that fails to explain several of the symptoms. Still, he must reach the best diagnosis he can. Probabilistic abductive reasoning is a form of abductive validation, and is used extensively and very successfully in areas where conclusions about possible hypotheses need to be derived, such as for making diagnoses from medical tests, working through the judicial process or predicting the presence of malware.
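Probabilistic abductive reasoning can be sketched as hypothesis scoring: given an incomplete set of observations, score each candidate explanation by how well it accounts for what was actually seen, and commit to the likeliest one. The hypotheses and likelihood values below are invented for illustration.

```python
# Likelihood of each symptom under each candidate hypothesis (illustrative).
HYPOTHESES = {
    "ransomware":   {"mass_file_writes": 0.9, "beaconing": 0.4, "odd_logins": 0.2},
    "data_theft":   {"mass_file_writes": 0.2, "beaconing": 0.8, "odd_logins": 0.7},
    "benign_admin": {"mass_file_writes": 0.3, "beaconing": 0.05, "odd_logins": 0.3},
}

def best_explanation(observed: list) -> str:
    """Return the hypothesis with the highest joint likelihood for the
    (possibly incomplete) observations -- a best guess, not a certainty."""
    def score(h):
        s = 1.0
        for obs in observed:
            s *= HYPOTHESES[h].get(obs, 0.01)  # unknown symptoms barely count
        return s
    return max(HYPOTHESES, key=score)

# Only two of three possible symptoms were observed, yet we still commit
# to the likeliest explanation -- like a diagnosis from partial evidence.
print(best_explanation(["beaconing", "odd_logins"]))  # data_theft
```

As in the medical and jury analogies, the conclusion may not explain every symptom, and new evidence could overturn it, but it is the best available guess.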
Most security solutions today focus on events, processing data from billions of events in an attempt to detect malicious behaviors. This approach is extremely limited in its effectiveness: it fails to scale, it generates enormous volumes of false positives and noise, and it does virtually nothing to parse and reduce that volume of data into anything actionable, even with automated heuristic analytics. Events are interesting, but taken alone and out of context they yield very little useful information while creating a large corpus of data points.
What is needed is an engine that examines evidence and creates infection profiles against specific network entities (servers and endpoints). This is a crucial distinction, since a network may contain thousands of systems, each generating thousands of profiles. In comparison, event-based systems like SIEM must process hundreds of millions of events on the same network infrastructure. This presents security analysts with three often overwhelming challenges: event overload, false positives and lack of context.
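The entity-centric idea can be sketched by folding an event stream into per-system profiles, so the analyst reasons about thousands of profiles rather than millions of discrete events. The event shapes and stage names below are assumptions for illustration.

```python
from collections import defaultdict

# An event stream: each event alone looks unremarkable (illustrative data).
events = [
    {"host": "srv-01", "stage": "delivery"},
    {"host": "srv-01", "stage": "persistence"},
    {"host": "ws-17",  "stage": "delivery"},
    {"host": "srv-01", "stage": "exfiltration"},
]

# Pivot on the entity, not the event: accumulate attack stages per host.
profiles = defaultdict(set)
for e in events:
    profiles[e["host"]].add(e["stage"])

# A host that has progressed through several attack stages stands out as
# an infection profile, even though no single event did.
suspicious = [h for h, stages in profiles.items() if len(stages) >= 3]
print(suspicious)  # ['srv-01']
```

The join is on the system rather than across independent event variables, which is what keeps the data volume tractable.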
Many systems use log data in an attempt to discover and detect malware, but logs by their nature are event driven and today’s well-written malware often does not leave a trail in the logs. Traditional log-based data analytics approaches are required to sort through millions of log events in order to correlate those events in any meaningful way. The objective is to somehow convert tens of millions of discrete elements into behavioral patterns.
By pivoting on events rather than systems, today’s data analytics engines must join completely independent variables in an attempt to construct meaningful behavioral relationships, and therefore must treat every event as potentially significant. This requires tremendous (and expensive) processing power and by necessity produces a very low signal-to-noise ratio, triggering, as we have said, many false positives and, worse, false negatives.
In fact, a growing concern with these Bayesian, deductive and inductive reasoning engines is the false negative: an infected system that is reported as uninfected. To compensate for this tendency, the engines’ sensitivity is frequently tuned to err on the side of caution, thus creating even more false positives.
But as we’ve described, finding the needle in the haystack is only half the challenge. An effective behavioral analytics malware solution also requires context. Network analytics without context is a lot of noise, without any actionable data.
What we need to do is frame the malicious activities in the context of risk. For example, knowing an exploit is targeting a given system has only limited value. Knowing an exploit is targeting a system that is missing a patch which is making it vulnerable to the exploit has extremely high value. The first indicator is noise, but the indicator in the right context is suddenly actionable.
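The patch example can be sketched as a simple context join: the same exploit indicator is noise against a patched host and actionable against an unpatched one. Host names and the CVE identifier are illustrative assumptions.

```python
# Context source: which patches each host is missing (illustrative inventory).
missing_patches = {
    "web-01": {"CVE-2017-0144"},  # unpatched, vulnerable to this exploit
    "web-02": set(),              # fully patched
}

def triage(alert: dict) -> str:
    """Frame the indicator in the context of risk: exploit activity is
    only actionable when the targeted host is actually vulnerable."""
    if alert["cve"] in missing_patches.get(alert["target"], set()):
        return "actionable"
    return "noise"

print(triage({"target": "web-01", "cve": "CVE-2017-0144"}))  # actionable
print(triage({"target": "web-02", "cve": "CVE-2017-0144"}))  # noise
```

The indicator itself is identical in both cases; only the joined context changes the verdict.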
Context refers both to system context and to the real-time nature of the contextual evidence. Effective analytics must track real-time activities in order to successfully identify true malicious behaviors. Security solutions that rely on sandboxes to detonate potentially malicious payloads based on state and binary configuration alone are, by definition, out of date by the time the last system is scanned. Malicious attack vectors change in real time; they don’t remain in state while other vectors are being detonated for analysis.
A successful behavioral analytics engine will correlate real time vulnerability assessments with ongoing threat activity. This involves pulling in Integrity Measurement and Verification (IMV) scans and LDAP (e.g. Active Directory) attributes in real time to deliver the necessary context for security analysts to make decisions.
You want full interoperability with external vendor products through connectors, inbound REST APIs, and industry-standard notation (grammar expressions) for attribution-based threat information exchange within the security community. And for additional context, you want an engine that integrates daily threat intelligence harvested through honeypots, technology partners, and security advisories published by standards-based organizations (e.g., NIST, MITRE, US-CERT).
Another way to look at this is the shift from North-South to East-West (lateral) movement within the network. Specifically, we need detection mechanisms in place to detect malicious insider activities, and the analytics to discern the actual behaviors, their maliciousness and the context in which they operate.
The way most tools are attempting to identify malicious insiders is by using NetFlow data. NetFlow was originally developed as a network protocol by Cisco to help network engineers plan network infrastructure, optimize performance, and better manage traffic and routing. But there are limitations to NetFlow’s effectiveness as a data source for analytics.
NetFlow was never meant to be a security tool, partly because of the limited information it carries, but also because everything contained therein is historical in nature. There is no real-time aspect to NetFlow. Malicious insiders operate in, and must be apprehended in, real time!
To address these NetFlow limitations, an effective behavioral analytics engine will have to use a data exchange protocol that operates in real time and creates flow data that is stateful in nature. You want a protocol that can establish an initial record and then generate real time update records during the flow.
To establish context, these records need to carry many attributes beyond basic flow metrics, including network addresses, service ports, geolocation indicators, data counters, connection states, threat tags, DNS transactions, connection signals (timeouts, resets), and so on.
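A stateful flow record of the kind described above might look like the sketch below: an initial record is created when the flow starts, then updated in real time as the flow progresses, accumulating context attributes. The field names are assumptions, not a real protocol definition.

```python
from dataclasses import dataclass, field

@dataclass
class FlowRecord:
    """A stateful flow record: created at flow start, updated live,
    carrying context attributes beyond basic NetFlow metrics."""
    src: str
    dst: str
    service_port: int
    geo: str = "unknown"                # geolocation indicator
    state: str = "new"                  # connection state, updated live
    bytes_out: int = 0                  # data counter
    threat_tags: set = field(default_factory=set)

    def update(self, state: str, bytes_out: int, tags: frozenset = frozenset()):
        """Emit a real-time update to the same record, rather than a
        single historical export after the flow has ended."""
        self.state = state
        self.bytes_out += bytes_out
        self.threat_tags |= tags

flow = FlowRecord("10.0.0.5", "203.0.113.9", 443, geo="RU")
flow.update("established", 1_500)
flow.update("established", 250_000_000, frozenset({"possible-exfiltration"}))
print(flow.threat_tags)  # {'possible-exfiltration'}
```

The key contrast with NetFlow is the update method: the record evolves while the flow is still alive, so analytics can act before it ends.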
This would be a paradigm shift, pivoting from network access management to flow entropy management, where the operational integrity of internal and cross-realm network activities and data transfers would be able to be examined. You would then be able to detect lateral data movement, signaling (callbacks, beacons, dial homes), and data exfiltration using flow logic based event correlation in real time.
This would then open the door for tactical engagement to intervene in active flows that may be malicious. The threat protocol would be extensible and customizable based on policies that could be defined using a set of variables, criteria and actions. The policy variables would provide attribution about the monitored systems (pivot points), the criteria would identify a qualifying episode based on flow variables, and the actions would prescribe the workflow for automated remediation and incident response.
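The variables/criteria/actions policy model described above can be sketched as follows. All names, thresholds and actions are illustrative assumptions.

```python
# A policy: variables give attribution about monitored pivot points,
# criteria qualify an episode from flow attributes, and actions prescribe
# the automated remediation workflow (all values illustrative).
PIVOT_HOSTS = {"db-01", "db-02"}

policy = {
    "variables": {"pivot_hosts": PIVOT_HOSTS},
    "criteria": lambda flow: (flow["src"] in PIVOT_HOSTS
                              and flow["bytes_out"] > 100_000_000),
    "actions": ["quarantine_host", "open_incident"],
}

def evaluate(policy: dict, flow: dict) -> list:
    """Return the prescribed workflow if the flow qualifies as an episode,
    otherwise no action."""
    return policy["actions"] if policy["criteria"](flow) else []

# A large outbound transfer from a monitored database host qualifies...
print(evaluate(policy, {"src": "db-01", "bytes_out": 500_000_000}))
# ...while the same transfer from an unmonitored workstation does not.
print(evaluate(policy, {"src": "ws-17", "bytes_out": 500_000_000}))
```

Local policies would then be defined the same way against the internal topology, alongside the out-of-the-box global defaults discussed next.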
In addition, a set of default global policies should be available out-of-the-box to detect risky (suspect) behaviors, and an easy API should allow security analysts to define local policies based on detailed information about the internal network topology and silos. Policies should also be easily transferable (preserving the privacy of internal network topology) for sharing threat definitions within the security community.
And of course, we need to be able to transform and correlate multiple external threat feeds into a concentrated and usable set of indicators that security analysts can leverage to anticipate broader threats beyond those implied by the local IT infrastructure.
To successfully counter and defeat malware and malicious insiders, we will require better data analytics, not big data analytics. At a deeper level, we need the right analytics methods using the right detection engines and delivering evidence in real time and within context about the systems we are protecting.