A fresh take on time series forecasting

We introduce a new machine learning technique that outperforms XGBoost at anticipating critical EPSS (Exploit Prediction Scoring System) scores.

Although EPSS forecasting is a niche problem, the technique can likely be adapted to other engineering fields.


To get started, let's make it clear what we mean by "forecasting EPSS", and why.

The "why"

EPSS already publishes a daily probability that a CVE will be exploited within the next month, so what's the point of predicting a prediction?

To gain lead time.

If we can predict that the EPSS score of a CVE will rise past a critical threshold within the coming weeks or months, and if this vulnerability affects a widespread software component within our information system, we can anticipate the need to fix it. This ability to anticipate is of critical importance for large corporations running tens of thousands of nested software components in various versions.

The "what"

The vast majority of vulnerabilities get a daily EPSS score which almost never changes. This is what I call the EPSS graveyard. The older the vulnerability, the more likely it is to remain in this steady state for a very long time, if not forever.

What's more, when a score does change, it often does so abruptly. Typically, you will observe the score rising from the lowest percentile to some of the highest. Such abrupt changes are what I call EPSS false negatives: failed attempts from EPSS at predicting an exploit.

Leaving aside the graveyard and the false negatives, we are left with a bunch of young vulnerabilities enjoying relatively smooth, if not frequent, changes. This is the EPSS livestock. The best way to understand it is as a collection of time series.


Understanding EPSS livestock dynamics

A key finding is that we observe different regimes based on livestock maturity:

  • very young vulnerabilities (a week old, or less) tend to move rather "frantically"
  • mature vulnerabilities (a few months or more) tend to have a much more stable regime
  • between these two extremes, vulnerabilities drift towards the ends of the probability spectrum: many of them converge towards EPSS 0.01, and only a few converge towards EPSS 0.95+.

To clarify the dynamics of this livestock, we can sketch its underlying probability distribution as a three-state generalization of a Bernoulli process (a trinomial distribution), with states "Lo", "Steady" and "Hi". The "Lo" state probability pulls CVEs towards an extremely low EPSS score, and the "Hi" state probability pulls them towards an extremely high one. The "Steady" state is the most likely one; its probability equals 1.0 - "Lo" - "Hi".

Over time, the livestock becomes exposed to wear and attrition: "Lo" and "Hi" decrease. Mechanically, this increases "Steady". After a few months, only the steady state remains. This milestone marks the transition of the CVE from the EPSS livestock to the graveyard.
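To make this concrete, here is a minimal sketch of the three-state process; the initial "Lo" and "Hi" probabilities and the decay half-life are illustrative assumptions, not values fitted on EPSS data.

```python
import numpy as np

rng = np.random.default_rng(0)

def daily_draw(score, age_days, lo0=0.05, hi0=0.01, half_life=45.0):
    """One daily draw of the three-state (trinomial) process.
    lo0, hi0 and half_life are illustrative, not fitted values."""
    decay = 0.5 ** (age_days / half_life)     # wear and attrition over time
    lo, hi = lo0 * decay, hi0 * decay
    state = rng.choice(["Lo", "Steady", "Hi"], p=[lo, 1.0 - lo - hi, hi])
    if state == "Lo":
        return 0.01    # pulled towards a very low graveyard score
    if state == "Hi":
        return 0.95    # pulled towards a very high graveyard score
    return score       # "Steady": the score does not move

score = 0.18
for day in range(180):    # after a few months, only "Steady" remains
    score = daily_draw(score, day)
```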

This high-level description will be demonstrated later in this article, namely in the sampler section.

To summarize,

we want to gain an edge on potential EPSS livestock exploit outbreaks by forecasting their time series for different maturity horizons.


Time series prediction

Metamodeling

Feeding past EPSS scores into a prediction system involves chaining, or composing, two machine learning regressors:

  1. the original EPSS prediction system, the input regressor, yielding daily exploit probabilities filtered down to the EPSS livestock
  2. a new prediction system, the output regressor, which consumes this livestock to yield forecasts at different horizons.

In machine learning, chaining several models is called metamodeling.

Input predictor

In the case of EPSS, the predictor is XGBoost, a popular machine learning algorithm used in a wide range of classification and regression tasks. XGBoost relies on many hundreds of carefully curated features attached to every single CVE.

The timeframe of scores produced by this input regressor is narrowly constrained: it must fit between two successive versions of EPSS. Version 3, the latest, started in March 2023. The next one, which might be 3.1, is expected soon. As of today, this leaves us with about 15 months of data.

Output predictor

The choice of the output model is open. Given the scarcity of data points in each series, traditional time series prediction techniques are prone to fail. A more interesting approach is to use, once again, the extremely powerful XGBoost.

Here, we need far fewer features than the input model: we are mostly interested in the past data points where the EPSS score changed, and of course we need the "age" of the CVE to determine the maturity horizon we are talking about.

Our plan is simple: the more data points, the more features, the better the prediction.
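As an illustration, here is a minimal sketch of what this feature layout could look like; the rows, the hyperparameters, and even the exact feature set are assumptions for illustration, not the actual EPSS pipeline.

```python
import numpy as np
import xgboost as xgb

# Hypothetical feature matrix: one row per livestock CVE.
# Columns: EPSS at t0+30, t0+60, t0+90, t0+120, and the CVE's age in days.
X_train = np.array([
    [0.02, 0.02, 0.03, 0.05, 150.0],
    [0.10, 0.15, 0.40, 0.70, 150.0],
    [0.01, 0.01, 0.01, 0.01, 150.0],
])
y_train = np.array([0.05, 0.90, 0.01])  # score observed at the horizon

model = xgb.XGBRegressor(n_estimators=200, max_depth=3)
model.fit(X_train, y_train)

forecast = model.predict(np.array([[0.08, 0.12, 0.30, 0.55, 150.0]]))
```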

Except that... it doesn't work well. Why?

The many reasons why traditional prediction falls short

The EPSS livestock is incredibly useful, but also incredibly scant. Over the course of 15 months (the timeframe explained above), we identify only about 100 livestock CVEs out of more than 50,000, roughly half of which are EPSS false negatives. What's more, we must split the dataset into a training set and a prediction set.

It is customary to use an 80/20 split between prediction and training sets, but we can't do that here: because we are so short of true positives, we need more samples in the training set.

The ratio we use is about 66/33.

Historical depth

Even with this larger training share, an important issue for time series forecasting remains unsolved: the need for significant historical depth in the training set.

33% of samples in the training set means 5 months of historical depth out of 15 months: that's quite shallow.

Training XGBoost with this set, we can't make reliable predictions for vulnerabilities that are older than 5 months.

Feature imbalance

The issue is further complicated by "feature imbalance": suppose 150 new vulnerabilities are published each day by the input regressor. After 5 months, the training set will contain 22,500 distinct vulnerabilities, distributed as follows (a CVE has a data point at age t0+k only if it was published at least k days before the end of the window):

  1. 150 vulnerabilities will have a defined EPSS score at t0+150 (5 months after t0)
  2. 4500 vulnerabilities will have a defined EPSS score at t0+120
  3. 9000 vulnerabilities will have a defined EPSS score at t0+90
  4. 13500 vulnerabilities will have a defined EPSS score at t0+60
  5. 18000 vulnerabilities will have a defined EPSS score at t0+30

Data points at times t0+30, t0+60, t0+90, t0+120 and t0+150 are features that we feed into XG Boost to make predictions. Clearly, the number of samples for each of these features is heavily imbalanced.

Decay

Last but not least, the input model faces constant performance decay: past a certain point, it must be retrained, its version must change, and the historical record becomes stale, spurious, or biased under the new version.

When EPSS switches to version 3.1, the output model will also have to be retrained, and we will have to wait for... 3 to 5 months at least to get a new batch of decent samples of young vulnerabilities.


Markov to the rescue!

To work around these many problems, we switched to another model. The approach we followed was to build a Hidden Markov Model (HMM) that fires daily transitions between EPSS "states".

Despite its intimidating name, an HMM is quite easy to understand: we start from an EPSS score at day 0, and we use a matrix of transition probabilities to calculate a likely score the next day (day 1).

Rinse and repeat to generate a time series until we reach a score at some target date (say, at day 30).
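Here is a minimal sketch of that loop, with a made-up 3-state matrix over binned EPSS scores (the real matrices are larger, as we will see later):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy transition matrix over 3 binned EPSS states (made-up weights):
# state 0 = low score, 1 = medium, 2 = high.
T = np.array([
    [0.97, 0.02, 0.01],   # P(tomorrow's state | today = low)
    [0.05, 0.90, 0.05],
    [0.01, 0.04, 0.95],
])
bin_centers = np.array([0.05, 0.45, 0.90])

def simulate(state, days=30):
    """Chain daily transitions to generate one time series."""
    series = [bin_centers[state]]
    for _ in range(days):
        state = rng.choice(len(T), p=T[state])
        series.append(bin_centers[state])
    return np.array(series)

path = simulate(state=1)    # one plausible 30-day score trajectory
```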

Using an HMM has its pros and cons, however. The main pro is that it stays sensitive to extremely rare outliers, which is exactly what EPSS livestock events are, so that's a very good thing.

Another interesting pro, which we don't use here but may in future work, is that unlike most machine learning models, the weights can be adjusted without retraining the whole model, for example to account for a new pattern on short notice.

The main con is that an HMM, in its basic version, depends on only one feature (a single past data point) and a single matrix. It's usually good for zero-shot predictions, but not so good at generating whole time series: inaccuracy inevitably kicks in as we reuse the same matrix over and over again to make long predictions.

To leverage HMM, we must improve matrix reusability.

Boosting Markov time series

Rather than using static weights in the matrix, a common HMM practice is to use different weights every day.

Updating matrix weights every X days to better stick to EPSS dynamics.


The calculation of weights is CPU intensive: for the whole 15 months, it takes about 4 days to a week on a standard, entry-level VM. That said, the weights don't change much between two consecutive days, so we just need a sample of days (for example, one day out of every 30) to generate a few matrices (15 in our example) and interpolate between them to cover the whole period.
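A minimal sketch of that interpolation, assuming two row-stochastic matrices T_a and T_b sampled 30 days apart:

```python
import numpy as np

def interpolate(T_a, T_b, alpha):
    """Classical interpolation between two transition matrices
    sampled 30 days apart: alpha=0 gives T_a, alpha=1 gives T_b.
    A convex combination of row-stochastic matrices is itself
    row-stochastic, so no renormalization is needed."""
    return (1.0 - alpha) * T_a + alpha * T_b
```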

Interpolation is nice, but there is a better way, inspired by quantum computing.

Two years ago, I published a paper in Quantum Grad called "Implementation of a Quantum control with qDRIFT", where I discussed the qDRIFT quantum algorithm.

I won't dive into the details of quantum physics today; it's not the purpose of this article.

Let's just say that qDRIFT aims at solving an important puzzle: evolving the state of a (quantum) system by breaking it down into simpler components which are easy to implement on a quantum circuit.

In quantum computing, the evolution of a system is governed by a matrix called a Hamiltonian. This matrix changes over time, exactly like our Markov matrix.

For every time step, the new Hamiltonian is calculated stochastically by picking one of the simple components at random, according to a distribution which depends on time.

We can do exactly the same with our HMM: every day, we choose at random between two pre-calculated matrices which stand 30 days apart. The probability of picking each one depends on how far the current day is from it.
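Here is a minimal sketch of that stochastic selection, assuming two pre-calculated matrices T_a and T_b standing 30 days apart:

```python
import numpy as np

rng = np.random.default_rng(7)

def qdrift_pick(T_a, T_b, day, period=30):
    """qDRIFT-style alternative to interpolation: each day, pick one
    of the two matrices at random, weighted by how close the current
    day is to each one (the matrices stand `period` days apart)."""
    alpha = (day % period) / period   # 0.0 near T_a, close to 1.0 near T_b
    return T_b if rng.random() < alpha else T_a
```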

Compared with classical interpolation, qDRIFT requires slightly fewer calculations. But the main benefit is that we can also adjust the size of the matrices dynamically, depending on data scarcity.

Why would we do that? As we saw earlier, the number of available data points at times t0+30, t0+60, etc. is heavily imbalanced. For each monthly matrix that we use, the number of states should be driven by a trade-off between the number of available samples and the accuracy of the transitions:

  • too many states hurt accuracy when samples are too scant
  • too few states waste accuracy when samples are plentiful

At t0+30, we have 18,000 samples, so we should favor more states, hence a high-order matrix.

At t0+150, we have only 150 samples, so we should favor fewer states, hence a low-order matrix.

The thing is, we can't "morph" a higher-order matrix into a lower-order matrix using classical interpolation, because interpolation is a continuous transformation (from a topological standpoint).

But we can if we reason stochastically, like qDRIFT does! Between two days, we pick either the previous or the next matrix, regardless of its order. Simple as that.

Moving from 3 states (3x3 matrix) to 2 states (2x2 matrix) progressively.

To further optimize the transition from a high-order matrix to a low-order one, we may seize the opportunity to map one state not to a single target state but to several possible ones. For that, we introduce noise during the shapeshifting process, as sketched below. (Note that noise is not necessary if we only change weights, not shapes.)

This trick only works if the number of states in the higher-order matrix is not a multiple of the number of states in the lower-order matrix.

Opportunity of using noise when shape changing (3 is not a multiple of 2).
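Here is a minimal sketch of this noisy shapeshifting on state indices; the proximity-weighted tie-break is an illustrative choice, not the article's exact rule.

```python
import numpy as np

rng = np.random.default_rng(11)

def remap_state(state, n_hi=3, n_lo=2):
    """Project a state index from an n_hi-state space onto an
    n_lo-state space. Some states land between two target states;
    we resolve the ambiguity with noise (a proximity-weighted
    random tie-break)."""
    pos = state * (n_lo - 1) / (n_hi - 1)   # fractional target index
    low = int(pos)
    frac = pos - low
    if frac == 0.0:
        return low                           # unambiguous mapping
    return low + int(rng.random() < frac)    # noisy split between 2 states
```

With 3 states mapped onto 2, states 0 and 2 map deterministically to 0 and 1, while the middle state lands exactly between the two targets and is resolved by noise.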


Monte Carlo runs

Creating a convincing time series is one skill, but creating a useful one is another entirely: we need to generate many different time series for the same CVE to gain reliable insights about its possible future course. This is called performing a Monte Carlo simulation.
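Here is a minimal sketch of the simulation loop, reusing the hypothetical qdrift_pick helper from the earlier sketch; the matrices, state count, and run count are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def monte_carlo(T_a, T_b, start_state, bin_centers, days=22, runs=5000):
    """Simulate the same CVE many times with the qDRIFT-style chain
    and collect the distribution of its score at the target date."""
    n = T_a.shape[0]
    finals = np.empty(runs)
    for r in range(runs):
        state = start_state
        for day in range(days):
            T = qdrift_pick(T_a, T_b, day)   # hypothetical helper above
            state = rng.choice(n, p=T[state])
        finals[r] = bin_centers[state]
    return finals
```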

Typically, we need 1,000 to 10,000 Monte Carlo runs to make sense of what's going on. Here's a real-life example with CVE-2023-4966.


Monte Carlo simulation of EPSS evolution of a CVE using Markov + qDRIFT, 3 weeks forecast.

One can see that the distribution of potential future scores at t0+22 days is "almost" uniform, and, as we will shortly see, the real insights are hidden behind this word, "almost".

Most importantly, what we've just done for this CVE can indeed be generalized to all CVEs within the same Markovian state AND enjoying the same maturity, because they will all receive the same prediction.

How can we do that? Quite easily. By building a sampler.

The EPSS sampler

As one can see from the previous picture, today's EPSS score for CVE-2023-4966 is about 0.18.

Suppose we have divided the EPSS probability range into 10 states. Then 0.18, the score of CVE-2023-4966, belongs to the second state, ranging from 0.1 to 0.2.

We are thus able to sample all possible CVEs with an initial score between 0.1 and 0.2 and calculate a (discrete) Probability Density Function (PDF). We can do that for all 10 intervals, spanning the whole probability range (0.0 to 1.0).
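A minimal sketch of this sampling step, building the discrete PDF from the Monte Carlo finals collected above (and assuming the 10-interval split):

```python
import numpy as np

def forecast_pdf(finals, n_bins=10):
    """Discrete PDF of the forecast scores at the target date."""
    counts, edges = np.histogram(finals, bins=n_bins, range=(0.0, 1.0))
    return counts / counts.sum(), edges

# Example: PDF for CVEs whose current score lies in [0.1, 0.2), i.e.
# starting state 1 if scores are binned into 10 equal intervals.
# pdf, edges = forecast_pdf(monte_carlo(T_a, T_b, start_state=1,
#                                       bin_centers=centers))
```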

In each interval, we will get a PDF "centered" around the mean of the interval. Here is an example for current EPSS scores in the interval 0.2 to 0.3 (centered around 0.25).

PDF of 3 weeks EPSS forecasts for low maturity vulnerabilities with current EPSS score 0.25

The PDF will look quite different for initial scores in the interval 0.8 to 0.9.

PDF of 3 weeks EPSS forecasts for low maturity vulnerabilities with current EPSS score 0.85

If we carefully analyze all 10 intervals, we can substantiate our initial assumption that the EPSS livestock behaves roughly like a Bernoulli process:

  • In the first PDF (initial scores from 0.2 to 0.3), the "Lo" and "Hi" parameters pulling a livestock CVE towards its ultimate doom in the EPSS graveyard are clearly visible around 0.0 and 0.9 respectively. Overwhelmingly, "Lo" will steer the CVE towards a very low graveyard score.
  • In the second PDF, the "Lo" parameter is almost non-existent. Overwhelmingly, "Hi" will steer the CVE towards a very high graveyard score.

We are now all set for the final step: building the output predictor for our metamodel!

Meet EPSSilon qDRIFT, a new model

In reality, the predictor is only a stone's throw away from our sampler. That's because we solved almost all the difficult challenges when we made the sampler.

The last concept we need to introduce is percentile-based confidence intervals (PCI).

Let's take a look at all 10 intervals, from 0.0 to 1.0:

Sampling the whole EPSS livestock scope (for young maturity CVEs)

Observe how the "Hi" Bernoulli peaks behave quite differently from one interval to the next. In the first and fourth intervals, for instance, the "Hi" peaks are quite weak. The location of these weak peaks depends on the maturity of the vulnerabilities (the graph above focuses on young vulnerabilities only), but the general pattern that holds in all cases is this: we want to exclude intervals with weak "Hi" peaks to remove false positives.

What filter could we use to exclude intervals? We need to strike a balance between excluding all 10 intervals and excluding none...

To answer this question accurately, let's hark back to our original point regarding the critical threshold an EPSS livestock score must pass to be considered "risky". We want to tune our filter to keep only intervals where the threshold is passed with a certain level of confidence.

This is where PCI is useful: we experiment with different percentiles and see which ones lead to a clean delineation between weak "Hi" peaks (leading to low-score CVEs that we want to discard) and strong "Hi" peaks that we want to keep.

At this stage, it's crucially important to understand that the critical threshold is not a hyperparameter of our model: we need this threshold no matter the regressor, be it XGBoost or Markov.

The percentile, on the other hand, is a hyperparameter of the model.

What I've found to work pretty well is a critical threshold of 0.66 and a PCI of 0.90:

  • 0.66 means that, to qualify as EPSS livestock, a young vulnerability must have an initial EPSS score lower than 0.66
  • 0.90 means that when we perform Monte Carlo runs, we remove the 10% most extreme results (this removes the peaks when they are too weak) and keep the 90% most "regular" results.
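As a sketch, here is what the PCI trim could look like; the symmetric trim and the exact decision rule are my assumptions for illustration.

```python
import numpy as np

def pci_filter(finals, pci=0.90):
    """Percentile-based confidence interval: keep the central `pci`
    mass of the Monte Carlo results, trimming the most extreme runs
    (the symmetric trim is an assumption for illustration)."""
    tail = (1.0 - pci) / 2.0
    low, high = np.quantile(finals, [tail, 1.0 - tail])
    return finals[(finals >= low) & (finals <= high)]

def is_risky(finals, threshold=0.66, pci=0.90):
    """Hypothetical decision rule: flag a CVE if the trimmed runs
    still cross the critical threshold (0.66 per this article)."""
    return pci_filter(finals, pci).max() >= threshold
```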

And,

there we have it! EPSSilon qDRIFT, a regressor based on HMM, qDRIFT, Monte Carlo sampling and PCI.

Performance results

To illustrate the relevance of our approach, we share some results of a retrocasting we performed from 2023-03-12 to 2024-06-14, where the training set ranges from 2023-03-12 to 2023-08-30 and the prediction set from 2023-08-31 to 2024-06-14.

We defined two kinds of predictions which are likely to make sense for secOps; both are based on the maturity of vulnerabilities:

  • young vulnerabilities, which started being tracked by EPSS 8 days ago, with a current score lower than the critical threshold (0.66)
  • mature vulnerabilities, which started being tracked by EPSS 4 months ago (the maximum affordable from our training set, as explained earlier), with a current score lower than the critical threshold (0.66)

We exclude EPSS false negatives.

A prediction is a true positive if the output regressor correctly predicts that the vulnerability will pass the critical threshold of 0.66 within X days.

X is 3 weeks for young vulnerabilities, and a month for mature ones.

3-week predictions for young vulnerabilities

  • The training data used by XGBoost are the initial EPSS score (at t0) and the current score (at t0+8 days).
  • EPSSilon qDRIFT uses the score at t0+8 days, and two 30x30 matrices, called C and D, which were trained using only data from the training set.

We expect 3 true positives during that period.

XGBoost doesn't report any true positives; EPSSilon reports all 3 of them.

1-month predictions for mature vulnerabilities

  • The training data used by XGBoost are the scores at t0, t0+8, t0+30, t0+60, t0+90 and t0+120 (the current score).
  • EPSSilon uses the score at t0+120 days, and two 30x30 matrices, called F and G, which were trained using only data from the training set.

The aim for XGBoost and EPSSilon is to predict the score at t0+150.

We also expect 3 true positives during that period.

The result is very similar to the previous test: XGBoost doesn't report any true positives. EPSSilon reports all 3 of them.


Performance summary

  • t0+30 days: EPSSilon 100% true positives, XGBoost 0%
  • t0+60 days: XGBoost 100% true positives, EPSSilon 50%
  • t0+90 days: EPSSilon 100% true positives, XGBoost 0%
  • t0+120 days: EPSSilon 100% true positives, XGBoost 100%, with EPSSilon's true positives much better ranked
  • t0+150 days: EPSSilon 100% true positives, XGBoost 0%

Takeaway

Using metamodeling, we've shown that, in the current version of EPSS and based on 15 months of observations, the short-term and medium-term evolution of the EPSS livestock seems to obey a predictable pattern (if we exclude EPSS false negatives).

It is therefore possible for secOps teams to gain lead time and anticipate the fixing of livestock vulnerabilities relevant to their information system.

The predictable pattern depends on two criteria:

  1. the maturity of the vulnerability
  2. the interval into which its current EPSS score falls.

The pattern doesn't seem to be captured correctly by XGBoost, due to the specific challenges of EPSS livestock. It may, however, be captured by a new regressor called EPSSilon qDRIFT, specifically crafted for predicting rare outliers.
