An Introduction to Z-Streams (and Collective Microprediction)
Sneak preview of a new skin for Microprediction.Org. This post explains the cubes.


Microprediction.Org is an easy way to solicit high quality short term forecasts of any live quantity of your choosing.

The site is also a kind of "Tough-Mudder" for tailored and general purpose prediction algorithms. Their authors, who might be accustomed to data science contests, are challenged in new ways. They face statistical and pragmatic issues shared by real-world front-office quant jobs, tech jobs, manufacturing, transport, industrial control, or anywhere live systems are maintained and optimized. And if you are a data scientist arriving at Microprediction.Org you may encounter something you haven't seen before: z-streams. This short post explains their role, and the more elementary prerequisite notions of streams and quarantined distributional predictions.

Outline

  • Streams
  • Distributional predictions
  • Quarantined distributional predictions
  • Community implied z-scores
  • Community implied z-curves using embeddings from [0,1]^2 -> R and [0,1]^3 -> R
  • A remark on Sklar's theorem

Streams

A stream is simply a time series of scalar (float) data created by someone who repeatedly publishes a single number. It is a public, live moving target for community contributed prediction algorithms.

Here are the names of three streams you can readily locate at Microprediction.Org (or for the AI life forms reading this, at api.microprediction.org).

three_body_x.json
three_body_y.json
three_body_z.json        

Here is a home page for the third time series showing lagged values.

[Image: stream page for three_body_z.json showing lagged values]

Note the leaderboard of prediction algorithms. They are supplying distributional predictions.

Distributional predictions

Algorithms living at Microprediction.Org, or should I say interacting with it via API, don't supply single-number predictions (point estimates). Here is a "proof without pictures" that point estimates are difficult to interpret.

[Image: why point estimates are difficult to interpret]

In the community garden called Microprediction.Org (which you are asked to treat with loving respect), a distributional forecast comprises a vector of 225 carefully chosen floating-point numbers. An algorithm submitting a forecast supplies three things:

  1. The name of a stream
  2. A delay/horizon parameter, chosen from four possibilities: {70s, 310s, 910s, 3555s}
  3. A collection of 225 numbers.

How should the 225 numbers be interpreted? Know that the system will add a small amount of Gaussian noise to each number. Know that rewards for the algorithm will be based on how close the noisy numbers are to the truth. Then do a little game theory (maybe portfolio theory) and come to your own precise interpretation. I offer the vague interpretation that your 225 points represent a kernel estimate of a distribution. In the future we may allow weighted scenarios.
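By way of illustration, here is a minimal sketch of a submission using the microprediction Python package. The write key and the scenario-generating logic are placeholders, and client method details may differ slightly between package versions:

    import numpy as np
    from microprediction import MicroWriter

    mw = MicroWriter(write_key='YOUR_WRITE_KEY')   # placeholder write key

    # 1. the name of a stream, 2. a quarantine delay, 3. a vector of 225 scenario values
    scenarios = [float(x) for x in np.random.normal(loc=1.8, scale=0.1, size=225)]  # crude kernel-style guess
    mw.submit(name='three_body_z.json', values=scenarios, delay=3555)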

Next: the interpretation of the delay parameter.

Quarantined predictions

The algorithms are firing off distributional predictions of a stream. Let's be more precise.

  • Morally a distributional prediction at Microprediction.Org comprises a vector of 225 numbers suggestive of the value that will be taken by a data point at some time in the future...say 5 minutes from now or 1 hour from now.
  • However, when making the distributional prediction, the exact time of arrival of future data points is not known by the algorithms, but must be estimated. Thus it would be more precise to say the distributional prediction applies not to a fixed time horizon but rather to the time of next arrival of a data point after some elapsed interval.

Let us pick a delay of 3555 seconds for illustration (45 seconds shy of one hour). If the data seems to be arriving once every 90 minutes, and arrived most recently at noon, it is fair to say that a set of scenarios submitted at 12:15pm can be interpreted as a collection of equally weighted scenarios for the value that will (probably) be revealed at 1:30pm (and is thus a 75 minute ahead forecast, morally speaking).

The system doesn't care about the interpretation. When a new data point arrives at 1:34pm, it looks for all predictions that were submitted at least 3555 seconds earlier, i.e. no later than 12:34:45pm. Those distributional predictions qualify to be included in a reward calculation. Each algorithm is then scored based on how many of its submitted values are close to the revealed truth.
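As a sanity check on the arithmetic, here is a tiny sketch of the eligibility rule using the times and the 3555 second delay from the example (illustrative only, not the exchange's actual implementation):

    from datetime import datetime, timedelta

    delay = timedelta(seconds=3555)               # chosen quarantine period
    arrival = datetime(2020, 7, 1, 13, 34, 0)     # new data point arrives at 1:34pm
    cutoff = arrival - delay                      # 12:34:45pm
    submitted = datetime(2020, 7, 1, 12, 15, 0)   # prediction submitted at 12:15pm

    eligible = submitted <= cutoff                # True: quarantined long enough to be rewarded
    print(cutoff.time(), eligible)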

Aside: bespoke real-time prediction has arrived

Pretty simple stuff. If you want forecasts of a live number, you just publish it using the API or the microprediction Python library. Then, after a few hours or days or weeks of you doing nothing, you get pretty accurate distributional forecasts at various horizons. Those horizons are, as noted, roughly 1 minute ahead, 5 minutes ahead, 15 minutes ahead and 60 minutes ahead. However, if you publish once a day you will in effect receive a lot of day-ahead predictions, as many of the algorithms make their submissions soon after a data point is received.
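Publishing your own stream is similarly lightweight. A minimal sketch with the microprediction package (the stream name and write key are placeholders):

    from microprediction import MicroWriter

    mw = MicroWriter(write_key='YOUR_WRITE_KEY')      # placeholder write key
    # Each call appends one float to the named stream; repeat on whatever schedule suits you
    mw.set(name='my_gauge_reading.json', value=21.7)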


The collective result is summarized as a collection of four community generated cumulative distribution functions (CDFs) you can view or pull data from. The bad algorithms give up or get kicked out, and better ones arrive. The CDF gets more accurate over time as algorithms (and people) find relevant exogenous data.

Community implied z-scores

The community implied CDF implies a percentile for each arriving data point. Let's suppose it has surprised the algorithms on the high side and so the percentile is 0.72 say. We call 0.72 the community implied percentile.

It will be apparent to the reader that a community implied percentile must be defined relative to some choice of quarantine period. For example, the data point might be a big surprise relative to the one-hour-ahead prediction, but less so compared to forecasts that have not been quarantined as long. The reverse can also be true. At Microprediction.Org two community percentiles are computed: one using forecasts delayed more than a minute (actually 70 seconds) and one using forecasts delayed roughly an hour (actually 3555 seconds).

Next, we define a community z-score as the inverse normal cumulative distribution function applied to the community implied percentile. This is a bit of a misnomer, as z-scores often refer to a different, rather crude standardization of data that assumes it is normally distributed. Here, in contrast, we are using the community to define a distributional transform. If the community of human and artificial life is good at making distributional predictions, the z-scores will actually be normally distributed.
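In code the transform is a one-liner (scipy's norm.ppf is the inverse normal CDF):

    from scipy.stats import norm

    community_percentile = 0.72                    # the community implied percentile from above
    community_z = norm.ppf(community_percentile)   # approximately 0.583
    print(community_z)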

Or not. There are lots of intelligent people and algorithms in this world who believe, to the contrary, that they are able to make distributional predictions about other people's distributional predictions. Some people even go so far as to suggest that they can make unconditional distributional predictions (tails that are too thin - always). Good for them. Now they have a chance to prove this hypothesis, or much more subtle ones.

That's because each community z-score at Microprediction.Org is treated as a live data point in its own right - a data point that appends to its own stream which is, like any other stream, the target of quarantined distributional predictions. So, do you think you can spot deviation from the normal distribution in these community z-scores for South Australian electricity prices?

[Image: community z-scores for South Australian electricity prices]

I would tend to agree with you ... and you may be a few lines of Python away from a great statistical triumph. Godspeed.

Community implied z-curves

Soliciting turnkey univariate predictions from a swarm of competing quasi-human life forms will soon be the norm, I dare say. Let's face it, why would you do anything crazy like hire a data scientist - a proposition with vastly greater cost and far less promising asymptotic properties?

However univariate need not be the end of the story. If you care about dependencies between quantities then check out the novel approach taken at Microprediction.Org where space filling curves fold combinations of community implied percentiles back into univariate streams. For instance, look at the stream called

z2~size~usmv~3555        

and the table of past values:

[Image: lagged values of z2~size~usmv~3555]

You could use a univariate algorithm to provide 225 guesses of what the next number might be in the sequence (i.e. 225 guesses of the fifth number in the sequence 0.17791, -1.9669, 0.48892, 0.1782, ?). But if you are smart you'll notice that this sequence can be unfolded into pairs of numbers. Similarly, if you were to look at:

z3~goog~jnj~nke~3555        

you will discover that it is really a sequence of three-tuples masquerading as a univariate sequence.

Here is how these z2~ and z3~ streams are created. We take community implied percentiles for two (respectively three) streams as defined above. We then do the following:

  1. Rescale the community percentiles
  2. Convert to binary representation
  3. Interleave the digits in the binary representation
  4. Convert back, and scale again
  5. Apply the inverse normal distribution function

As with the z-scores, the last step ensures that the resulting stream is approximately normally distributed. We term the two and three dimensional counterparts of z-scores "z-curves", to distinguish them from the univariate case. The digit interleaving in step 3 is the classic Morton (z-order) construction, so the mapping can be pictured as a space filling curve threading its way through the square or the cube.
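Here is a rough sketch of the construction in Python for the bivariate (z2~) case. The real rescaling, digit and rounding conventions live in the microconventions package; the bit count, scaling, and example percentiles below are assumptions for illustration only:

    from scipy.stats import norm

    def interleave_bits(u, v, bits=16):
        """Fold two percentiles in [0,1] into one number in [0,1) by Morton-style
        bit interleaving. Illustrative only; microconventions fixes the real details."""
        iu = int(u * (2 ** bits - 1))          # steps 1 & 2: rescale and move to a binary representation
        iv = int(v * (2 ** bits - 1))
        z = 0
        for k in range(bits):                  # step 3: interleave the digits
            z |= ((iu >> k) & 1) << (2 * k + 1)
            z |= ((iv >> k) & 1) << (2 * k)
        return z / float(2 ** (2 * bits))      # step 4: convert back and rescale to [0,1)

    # Community implied percentiles for two streams (hypothetical values)
    p_size, p_usmv = 0.72, 0.31
    folded = interleave_bits(p_size, p_usmv)
    z2_value = norm.ppf(folded)                # step 5: inverse normal distribution function
    print(z2_value)

    # The fold is invertible: de-interleaving the bits of `folded` recovers
    # (approximately) the original pair, which is why the z2~ stream can be
    # "unfolded" into pairs as mentioned earlier.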

A remark on Sklar's Theorem

We have established algorithm smackdowns on multiple levels:

  1. At the level of the primary stream of data, at multiple horizons
  2. On implied z-scores individually
  3. On joint behavior relative to community predictions

Since algorithms can also solicit predictions, this is not the only way to stack. However, as it stands, the z-curve setup is somewhat reminiscent of Sklar's Theorem. Sklar's Theorem states loosely that the distribution of a multivariate random variable can be decomposed into:

  • Univariate margins
  • A Copula function

where for our purposes a copula is synonymous with a joint distribution on the square or the cube. Sklar's Theorem is "obvious" modulo technicalities, in the sense that any variable can be converted to uniform by applying its own (cumulative) distribution function. Thus generation of a multivariate random variable can be controlled by a throw of a continuous die taking values in a cube. Each coordinate can be transformed by application of the inverse cumulative distribution of the margin.
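In symbols, for the bivariate case: if F is the joint distribution function of (X, Y) with margins F_X and F_Y, Sklar's Theorem supplies a copula C satisfying

    F(x, y) = C( F_X(x), F_Y(y) )

and, going the other way, drawing (U, V) from C and setting X = F_X^{-1}(U), Y = F_Y^{-1}(V) reproduces the joint law.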

But what about the space filling curves? I have not been able to dig up uses of space filling curves as a means of describing Copula functions, and at some level this seems imperfect. But by folding bivariate and trivariate community prediction back into univariate there are some compensating pragmatic gains to be had.

Whether you view the packing into one dimension as a lossy technological convenience or something more is up to you. There are also some interesting and, I think, understudied aspects to this. The reader might wish to contemplate the approximate analytical relationship between the correlation of two random variables (however that is parametrized) and the variance or volatility of their z-curve. For instance, a bivariate normal with correlation 30% yields roughly 15% excess standard deviation over a standard normal. It's roughly a rule of two: the excess standard deviation is about half the correlation.
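If you want to poke at that relationship yourself, here is a rough Monte Carlo sketch reusing the illustrative interleaving from the previous snippet (again an approximation of the real conventions, not the exchange's implementation):

    import numpy as np
    from scipy.stats import norm

    def interleave_bits(u, v, bits=16):
        # Same illustrative Morton-style fold as in the earlier sketch
        iu, iv = int(u * (2 ** bits - 1)), int(v * (2 ** bits - 1))
        z = 0
        for k in range(bits):
            z |= ((iu >> k) & 1) << (2 * k + 1)
            z |= ((iv >> k) & 1) << (2 * k)
        return z / float(2 ** (2 * bits))

    rho, n = 0.30, 100000
    xy = np.random.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
    u, v = norm.cdf(xy[:, 0]), norm.cdf(xy[:, 1])    # percentiles of the two margins
    z2 = norm.ppf([interleave_bits(a, b) for a, b in zip(u, v)])
    print(np.std(z2))    # compare with 1.0, the standard deviation when rho = 0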

What remains a matter of experiment is whether arbitraging algorithms can bring Sklar's Theorem to life in an effective and visceral manner, and whether the separation of concerns suggested by Sklar's Theorem is useful, or not, when it comes to determining accurate higher dimensional probabilistic short term forecasts.

This question is particularly pertinent for quantities such as stocks, where some moments (the stock margins) are traded explicitly but many are not (most volatility is not even directly traded). The intraday dependence structure between style investing factors (like size, value, momentum and so forth) is a subtle but very important thing in fund management - so involving a diversity of algorithms and perspectives seems prudent, as does not expecting any one algorithm or person to solve the puzzle in its entirety.

You may not care about stocks and that's fine. There isn't a lot to prevent one algorithm finding its way from stocks to train delays to weather in Seattle. You can derive from the MicroCrawler class to advance a new kind of algorithm reuse and cross-subsidy.

Nuts and bolts...

The best specification of the precise conventions for z-curves (and also the naming conventions that help you navigate the hundreds of streams at Microprediction.Org) is the microconventions package on GitHub or PyPI. It is used by the microprediction package, which you can use to submit predictions or solicit them. There is a tiny example package called echochamber on PyPI which demonstrates use of the latter.
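For instance, a minimal sketch of pulling recent values of a stream (method names as I recall them from the microprediction client; check the package docs if they have moved):

    from microprediction import MicroReader

    mr = MicroReader()
    lagged = mr.get_lagged_values(name='three_body_z.json')   # recent values of the stream
    print(lagged[:5])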

Oh, and I didn't want to mention this to anyone who is not absolutely fascinated by multivariate distributional prediction, but since you read this far let me add that there are some cash incentives (totalling $4,000 in July) for closed and open source algorithms, and for other kinds of contribution as well. Here's another subtle reminder (aside: name the mathematician/philosopher patron saint of the epidemic stream to win exactly nothing).

[Image: the epidemic stream]

See you at Microprediction.Org



About Me

Hi, I'm the author of Microprediction: Building an Open AI Network, published by MIT Press. I create open-source Python packages such as timemachines, precise and humpday for benchmarking, and I maintain a live prediction exchange at www.microprediction.org which you can participate in (see the docs). I also develop portfolio techniques for Intech Investments unifying hierarchical and optimization perspectives.

Gareth Moody

Sales @ Empirasign - Fixed Income Market Data, Parser on Demand, IG / HY, Loans, Structured


Nobody could have predicted this.
