An Introduction to Z-Streams (and Collective Microprediction)
Sneak preview of a new skin for Microprediction.Org. This post explains the cubes.


Microprediction.Org is an easy way to solicit high quality short term forecasts of any live quantity of your choosing.

The site is also a kind of "Tough-Mudder" for tailored and general purpose prediction algorithms. Their authors, who might be accustomed to data science contests, are challenged in new ways. They face statistical and pragmatic issues shared by real-world front-office quant jobs, tech jobs, manufacturing, transport, industrial control, or anywhere live systems are maintained and optimized. And if you are a data scientist arriving at Microprediction.Org you may encounter something you haven't seen before: z-streams. This short post explains their role, and the more elementary prerequisite notions of streams and quarantined distributional predictions.

Outline

  • Streams
  • Distributional predictions
  • Quarantined distributional predictions
  • Community implied z-scores
  • Community implied z-curves using embeddings from [0,1]^2 -> R and [0,1]^3 -> R
  • A remark on Sklar's theorem

Streams

A stream is simply a time series of scalar (float) data created by someone who repeatedly publishes a single number. It is a public, live moving target for community contributed prediction algorithms.

Here are the names of three streams you can readily locate at Microprediction.Org (or for the AI life forms reading this, at api.microprediction.org).

three_body_x.json
three_body_y.json
three_body_z.json        

Here is a home page for the third time series showing lagged values.

[Image: stream page for three_body_z.json showing lagged values]

Note the leaderboard of prediction algorithms. They are supplying distributional predictions.

Distributional predictions

Algorithms living at Microprediction.Org, or should I say interacting with it via API, don't supply single-number predictions (point estimates). Here is a "proof without pictures" that point estimates are difficult to interpret.

[Image: why point estimates are difficult to interpret]

In the community garden called Microprediction.Org (which you are asked to treat with loving respect), a distributional forecast comprises a vector of 225 carefully chosen floating-point numbers. An algorithm submitting a forecast supplies three things:

  1. The name of a stream
  2. A delay/horizon parameter, chosen from four possibilities: {70s, 310s, 910s, 3555s}
  3. A collection of 225 numbers.

How should the 225 numbers be interpreted? Know that the system will add a small amount of Gaussian noise to each number. Know that rewards for the algorithm will be based on how close the noisy numbers are to the truth. Then do a little game theory (maybe portfolio theory) and come to your own precise interpretation. I offer the vague interpretation that your 225 points represent a kernel estimate of a distribution. In the future we may allow weighted scenarios.
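By way of illustration, here is a minimal sketch of a submission using the microprediction Python package. The write key and the scenario-generating logic are placeholders, and client method details may differ slightly between package versions:

    import numpy as np
    from microprediction import MicroWriter

    mw = MicroWriter(write_key='YOUR_WRITE_KEY')   # placeholder write key

    # 1. the name of a stream, 2. a quarantine delay, 3. a vector of 225 scenario values
    scenarios = [float(x) for x in np.random.normal(loc=1.8, scale=0.1, size=225)]  # crude kernel-style guess
    mw.submit(name='three_body_z.json', values=scenarios, delay=3555)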

Next: the interpretation of the delay parameter.

Quarantined predictions

The algorithms are firing off distributional predictions of a stream. Let's be more precise.

  • Morally a distributional prediction at Microprediction.Org comprises a vector of 225 numbers suggestive of the value that will be taken by a data point at some time in the future...say 5 minutes from now or 1 hour from now.
  • However, when making the distributional prediction, the exact time of arrival of future data points is not known by the algorithms, but must be estimated. Thus it would be more precise to say the distributional prediction applies not to a fixed time horizon but rather to the time of next arrival of a data point after some elapsed interval.

Let us pick a delay of 3555 seconds for illustration (45 seconds shy of one hour). If the data seems to be arriving once every 90 minutes, and arrived most recently at noon, it is fair to say that a set of scenarios submitted at 12:15pm can be interpreted as a collection of equally weighted scenarios for the value that will (probably) be revealed at 1:30pm (and is thus a 75 minute ahead forecast, morally speaking).

The system doesn't care about the interpretation. When a new data point arrives at 1:34pm, it looks for all predictions that were submitted at least 3555 seconds earlier, i.e. no later than 12:34:45pm. Those distributional predictions qualify to be included in a reward calculation. Each algorithm is then scored based on how many of its submitted values are close to the revealed truth.
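As a sanity check on the arithmetic, here is a tiny sketch of the eligibility rule using the times and the 3555 second delay from the example (illustrative only, not the exchange's actual implementation):

    from datetime import datetime, timedelta

    delay = timedelta(seconds=3555)               # chosen quarantine period
    arrival = datetime(2020, 7, 1, 13, 34, 0)     # new data point arrives at 1:34pm
    cutoff = arrival - delay                      # 12:34:45pm
    submitted = datetime(2020, 7, 1, 12, 15, 0)   # prediction submitted at 12:15pm

    eligible = submitted <= cutoff                # True: quarantined long enough to be rewarded
    print(cutoff.time(), eligible)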

Aside: bespoke real-time prediction has arrived

Pretty simple stuff. If you want forecasts of a live number, you just publish it using the API or the microprediction Python library. Then, after a few hours or days or weeks of you doing nothing, you get pretty accurate distributional forecasts at various horizons. Those horizons are, as noted, roughly 1 minute ahead, 5 minutes ahead, 15 minutes ahead and 60 minutes ahead. However, if you publish once a day you will in effect receive a lot of day-ahead predictions, as many of the algorithms make their submissions soon after a data point is received.
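Publishing your own stream is similarly lightweight. A minimal sketch with the microprediction package (the stream name and write key are placeholders):

    from microprediction import MicroWriter

    mw = MicroWriter(write_key='YOUR_WRITE_KEY')      # placeholder write key
    # Each call appends one float to the named stream; repeat on whatever schedule suits you
    mw.set(name='my_gauge_reading.json', value=21.7)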


The collective result is summarized as a collection of four community generated cumulative distribution functions (CDFs) you can view or pull data from. The bad algorithms give up or get kicked out, and better ones arrive. The CDF gets more accurate over time as algorithms (and people) find relevant exogenous data.

Community implied z-scores

The community implied CDF implies a percentile for each arriving data point. Let's suppose it has surprised the algorithms on the high side and so the percentile is 0.72 say. We call 0.72 the community implied percentile.

It will be apparent to the reader that a community implied percentile must be defined relative to some choice of quarantine period. For example, the data point might be a big surprise relative to the one-hour-ahead prediction, but less so compared to forecasts that have not been quarantined as long. The reverse can also be true. At Microprediction.Org two community percentiles are computed: one using forecasts delayed more than a minute (actually 70 seconds) and one using forecasts delayed roughly an hour (actually 3555 seconds).

Next, we define a community z-score as the inverse normal cumulative distribution function applied to the community implied percentile. This is a bit of a misnomer, as z-scores often refer to a different, rather crude standardization of data that assumes it is normally distributed. Here, in contrast, we are using the community to define a distributional transform. If the community of human and artificial life is good at making distributional predictions, the z-scores will actually be normally distributed.
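In code the transform is a one-liner (scipy's norm.ppf is the inverse normal CDF):

    from scipy.stats import norm

    community_percentile = 0.72                    # the community implied percentile from above
    community_z = norm.ppf(community_percentile)   # approximately 0.583
    print(community_z)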

Or not. There are lots of intelligent people and algorithms in this world who believe, to the contrary, that they are able to make distributional predictions about other people's distributional predictions. Some people even go so far as to suggest that they can make unconditional distributional predictions (tails that are too thin - always). Good for them. Now they have a chance to prove this hypothesis, or much more subtle ones.

That's because each community z-score at Microprediction.Org is treated as a live data point in its own right - a data point that appends to its own stream which is, like any other stream, the target of quarantined distributional predictions. So, do you think you can spot deviation from the normal distribution in these community z-scores for South Australian electricity prices?

[Image: community z-scores for South Australian electricity prices]

I would tend to agree with you ... and you may be a few lines of Python away from a great statistical triumph. Godspeed.

Community implied z-curves

Soliciting turnkey univariate predictions from a swarm of competing quasi-human life forms will soon be the norm, I dare say. Let's face it, why would you do anything crazy like hire a data scientist - a proposition with vastly greater cost and far less promising asymptotic properties?

However univariate need not be the end of the story. If you care about dependencies between quantities then check out the novel approach taken at Microprediction.Org where space filling curves fold combinations of community implied percentiles back into univariate streams. For instance, look at the stream called

z2~size~usmv~3555        

and the table of past values:

[Image: lagged values of z2~size~usmv~3555]

You could use a univariate algorithm to provide 225 guesses of what the next number might be in the sequence (i.e. 225 guesses of the fifth number in the sequence 0.17791, -1.9669, 0.48892, 0.1782, ?). But if you are smart you'll notice that this sequence can be unfolded into pairs of numbers. Similarly, if you were to look at:

z3~goog~jnj~nke~3555        

you will discover that it is really a sequence of three-tuples masquerading as a univariate sequence.

Here is how these z2~ and z3~ streams are created. We take community implied percentiles for two (respectively three) streams as defined above. We then do the following:

  1. Rescale the community percentiles
  2. Convert to binary representation
  3. Interleave the digits in the binary representation
  4. Convert back, and scale again
  5. Apply the inverse normal distribution function

As with the z-scores, the last step ensures that the resulting stream is approximately normally distributed. We term the two and three dimensional counterparts of z-scores "z-curves", to distinguish them from the univariate case. The digit interleaving in step 3 is the classic Morton (z-order) construction, so the mapping can be pictured as a space filling curve threading its way through the square or the cube.
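Here is a rough sketch of the construction in Python for the bivariate (z2~) case. The real rescaling, digit and rounding conventions live in the microconventions package; the bit count, scaling, and example percentiles below are assumptions for illustration only:

    from scipy.stats import norm

    def interleave_bits(u, v, bits=16):
        """Fold two percentiles in [0,1] into one number in [0,1) by Morton-style
        bit interleaving. Illustrative only; microconventions fixes the real details."""
        iu = int(u * (2 ** bits - 1))          # steps 1 & 2: rescale and move to a binary representation
        iv = int(v * (2 ** bits - 1))
        z = 0
        for k in range(bits):                  # step 3: interleave the digits
            z |= ((iu >> k) & 1) << (2 * k + 1)
            z |= ((iv >> k) & 1) << (2 * k)
        return z / float(2 ** (2 * bits))      # step 4: convert back and rescale to [0,1)

    # Community implied percentiles for two streams (hypothetical values)
    p_size, p_usmv = 0.72, 0.31
    folded = interleave_bits(p_size, p_usmv)
    z2_value = norm.ppf(folded)                # step 5: inverse normal distribution function
    print(z2_value)

    # The fold is invertible: de-interleaving the bits of `folded` recovers
    # (approximately) the original pair, which is why the z2~ stream can be
    # "unfolded" into pairs as mentioned earlier.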

A remark on Sklar's Theorem

We have established algorithm smackdowns on multiple levels:

  1. At the level of the primary stream of data, at multiple horizons
  2. On implied z-scores individually
  3. On joint behavior relative to community predictions

Since algorithms can also solicit predictions, this is not the only way to stack. However, as it stands, the z-curve setup is somewhat reminiscent of Sklar's Theorem. Sklar's Theorem states loosely that the distribution of a multivariate random variable can be decomposed into:

  • Univariate margins
  • A Copula function

where for our purposes a copula is synonymous with a joint distribution on the square or the cube. Sklar's Theorem is "obvious" modulo technicalities, in the sense that any variable can be converted to uniform by applying its own (cumulative) distribution function. Thus generation of a multivariate random variable can be controlled by a throw of a continuous die taking values in a cube. Each coordinate can be transformed by application of the inverse cumulative distribution of the margin.
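In symbols, for the bivariate case: if F is the joint distribution function of (X, Y) with margins F_X and F_Y, Sklar's Theorem supplies a copula C satisfying

    F(x, y) = C( F_X(x), F_Y(y) )

and, going the other way, drawing (U, V) from C and setting X = F_X^{-1}(U), Y = F_Y^{-1}(V) reproduces the joint law.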

But what about the space filling curves? I have not been able to dig up uses of space filling curves as a means of describing Copula functions, and at some level this seems imperfect. But by folding bivariate and trivariate community prediction back into univariate there are some compensating pragmatic gains to be had.

Whether you view the packing into one dimension as a lossy technological convenience or something more is up to you. There are also some interesting and, I think, understudied aspects to this. The reader might wish to contemplate the approximate analytical relationship between the correlation of two random variables (however that is parametrized) and the variance or volatility of their z-curve. For instance, a bivariate normal with correlation 30% yields roughly 15% excess standard deviation over a standard normal. It's roughly a rule of two: the excess standard deviation is about half the correlation.
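If you want to poke at that relationship yourself, here is a rough Monte Carlo sketch reusing the illustrative interleaving from the previous snippet (again an approximation of the real conventions, not the exchange's implementation):

    import numpy as np
    from scipy.stats import norm

    def interleave_bits(u, v, bits=16):
        # Same illustrative Morton-style fold as in the earlier sketch
        iu, iv = int(u * (2 ** bits - 1)), int(v * (2 ** bits - 1))
        z = 0
        for k in range(bits):
            z |= ((iu >> k) & 1) << (2 * k + 1)
            z |= ((iv >> k) & 1) << (2 * k)
        return z / float(2 ** (2 * bits))

    rho, n = 0.30, 100000
    xy = np.random.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
    u, v = norm.cdf(xy[:, 0]), norm.cdf(xy[:, 1])    # percentiles of the two margins
    z2 = norm.ppf([interleave_bits(a, b) for a, b in zip(u, v)])
    print(np.std(z2))    # compare with 1.0, the standard deviation when rho = 0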

What remains a matter of experiment is whether arbitraging algorithms can bring Sklar's Theorem to life in an effective and visceral manner, and whether the separation of concerns suggested by Sklar's Theorem is useful, or not, when it comes to determining accurate higher dimensional probabilistic short term forecasts.

This question is particularly pertinent for quantities such as stocks, where some moments (the stock margins) are traded explicitly but many are not (most volatility is not even directly traded). The intraday dependence structure between style investing factors (like size, value, momentum and so forth) is a subtle but very important thing in fund management - so involving a diversity of algorithms and perspectives seems prudent, as does not expecting any one algorithm or person to solve the puzzle in its entirety.

You may not care about stocks and that's fine. There isn't a lot to prevent one algorithm finding its way from stocks to train delays to weather in Seattle. You can derive from the MicroCrawler class to advance a new kind of algorithm reuse and cross-subsidy.

Nuts and bolts...

The best specification of the precise conventions for z-curves (and also the naming conventions that help you navigate the hundreds of streams at Microprediction.Org) is the microconventions package on GitHub or PyPI. It is used by the microprediction package, which you can use to submit predictions or solicit them. There is a tiny example package called echochamber on PyPI which demonstrates use of the latter.
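For instance, a minimal sketch of pulling recent values of a stream (method names as I recall them from the microprediction client; check the package docs if they have moved):

    from microprediction import MicroReader

    mr = MicroReader()
    lagged = mr.get_lagged_values(name='three_body_z.json')   # recent values of the stream
    print(lagged[:5])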

Oh, and I didn't want to mention this to anyone who is not absolutely fascinated by multivariate distributional prediction, but since you read this far let me add that there are some cash incentives (totalling $4,000 in July) for closed and open source algorithms, and for other kinds of contribution as well. Here's another subtle reminder (aside: name the mathematician/philosopher patron saint of the epidemic stream to win exactly nothing).

[Image: the epidemic stream]

See you at Microprediction.Org



About Me

Hi, I'm the author of Microprediction: Building an Open AI Network, published by MIT Press. I create open-source Python packages such as timemachines, precise and humpday for benchmarking, and I maintain a live prediction exchange at www.microprediction.org which you can participate in (see the docs). I also develop portfolio techniques for Intech Investments unifying hierarchical and optimization perspectives.

Gareth Moody

Sales @ Empirasign - Fixed Income Market Data, Parser on Demand, IG / HY, Loans, Structured


Nobody could have predicted this.
