Ground Truth Aggregation
[Note: this post is the latest in my series of “startups that weren’t.” You can read more about other deceased ideas in the idea graveyard and read this post natively on my blog.]
Not too long ago, quant research at hedge funds was one of the few places time series data was analyzed at scale. But with the flood of telemetry from probes, remote sensing and transactional/log data of all stripes, the use of novel analytic/big data/machine learning techniques for predictive purposes has exploded. To explain this opportunity I am using movement/location data, but other flavors of transactional data apply as well; it is simply the easiest one for me to explain, with 10+ years of experience in geo.
At the risk of being overly pedantic, models serve to approximate reality. Data can help a model better reflect reality once it is developed, but without additional input, the model isn’t likely to magically improve. In the context of spatially-referenced things (I dislike the term GIS as it carries a lot of baggage), what one observes may not be the totality of what exists. When applying geotech to a specific application, vertical, process, etc., this question may be of paramount importance, or not.
To understand any market with precision, ground truth must be employed to calibrate models. The reason is straightforward: unless your panel (in this case, likely location events from an SDK) sees the complete universe of offline activity, bias is present. This bias can be demographic, technical or behavioral, and it can range from a non-issue to a product-killer.
Why does one care about bias? If you want to make claims about a population, you’ll be sampling. How does your panel stack up against the overall US population in terms of demographics, mobility patterns and technology? If the panel is skewed with 90% iPhone users, be aware. If the panel is predominantly suburban, be aware. If location events are logged at arbitrary intervals of 7-12 hours, be aware. If the panel is only 500,000 monthly active users, be aware.
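To make the point concrete, here is a minimal sketch of how one might quantify that skew against a reference population and derive simple post-stratification weights. The platform shares below are made up for illustration, not real figures.

```python
# Minimal sketch: quantify panel skew against a reference population.
# The category shares below are illustrative, not real figures.
import pandas as pd

panel = pd.Series({"iOS": 0.90, "Android": 0.10}, name="panel_share")
population = pd.Series({"iOS": 0.55, "Android": 0.45}, name="population_share")

skew = pd.concat([panel, population], axis=1)
skew["representation_ratio"] = skew["panel_share"] / skew["population_share"]
# A simple post-stratification weight is the inverse of the representation ratio.
skew["poststrat_weight"] = 1.0 / skew["representation_ratio"]
print(skew)
```

The same comparison works for any dimension you can observe in both the panel and a trusted reference: device type, urban/suburban/rural, age bands, and so on.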
The good news is these details matter far less outside of alpha-exploiting quant funds. In retail and CPG, indices, period-on-period change, share of wallet and market share are about as sophisticated as one needs to be. Easy-to-understand examples of this are the reports from xAd, PlaceIQ, Foursquare and others. But when absolute precision is called for, it isn’t possible to hide behind binned values. Enter the world of ground truth.
What is Ground Truth, and Why Do I Care?
In essence, ground truth is an indisputable record of fact: the number of airplanes that take off from SFO in a given day, the soybean yield in Ellis County, Kansas for a given harvest, offramp traffic on Exit 261A in Florida over a holiday weekend, visitors to a Lidl supermarket the Saturday prior to Labor Day, 500mb pressure maps for next week, the US Bureau of Labor Statistics Non-farm Payroll report for August, or any number of other discrete measurements. Because models can only be as good as the data that goes into them, historical data is invaluable, and more is better. It is also harder to come by than you may think. As an aside, don’t forget the difference between forecasting/predicting and measuring/reporting.
Historically, ground truth has come from independent sources like government and NGOs. Increasingly it can come from private actors, but there are relatively few places where this data can be captured; sporting venues, public parks and travel nodes are obvious candidates. However, because back-testing and model training in machine learning rely critically on historical data, a ground truth source identified today may not be usable for 18 months or more (depending on the number of observations, industry focus, required precision and other factors). So the sooner one can aggregate sources of ground truth (and surreptitiously convince others to log new ones), the sooner this data will be of use in model development.
In short, the business idea was to aggregate and track granular sources of ground truth to allow development of more robust/accurate models that can be used for reporting and forecasting. I looked at the opportunity from a few perspectives and they are outlined below. Later in this post I evaluate the good, bad and ugly, and my conclusions are (naturally) at the end.
[Ed: When xAd rebranded as GroundTruth I couldn’t help but wince, as they are most definitely not offering services that provide ground truth, per the above.]
Product
Something about this opportunity spoke to me, likely a result of a decade in geo, where without data, you are nowhere. At Urban Mapping we’d often create our own data products by stringing together public record requests and performing lots of normalization/ETL to get the data to ‘behave’. I enjoy going down a wormhole to find and create new ones (like a database of all freestanding USPS blue post boxes, complete with attributes!).
This product would entail identifying many ground truth sources, constantly filing/appealing state and federal public record requests, ingesting historical data and capturing updates. There is also an emerging category of private actors who traffic in what I refer to as ‘synthetic ground truth’ and are a potential gold mine of historical data. I had a reasonably good handle on what data would be valuable for several markets, so sources dealing with location data were a good place to start.
Knowing that enterprise data consumers are far from uniform (some insist on raw transaction logs, others want something more refined/normalized, and some want polished reports/summaries viewed in Tableau or something similar), it would be critical to offer versioning capabilities. Ground truth data would be addressable by attributes such as observation period, geography, feature type and other metadata.
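As a rough sketch, a versioned record keyed on those attributes might look like the following. The names, fields and values are purely illustrative, not a real schema or source.

```python
# Hypothetical sketch of a versioned ground truth record, keyed on the
# attributes mentioned above. All names and values are illustrative.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class GroundTruthRecord:
    source: str          # e.g. a state DOT traffic counter program (hypothetical)
    feature_type: str    # e.g. "exit_ramp_vehicle_count"
    geography: str       # e.g. a station or exit identifier
    period_start: date
    period_end: date
    value: float
    revision: int        # bumped whenever the source restates a figure

rec = GroundTruthRecord(
    source="FL_DOT_count_station",      # hypothetical source name
    feature_type="exit_ramp_vehicle_count",
    geography="exit_261A",
    period_start=date(2017, 9, 1),
    period_end=date(2017, 9, 4),
    value=48210.0,                      # made-up value
    revision=2,
)
print(rec)
```

Keeping revisions explicit matters because official sources restate figures; a back-test run against the revised series can look very different from one run against what was actually known at the time.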
Market
The market I was most familiar with is the buy side in financial services: precision is of utmost importance to hedge funds, and as one moves through decreasing levels of asset volatility by manager (asset management, equity analysis, mutual funds, pension funds, etc.), the degree of required precision drops. The reason is pretty simple: a mutual fund will typically hold a position far longer than a hedge fund, so volatility is absorbed by the broader market. The intra-day/minute/second opportunities are not what low-volatility investors are looking for. With capital on the line and a specific trading strategy to deploy it, quant-oriented funds embrace these fluctuations where signal can be exploited.
In broad brush strokes, I think of the buy side along axes of precision and accuracy. The important thing to note, which few outside finance seem to understand, is that the buy side is incredibly diverse in terms of sophistication. “Hedge fund” is getting closer to a meaningless term, as it tends to intoxicate the mind with ideas of high finance. The number of actors capable of ingesting raw data is very small. And funds can be difficult to work with for a variety of reasons, but if one is able to cross the chasm and emerge as table stakes, like 1010data, you are fortunate.
Because of limited (and varying) sophistication, much of the broader market may not have a statistically grounded need to incorporate ground truth. As the customer base widens, the requirements become more flexible. Hence the looming question: how large is this market?
Competition/Staying Power
While the business opportunity offers many of the things I love (exploring new markets, creating data arbitrage opportunities and building second-order data products), it is fundamentally a data aggregation/normalization play. I am skeptical of the barriers to entry, though privileged relationships with sources of truth (quantity and/or quality) and hyper-efficient ETL can be defensible. To generate revenue, sophisticated customers would have to find that aggregated ground truth sufficiently increases confidence in their models. Backtesting and sensitivity analysis are a PhD dissertation on their own, so the potential R&D effort to get to a validation point could be significant.
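To show the shape of that validation exercise, here is a toy back-test on synthetic numbers: rescale a biased, noisy panel estimate against a short window of ground truth and check whether out-of-sample error drops. It is a sketch of the experiment design, not evidence about any real panel.

```python
# Toy back-test on synthetic data: does calibrating a biased panel estimate
# against a short window of ground truth reduce out-of-sample error?
import numpy as np

rng = np.random.default_rng(0)
true_visits = rng.uniform(800, 1600, size=52)                  # weekly "ground truth" (synthetic)
panel_estimates = 0.3 * true_visits * rng.normal(1, 0.1, 52)   # panel sees ~30% of visits, with noise

# Calibrate a single scale factor on the first 12 weeks where ground truth exists.
scale = true_visits[:12].mean() / panel_estimates[:12].mean()
calibrated = panel_estimates * scale

def mape(y, yhat):
    """Mean absolute percentage error."""
    return np.mean(np.abs(y - yhat) / y)

print("raw panel MAPE:       ", mape(true_visits[12:], panel_estimates[12:]))
print("calibrated panel MAPE:", mape(true_visits[12:], calibrated[12:]))
```

A real validation would need to handle revisions, seasonality, panel churn and many other effects, which is exactly why the R&D burden looked significant.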
As trust is developed with key customers, the plan would be to co-discover what complementary data/services would be of value; ground truth that supports alpha hunting in different industries and geographies is an obvious place to start.
Sales and Distribution
I’m enamored with the adoption of iPython/Jupyter in the data science community, and selling an interface directly to the people who know what they want sounds compelling. I’ll call this the Twilio model; it can be great until you have higher-ticket items that can’t be provisioned by a dev. However, technical users are generally not the buyers in financial services. Those who tend to be less technical control the budgets, so channel awareness becomes important, with some kind of limited distribution arrangement through large financial services information providers as a sort of freemium model.
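The appeal of the notebook-first model, in sketch form: a data scientist pulls a slice of ground truth with a few lines of code. The package, endpoint and parameters below are entirely hypothetical and exist only to illustrate the workflow.

```python
# Purely hypothetical client sketch; the endpoint, parameters and payload
# shape do not exist and are shown only to illustrate a notebook-first workflow.
import pandas as pd
import requests

resp = requests.get(
    "https://api.example-groundtruth.com/v1/observations",  # hypothetical endpoint
    params={
        "feature_type": "exit_ramp_vehicle_count",
        "geography": "exit_261A",
        "start": "2017-09-01",
        "end": "2017-09-04",
        "revision": "latest",
    },
    timeout=30,
)
df = pd.DataFrame(resp.json()["observations"])  # assumes a JSON list of observation records
print(df.head())
```

That kind of self-serve access works for the small set of technical buyers; everyone else would need the channel and reporting layers described above.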
What To Do?
For the time being anyway, this isn’t something I am pursuing. I’m at least partially guilty of over-analyzing before I get into market, or perhaps it just wasn’t exciting enough to me. The clear unbridled rush to get anything alt-data related into a pipeline is a gravy train, but that doesn’t mean it is a good one to jump on. Asking customers to take a chance on something which could increase model confidence (maybe) 10% in the abstract might be worth it, but when stacked against other opportunities, it might not be.
This is an area of data science that I find especially interesting; the more we can do to tame spurious correlations, the better, but it involves a great deal of experimentation and trial and error, something industry is less excited to support.
I’d love to hear any thoughts, feedback or criticisms you have!