Data Quality is your greatest ‘Edge’ in an otherwise Efficient-Leaning Marketplace

The Efficient Markets Hypothesis (EMH) asserts that financial markets swiftly incorporate all available information, reflecting it in asset prices. Reality largely bears this out – most market edges are quickly competed away as they become widely known and replicated. While the EMH suggests it is challenging to consistently outperform the market, it also implies that the quality of your data is vital to exploiting whatever inefficiencies remain. This mental model is very useful for thinking about 'data-driven' businesses and the opportunities they hope to find. Accurate, high-quality data is crucial for identifying mispriced assets and opportunities for gaining a competitive edge. If competitors possess superior data quality, or a more streamlined CI/CD pipeline for pushing richer models to production, they will outperform you and capture more (of your!) market share. Recognizing the importance of data quality, and investing in robust, data-science-led data infrastructure, is therefore imperative for modern businesses to compete effectively and safeguard their financial success. Data quality is not a 'nice to have' – it is a must-have for the modern data-driven business, structured as a cash-generation machine.


As an example, one common pitfall in data science is not paying sufficient attention to the definition of the target variable – this introduces noise into otherwise good models and dilutes their performance. A lack of clarity around business-process definitions can lead to major problems, including insidious ones such as target leaks, all of which ultimately impact profitability and performance. Target leaks occur when the target variable used for modeling includes information that would not be available at the time of prediction or decision-making. These leaks inflate the apparent accuracy of models and undermine their ability to generate reliable predictions in production. By taking the time to accurately define target variables, and ensuring they are aligned with the intended business outcomes, organizations can mitigate such inefficiencies, enhance model performance, and maximize profitability.

Worse, by not correctly defining your target or your model's event space, you may be solving the wrong question entirely. People assume they are measuring one thing when in fact they are tracking something quite different. A statistician's lens is required to define the predictors, the event space of the transactions or operations being modeled, and the target variables precisely enough to address the original business question. If you bring in inexperienced analysts, data engineers, or data scientists who can code well enough but have too little statistical domain knowledge to think properly about these definitions, that weakness carries over into the task of modeling the event space, and into exploiting the actual market inefficiency to derive a statistical edge. Businesses then complain that they aren't seeing the ROI the market promises for being 'data-driven'. This fine distinction can be the difference between dying by a thousand cuts and achieving a billion-dollar valuation, in a market that is very unforgiving. Being data-driven starts with correct hiring.
You may think you're saving money when you get a discount on your data scientist or data engineer, but you pay for it in downstream ROI, which may never materialize because you didn't hire correctly, or didn't follow through on the plan. Because knowledge and experience compound, a single employee at $200K can be more useful than two junior employees at $100K each. Saving $100K only to lose out on $1Bn is the definition of penny-wise and pound-foolish. That extra up-front cost may be critical to solid execution.
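To make the target-leak pitfall concrete, here is a minimal, hypothetical sketch (the default-prediction scenario and all field names are invented for illustration): a field that is only populated *after* the outcome occurs makes a model look perfect in backtesting, yet carries no signal at decision time.

```python
import random

random.seed(0)

# Hypothetical example: deciding whether a loan applicant will default.
# Each record is meant to be a snapshot taken at decision time.
def make_record():
    will_default = random.random() < 0.2  # ground truth, unknown at decision time
    return {
        "income": random.gauss(50_000, 15_000),   # known at decision time: fine
        "num_prior_loans": random.randint(0, 5),  # known at decision time: fine
        # LEAK: collections activity only exists *after* a default has occurred,
        # so this field quietly encodes the very answer we want to predict.
        "collection_calls": random.randint(1, 9) if will_default else 0,
        "default": will_default,
    }

records = [make_record() for _ in range(10_000)]

# A "model" that simply thresholds the leaky field scores perfectly in-sample --
# but in production, collection_calls is always 0 at prediction time.
leaky_accuracy = sum(
    (r["collection_calls"] > 0) == r["default"] for r in records
) / len(records)
print(f"in-sample accuracy using leaked feature: {leaky_accuracy:.2f}")  # 1.00
```

In production, `collection_calls` would be zero for every new applicant, so real-world accuracy collapses to the base rate. The leak is only caught by asking, for every field, *when* it becomes available relative to the decision being modeled.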


Being truly data-driven means being led by data science at all levels, including correctly and precisely engineering the data for a model-driven business. Without a data-science-led approach, businesses cannot effectively engineer their data – including accurate business-process ontologies, predictor variables, and target variables – to meet the needs of a model-driven business that operates like a cash-generating machine. This comprehensive understanding of data science, percolating into the data engineering layer, enables businesses to optimize their data infrastructure, identify relevant variables, and develop accurate models that drive profitable operations. When I originally coined the term 'data engineering', I was referring to the meticulous art and science of constructing a well-curated dataset with the right statistical characteristics, including event-space engineering and target engineering – but the IT industry has hijacked the term and turned it into a reduction of itself, one that has led to a massive accumulation of technical debt that doesn't serve data science! By embracing a data-science-led approach, organizations can unlock the true potential of their data and harness its power to generate substantial value and drive business success. Data science doesn't just inform strategy; it is critical for proper execution – and as such, it must be involved in designing the execution pipeline. If it is just data science in one ear and business as usual out the other, it doesn't do you any good!


"We have lots of data, we have IT assets dedicated to data, we use all the latest BI tools, we have analysts…" is not what characterizes a modern data-driven cash machine. Being data-driven goes beyond having IT assets and BI tools. It involves fostering a culture that values data-driven decision-making – which includes re-designing the machine itself, not just throwing a data scientist at a bad machine; developing a clear data strategy; integrating and harmonizing data across epochs, interventions, and change histories; leveraging advanced analytics and deep science; and continuously learning and adapting to reality. Legacy businesses may think they are data-driven, but without embracing these broader aspects, they fall short – badly short. True data-driven organizations prioritize:

- Data culture – not just parroting the words 'I love data!' or 'data is important!';
- A data strategy well integrated with their tactical execution machinery, backed by an organic in-house build plan (not just a buy plan!);
- Heavy investment in resisting and eliminating so-called 'technical debt' – your tech debt will kill you!;
- Data-science-led infrastructure choices – tech debt often arises from poor choices made without data science in mind;
- Data-science-led data governance and data curation;
- Cross-domain accessibility – un-silo your data, and monetize it as you do!;
- Multi-disciplinary, experienced data scientists – hiring well, really, really well!;
- Agility – tied to good infrastructure and pipelines.

By understanding the full scope of being data-driven, businesses can bridge the gap and harness the power of data to drive meaningful outcomes and stay competitive in the market.


Modernizing data infrastructure is a critical investment for businesses aiming to leverage data science. By aligning that infrastructure with the needs of a data-driven business, organizations can unlock the full potential of their data assets while gaining a significant and potent competitive edge.


Working with hundreds of millions of records and re-shaping data on the fly is not a trivial task. Data scientists re-organize and transform data iteratively, many times over, before generating a single production model. If they aren't equipped with correctly architected, scalable infrastructure to do this, they can never get a model out the door. The problem is that it takes time, resources, and focus to dig yourself out of the accumulated technical debt that prevents you from operating the right way.

Investing in modern data infrastructure empowers businesses to unlock the full potential of their data assets, enabling them to leverage data science, gain actionable insights, make informed decisions, and drive innovation. It sets the foundation for a data-driven business edge, positioning organizations for success in today's competitive and rapidly evolving landscape.

Leading this blog with the words "Data Quality" and finishing it on a "Data Infrastructure, Data Engineering" note is not a mistake. ALL of this is related to data quality, and data quality is related to the performance of your machine in an efficient-leaning marketplace! All of it is tied together under the larger umbrella of good Data Science. Building your data-driven business is building a cash-generating machine, the heart of which relies on a set of models that make it react in real time to finite market opportunities – all of which depend on the quality of your data. To get the outcomes you want, you have to reverse-engineer your business from the ground up to serve data. When you serve your data well, it will serve you back, in kind, many times over.

A data scientist who understands these nuances must rebuild your business processes from the bottom up to serve the science – to serve the model (or set of models) at the heart of the cash machine. As an old boss of mine often said, "Hope is not a business strategy!" Solid, data-science-led infrastructure and engineering is foundational to the smooth operation of a well-oiled machine.
