The Accidental Data Scientists

If you are on the younger side, you may have chosen to study data science and to become a data scientist. But if you are a bit older like me, you might have studied something else entirely (in my case, economics and finance over the course of a bachelor’s degree, master’s degree, and Ph.D.) and then realized that the methods that data scientists developed were superior to the methods of interrogating data that you had been taught. Circumstance made you a data scientist. You worked in a field that transitioned from humans being the cost-effective, best-in-class decision makers to algorithms usurping our spot. You worked in a field where the empirical methods began advancing far faster than the rest of the field.

Ours is an interesting and important story, and I would argue that we stand in a position of strength. Far from being stranded helplessly above a chasm between machine learning and domain expertise, we are a necessary bridge between the two worlds, one that makes stronger models possible. We are cyborgs, straddling human expertise and machine learning algorithms. We are accidental data scientists, falling into a role out of a necessity born of advancements in empirical methods. I cannot tell everyone’s story, but I can tell my own. And I welcome you to do the same.

If you are like me, you have spent the past few years reading and watching videos on random forests and gradient boosting, transformers and attention, convolutional neural networks, Adam versus Rectified Adam, and swish versus leaky ReLU activation functions. You watched Ken Jee and StatQuest to understand more about the field and its methods. You realized there was a gap between the models that you had been taught in school and the current state of the art.

Over the majority of the past fifteen years, I used economic intuition and occasionally structural models as the basis for building strategies and portfolios. I would think carefully about what statistical test I needed to run to ensure that I had corrected for multiple testing or autocorrelation and heteroskedasticity of standard errors. If the standard tools failed, bootstrapping standard errors was also an option. All testing was done in-sample, so tremendous effort was expended trying to understand significance of model parameters.
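
As a rough illustration of the bootstrapping fallback mentioned above, here is a minimal sketch that estimates the standard error of a mean return by resampling with replacement. The simulated monthly returns, the function name `bootstrap_se`, and its parameters are all assumptions made for the example. It also resamples observations independently, which ignores autocorrelation; returns with serial dependence would call for a block bootstrap instead.

```python
import random
import statistics


def bootstrap_se(sample, stat=statistics.fmean, n_boot=2000, seed=0):
    """Estimate the standard error of `stat` by i.i.d. resampling with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(n_boot):
        # Draw n observations with replacement and recompute the statistic.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        estimates.append(stat(resample))
    # The spread of the resampled statistics approximates the standard error.
    return statistics.stdev(estimates)


# Hypothetical data: 240 simulated monthly returns (not real market data).
data_rng = random.Random(42)
returns = [data_rng.gauss(0.01, 0.05) for _ in range(240)]

se_boot = bootstrap_se(returns)
se_analytic = statistics.stdev(returns) / len(returns) ** 0.5

print(f"bootstrap SE: {se_boot:.5f}")
print(f"analytic SE:  {se_analytic:.5f}")
```

For the sample mean the two numbers should land close to each other; the bootstrap earns its keep for statistics that have no convenient closed-form standard error.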

These days, understanding Newey-West standard errors, Fama-MacBeth regressions, or most asset pricing models is no longer necessary for effective model creation. In-sample fit is a metric used for debugging, not a metric viewed as relevant to expected model performance. Statistical significance of particular model parameters is forgotten. All prediction is evaluated on cross-validation sets. (Finance journal articles, including those that I write, still largely live in the old world, for generally understandable reasons that I will not discuss now.)
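
To make the contrast between in-sample fit and cross-validated performance concrete, here is a small sketch under stated assumptions: simulated noisy data, a deliberately flexible 1-nearest-neighbour model, and plain k-fold cross-validation. The helper names (`knn1_predict`, `mse`, `kfold_mse`) are hypothetical, and real return prediction would typically need time-aware splits rather than the simple folds used here.

```python
import random


def knn1_predict(train, x):
    """Predict y for x using the single nearest training point (1-NN)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]


def mse(train, test):
    """Mean squared error of 1-NN predictions fit on `train`, scored on `test`."""
    return sum((knn1_predict(train, x) - y) ** 2 for x, y in test) / len(test)


def kfold_mse(data, k=5):
    """Average held-out MSE across k folds (data assumed already shuffled)."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        scores.append(mse(train, test))
    return sum(scores) / k


# Hypothetical data: a linear signal plus unit-variance noise.
rng = random.Random(0)
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]
data = [(x, 0.5 * x + rng.gauss(0.0, 1.0)) for x in xs]

in_sample_mse = mse(data, data)  # each point is its own nearest neighbour
cv_mse = kfold_mse(data)         # error on points the model has not seen

print(f"in-sample MSE:       {in_sample_mse:.3f}")
print(f"cross-validated MSE: {cv_mse:.3f}")
```

The model memorizes the training set, so its in-sample error is zero while its held-out error is not. That gap is why in-sample fit works as a sanity check rather than as an estimate of future performance.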

The Role of Cyborgs

All is not lost for us cyborgs. In my mind, we hold three great strengths. The first is feature engineering. Practically, in quantitative equity management, we are unlikely to engineer any more financial and technical signals. There are a few hundred of them in total, and machine learning algorithms can already infer whatever combination of those signals we might think provides a sharper signal. Instead, feature engineering now comes primarily from alternative data. You still use the standard signals, and they are still valuable, but new signals will come from new, untapped datasets.

Our comparative advantage is knowing what is likely to be predictive. In equity return prediction, the most predictive measures, in roughly descending order, are smart money flows, analyst forecasts, market data, and profitability metrics. There are many other signals, and those are important too, but these generally offer the biggest bang for the buck. Knowing that narrows my search space in a way that is unlikely to be available to someone without domain expertise.

Our second strength is all the things beyond the prediction model. In quantitative equity management, currently, covariance estimation and portfolio optimization are not pure machine learning games. There are many domain-specific things to consider like taxation, transaction costs, and liquidity requirements.

Our final strength goes beyond what a typical domain expert can do. Pure domain experts often engineer linear combinations of signals, which will not improve a machine learning model’s predictive power. Pure domain experts are often mystified by the outputs of machine learning models. They may not know how to properly cross-validate tests and parameters. That is where our machine learning understanding comes into play.

The cyborg is useful. The domain and ML expert combined into one person has a unique value that is not captured by having a separate domain expert and a separate ML expert.

What should you learn to be a quant finance cyborg?

I cannot speak for all domains, but imagine you want to go into quantitative investment management. You want to be both a domain expert and an ML expert. What should you start with? Machine learning. If you work alongside investment management quants, we can fill you in on all the domain-specific issues that you run into. I would much rather have an ML expert with no investment management background than an investment management expert with no ML background.

ML is somewhat generalizable, whereas quantitative finance is exceedingly specific. If you learn a lot about bonds, that will barely be useful to you if you want to be a quant focused on equities or options or commodity futures. Spend 90% of your time studying the generalizable, technically difficult field. Spend 10% of your time studying the domain. When you start working in the field, those numbers will flip. If you do not already have a good sense of machine learning by then, you will be relegated to the role of the domain expert who does not understand machine learning. That is a fine place to be if you are happy with it, but you are probably here because you want something that embraces both domain expertise and machine learning.

Some Parting Words…

I hope this resonated with my fellow accidental data scientists out there. I spent many of my years of education studying methods that quickly became outdated. In those situations, we have a choice. We can entrench ourselves and convince ourselves and those around us that nothing has changed, and that the old methods passed down from the ancients are still optimal. Or we can get curious and learn about where our understanding falls short. We have chosen the latter.

Thomas Arnold, PhD

Full Stack Data Scientist | Rising Health Risk is Predictable | Let me show you how to do it!

3 years ago

I was thinking about posting a survey on LinkedIn asking data scientists whether they were "accidental data scientists" or "intentional data scientists." I suppose that it would vary by age, since the field is so new. Thanks for sharing your creation story.

Gordon Ross, CFA

Data Analyst, Investment Practitioner

3 years ago

Thank you for your breadth of insight, Vivek. I now aspire to be an accidental data scientist also.

Farshad Saadatmand

Business Analyst @ MTI | PhD, MBA

3 years ago

I think your story greatly sheds light on the current existing gap in today’s investment management industry. Many experts in finance are facing a critical decision now: evolution or extinction. Glad you’re going with the former. Thanks Vivek Viswanathan for sharing this post!

Chethan Pai

Associate Principal, Analytics Research at SimCorp | Columbia University

3 years ago

I can totally resonate with this situation. I am going back to school this fall after working for several years in FI portfolio performance analytics space, and the feature engineering is exactly driving my course selection. Thank you for this post.
