
Why good physicists make good data scientists?

Ilia Ekhlakov

Senior Data Scientist @ Wrike | B2B SaaS | Revenue Strategy & Ops | MSc in Physics | 9 YoE

Published: February 27, 2024

An academic background in physics is often listed among the preferred qualifications for Senior+ Data Science vacancies. Personally, I know a lot of good physicists who have successfully transitioned into Data Science. Conversely, I do not know any who moved into this field and stayed in Junior or Middle positions for long. This could be due to sampling bias, but let's try to determine the reasons behind it.

Some of these reasons are quite obvious, such as good mathematical training and structured thinking. Problem-solving skills and experience in experimental planning also play a role.

To discover less obvious reasons, let's think about who becomes a good physicist. Usually, it is people who are genuinely passionate about understanding how the world works. This passion drives them to explore a wide range of topics, even those far beyond their formal scientific field of interest. After all, how else can you discover and understand the often unusual connections between different entities and the laws of our world? To see how this works, be sure to read "Surely You're Joking, Mr. Feynman!" by Nobel Prize-winning physicist Richard Feynman, if you have not already.

But are broad horizons and the ability to find unexpected connections between different concepts really that critical for Data Science? In fact, both are very important, and here's why. One of the major differences between real-world ML and academic ML is the almost unlimited possibilities for creating features. In almost all organizations, we have access to hundreds of metrics potentially related to the prediction target, from which we can assemble features that describe the most complex patterns we can imagine. If we also take into account the possibility of enriching our internal data with external data, the only limit is the cost of purchasing such data.

Let's consider this using telecommunications data as an example. It's no secret that telecom operators and banks possess a vast amount of knowledge about their customers.

Demographic data from the contract and social data derived from calls and SMS can already be used to create a lot of features.
By fuzzy matching the sequence and timing of base station changes, we can guess who else this subscriber spends the evening with, especially if the two call each other, and especially if they have done so recently.
You know the route and average speed at which the subscriber returns home. You can even guess where they ran a red light, because they kept moving while other subscribers in the same area stopped.
With DPI (deep packet inspection), you can infer a subscriber's interests, although with the widespread adoption of HTTPS the task has become much harder, and the quality of the information obtained is generally much lower.
You know which gas stations they usually refuel at, which cafés they usually visit after work, where they usually go before that, and where they like to go afterwards.
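
As a rough illustration of the base-station idea above, here is a minimal sketch in Python. It assumes each subscriber's handovers are available as a list of (timestamp, station ID) pairs; the function name `colocation_score`, the 5-minute tolerance, and the sample data are all hypothetical and chosen only for the example. A real pipeline would also have to deal with station coverage overlap and with the false positives such a simple rule produces.

```python
from datetime import datetime, timedelta


def colocation_score(events_a, events_b, tolerance=timedelta(minutes=5)):
    """Fraction of A's handovers that B repeats at the same station within `tolerance`.

    events_a, events_b: lists of (timestamp, station_id) tuples (assumed format).
    """
    if not events_a:
        return 0.0
    matched = 0
    for ts_a, station_a in events_a:
        # "Fuzzy" match: same station, timestamps close but not necessarily equal.
        if any(
            station_b == station_a and abs(ts_b - ts_a) <= tolerance
            for ts_b, station_b in events_b
        ):
            matched += 1
    return matched / len(events_a)


# Illustrative (made-up) evening traces for three subscribers.
t0 = datetime(2024, 2, 27, 19, 0)
alice = [(t0, "BS1"), (t0 + timedelta(minutes=10), "BS2"), (t0 + timedelta(minutes=25), "BS3")]
bob = [(t0 + timedelta(minutes=2), "BS1"), (t0 + timedelta(minutes=12), "BS2"), (t0 + timedelta(minutes=27), "BS3")]
carol = [(t0 + timedelta(minutes=1), "BS1"), (t0 + timedelta(minutes=40), "BS7")]

score_bob = colocation_score(alice, bob)      # all three handovers match
score_carol = colocation_score(alice, carol)  # only the first one matches
```

A high score for a pair of subscribers is only a hint that they travel together; combining it with call history, as the text suggests, is one way to make the guess more reliable.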

Now, let's add external data here. The number of new opportunities is literally exploding, but I want to highlight two:


Currently, almost all bank notifications are delivered as push messages if the bank app is installed on the phone. But back when SMS dominated banking notifications, you could use fuzzy matching on time and location, together with exact matching of the receipt amount against the details provided by the fiscal data provider, to determine what the customer bought at the supermarket and which croissant they ordered at the café.
From the phone's accelerometer data, we know where the user had to brake suddenly and how often they drop the phone. Gyroscope data, combined with geolocation analytics, helps distinguish between the two.
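
The receipt-matching idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the record layout (`amount_cents`, `time`, `lat`, `lon`, `items`), the function `match_receipts`, the 3-minute and 300-metre tolerances, and the sample records are all hypothetical, not a description of any real fiscal data feed.

```python
import math
from datetime import datetime, timedelta


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def match_receipts(sms, receipts, max_delay=timedelta(minutes=3), max_km=0.3):
    """Return receipts with exactly the SMS amount, close in time and in space."""
    return [
        r for r in receipts
        if r["amount_cents"] == sms["amount_cents"]          # exact match on amount
        and abs(r["time"] - sms["time"]) <= max_delay        # fuzzy match on time
        and haversine_km(sms["lat"], sms["lon"], r["lat"], r["lon"]) <= max_km  # fuzzy match on location
    ]


# Made-up example: one bank SMS and three candidate receipts.
sms = {"amount_cents": 450, "time": datetime(2024, 2, 27, 8, 31), "lat": 50.0755, "lon": 14.4378}
receipts = [
    {"id": "r1", "amount_cents": 450, "time": datetime(2024, 2, 27, 8, 30),
     "lat": 50.0756, "lon": 14.4380, "items": ["croissant", "espresso"]},
    {"id": "r2", "amount_cents": 450, "time": datetime(2024, 2, 27, 8, 30),
     "lat": 50.1100, "lon": 14.5000, "items": ["sandwich"]},   # same amount, several km away
    {"id": "r3", "amount_cents": 990, "time": datetime(2024, 2, 27, 8, 31),
     "lat": 50.0755, "lon": 14.4378, "items": ["salad"]},      # right place, wrong amount
]
matched_ids = [r["id"] for r in match_receipts(sms, receipts)]
```

Amounts are kept as integer cents so that the exact comparison is not affected by floating-point rounding. When several receipts survive the filter, the match is ambiguous and would need additional signals to resolve.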

It is clear that all of the heuristics described above produce a certain share of false positives and false negatives. What matters for us right now, however, is that the ideas for potential model features are usually too numerous to count. At the same time, collecting, validating, storing, and exploring each feature requires significant resources. That is why it is so crucial to quickly identify the most promising features and to prioritize them correctly. It is difficult to imagine how this could be done without extensive domain knowledge, which is gained both through the Data Scientist's own research and experience and through close communication with colleagues from different departments, combined with an excellent general education, a broad outlook, and the ability to uncover and verify non-obvious patterns. These are the qualities in which the physicist's mind shows itself at its best.
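
One simple way to make such prioritization explicit is to score each idea by its expected payoff per unit of effort. The sketch below is an assumption of mine, not a method described in this article: the `FeatureIdea` class, the scoring formula, and all the numbers are hypothetical and serve only to show the mechanics.

```python
from dataclasses import dataclass


@dataclass
class FeatureIdea:
    name: str
    expected_gain: float  # guessed uplift in the target metric, arbitrary units
    success_prob: float   # subjective probability that the signal is real
    cost_days: float      # estimated effort to collect, validate, and store

    @property
    def priority(self) -> float:
        # Expected gain per day of effort (a deliberately crude heuristic).
        return self.expected_gain * self.success_prob / self.cost_days


# Made-up estimates for three ideas from the examples above.
ideas = [
    FeatureIdea("evening_companions", expected_gain=3.0, success_prob=0.5, cost_days=10),
    FeatureIdea("receipt_matching", expected_gain=5.0, success_prob=0.3, cost_days=30),
    FeatureIdea("contract_demographics", expected_gain=1.0, success_prob=0.9, cost_days=2),
]

ranked = sorted(ideas, key=lambda idea: idea.priority, reverse=True)
ranked_names = [idea.name for idea in ranked]
```

The formula itself is trivial; the hard part is supplying believable values for `expected_gain` and `success_prob`, which is exactly where domain knowledge and a broad outlook come in.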

Don't get me wrong. I don't mean to imply that data scientists with other scientific backgrounds do not exhibit similar qualities. This would be contrary to even my own experience. It's simply that in this article, I wanted to discuss the influence of the physicist's perspective on the issue raised above.

There are, however, aspects of the physicist's mindset that need to change for one to become a truly exceptional data scientist. This can be illustrated by an anecdote about the renowned Soviet physicist, and another Nobel laureate, Lev Davidovich Landau.

Once an experimenter caught Landau in the corridor and asked him to explain the graph on a piece of paper. Landau explained. "But you're holding the graph upside down!" exclaimed the experimenter. Landau turned the paper over and explained again.

Because physicists can produce a plausible interpretation even of erroneous data, they need to develop special caution about the quality of the input data and a highly critical attitude towards their own conclusions. With experience, however, this can even turn into a strength.

Marc C.

Macroeconomic Risk | AI, Deep Learning, Quant Enthusiast

1 year ago

Economists as well ;)

MiDShift | Career boost

1 year ago

Valuable insights, Ilia Ekhlakov.
