Why good physicists make good data scientists

An academic background in physics is often listed among the preferred qualifications for senior-level Data Science vacancies. Personally, I know many good physicists who have successfully transitioned into Data Science. Conversely, I do not know any who moved into the field and then stayed in junior or middle positions for long, though this could be sampling bias. Let's try to work out the reasons behind this.

Some of these reasons are quite obvious, such as good mathematical training and structured thinking. Problem-solving skills and experience in experimental planning also play a role.

To uncover the less obvious reasons, consider who becomes a good physicist. Usually, it is people who are genuinely passionate about understanding how the world works. This passion drives them to explore a wide range of topics, even those far beyond their formal field of study. After all, how else can you discover and understand the often surprising connections between different entities and the laws of our world? To see how this works in practice, read "Surely You're Joking, Mr. Feynman!" by Nobel Prize-winning physicist Richard Feynman, if you haven't already.

But are broad horizons and the ability to find unexpected connections between concepts really that critical for Data Science? They are, and here is why. One of the major differences between real-world ML and academic ML is the almost unlimited scope for creating features. In almost any organization, we have access to hundreds of metrics potentially related to the prediction target, from which we can assemble features describing the most intricate patterns we can imagine. If we also account for the possibility of enriching internal data with external sources, the only limit is the cost of purchasing that data.

Let's consider the example of telecommunications data. It's no secret that telecom operators, like banks, possess a vast amount of knowledge about their customers.

  • Demographic data from the contract, combined with the social graph built from calls and SMS, already supports a wealth of features.
  • By fuzzy-matching the sequences and timing of base-station handovers, we can suggest whom a subscriber spends their evenings with, especially if the two also call each other, and especially if they have done so recently.
  • You know the route and the average speed at which a subscriber returns home. You can even guess where they ran a red light: they kept moving while other subscribers in the area stopped.
  • With DPI (deep packet inspection), you can infer a subscriber's interests, although with the widespread adoption of HTTPS the task has become much harder and the quality of the extracted information is generally much lower.
  • You know which gas stations they usually refuel at, which cafés they usually visit after work, where they usually go before that, and where they like to go afterwards.
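The co-location idea from the list above can be sketched in a few lines. This is a minimal, purely illustrative toy: the function name, the 30-minute window, and the event traces are all hypothetical, and a real pipeline would work over billions of events with indexed joins rather than nested loops.

```python
from datetime import datetime, timedelta

def colocation_score(events_a, events_b, window=timedelta(minutes=30)):
    """Fraction of subscriber A's base-station events that subscriber B
    matched at the same cell within the time window -- a crude proxy
    for 'spending the evening together'."""
    if not events_a:
        return 0.0
    matched = sum(
        1 for t_a, cell_a in events_a
        if any(cell_a == cell_b and abs(t_a - t_b) <= window
               for t_b, cell_b in events_b)
    )
    return matched / len(events_a)

# Hypothetical evening traces: (timestamp, cell_id)
a = [(datetime(2024, 5, 1, 19, 0), "cell_17"),
     (datetime(2024, 5, 1, 20, 15), "cell_42"),
     (datetime(2024, 5, 1, 22, 30), "cell_03")]
b = [(datetime(2024, 5, 1, 19, 10), "cell_17"),
     (datetime(2024, 5, 1, 20, 20), "cell_42"),
     (datetime(2024, 5, 1, 23, 50), "cell_99")]

print(colocation_score(a, b))  # 2 of the 3 events are co-located
```

A high score on its own proves nothing; it becomes a feature only in combination with other signals, such as recent calls between the pair.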

Now, let's add external data. The number of new opportunities explodes, but I want to highlight two:

  • Currently, almost all bank notifications arrive as push messages if the banking app is installed on the phone. But back when SMS dominated banking, one could combine fuzzy matching on time and location with exact matching on the receipt amount provided by the fiscal data operator to determine what the customer bought at the supermarket, or which croissant they ordered at the café.
  • From the phone's accelerometer data, we know where the user had to brake suddenly and how often they drop the phone. Gyroscope data, on top of geolocation analytics, helps distinguish between the two.
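The receipt-matching idea above reduces to "exact match on amount, fuzzy match on time." Here is a minimal sketch under invented data; in practice, amounts would be stored as integer cents and ambiguous multi-matches would need resolution, both of which this toy ignores.

```python
from datetime import datetime, timedelta

def match_receipts(debits, receipts, window=timedelta(minutes=5)):
    """Pair each card debit (time, amount) with a fiscal receipt
    (time, amount, merchant): the amount must match exactly, the
    timestamps only approximately."""
    pairs = []
    for t_d, amount in debits:
        for t_r, amt_r, merchant in receipts:
            if amount == amt_r and abs(t_d - t_r) <= window:
                pairs.append((amount, merchant))
                break  # take the first plausible receipt
    return pairs

# Hypothetical morning purchase
debits = [(datetime(2024, 5, 1, 8, 3), 4.50)]
receipts = [(datetime(2024, 5, 1, 8, 2), 4.50, "cafe: croissant + espresso"),
            (datetime(2024, 5, 1, 8, 2), 12.00, "supermarket")]

print(match_receipts(debits, receipts))  # matches the café receipt
```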

Clearly, every heuristic described above carries some rate of false positives and false negatives. What matters here, though, is that ideas for potential model features are practically uncountable, while collecting, validating, storing, and exploring each feature requires significant resources. That is why it is so crucial to quickly identify the most promising features and to rank them correctly by priority. It is hard to imagine doing this without deep domain knowledge, gained both through the data scientist's own research and experience and through close communication with colleagues from other departments, together with a solid general education, a broad outlook, and the ability to uncover and verify non-obvious patterns. These are the qualities in which the physicist's mind shows itself at its best.
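One cheap way to triage candidate features before paying for full collection pipelines is to score each on a small sample against the target. The sketch below uses a plain Pearson correlation as the screening metric; that choice, the feature names, and the numbers are all my own illustration, not a prescribed method, and correlation misses non-linear effects that a metric like mutual information would catch.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_features(candidates, target):
    """First-pass triage: rank candidate features by |correlation| with
    the target, so the costly collection effort goes to the top ones."""
    scores = {name: abs(pearson(values, target))
              for name, values in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical sample: churn labels and two candidate features
target = [0, 0, 1, 1, 1, 0]
candidates = {
    "evening_colocation": [0.1, 0.2, 0.9, 0.8, 0.7, 0.3],
    "avg_commute_speed": [40, 55, 42, 38, 60, 50],
}
print(rank_features(candidates, target))
```

The screening metric is interchangeable; the point is that a cheap, imperfect ranking over hundreds of ideas beats exhaustively engineering them all.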

Don't get me wrong: I don't mean to imply that data scientists from other scientific backgrounds lack these qualities; that would contradict my own experience. It's simply that in this article I wanted to discuss the physicist's perspective on the question raised above.

There are, however, aspects of the physicist's mindset that need adjusting to become a truly exceptional data scientist. This can be illustrated by an anecdote about the renowned Soviet physicist and fellow Nobel laureate Lev Davidovich Landau.

Once an experimenter caught Landau in the corridor and asked him to explain the graph on a piece of paper. Landau explained. "But you're holding the chart upside down!" exclaimed the experimenter. Landau turned the paper over and explained again.

The ability to produce a plausible interpretation of even erroneous data means physicists must develop special caution about the quality of their input data and a healthy skepticism towards their own conclusions. With experience, however, this discipline turns into a genuine strength.

