Research Mindset for Data Scientists: “First Principle” Thinking

“Do I need a Ph.D. to be a Data Scientist?” This is a common question asked by many who are interested in joining the field. Quite a few blogs elaborate on why you do not need a Ph.D. to become a Data Scientist (e.g. here is one), and I think they generally make sense. Still, there is something extra that Ph.D. graduates develop during their multi-year research experience: research mindsets. Since I cannot easily find a formal definition elsewhere, I will simply make one up:

Research mindsets are the thinking patterns or methods commonly used by researchers to synthesize a holistic picture of the problem space from existing literature, propose specific questions that are valuable to address, and identify innovative solutions so that our knowledge in the area can be pushed one step further.

Although a Ph.D. is a good way to build up research mindsets, primarily due to its extensive training and the pressure to publish peer-reviewed papers, I don’t think it is the only route: as long as one is conscious about developing such a mindset, one can practice it in any field or industry. So, here, I am going to share one research mindset: “first principle” thinking. Keep in mind that a research mindset is not something magical that transforms one into a “superperson” at problem-solving; it is more about providing a consistent perspective from which to view a problem, and that perspective can be unique and may lead to deeper insights or innovative solutions. For me, having research mindsets has definitely helped my career as a data scientist. With that said, let’s get started.

Defining “first principle” thinking

The Wikipedia version

According to Wikipedia, “a first principle is a basic proposition or assumption that cannot be deduced from any other proposition or assumption … in physics and other sciences, theoretical work is said to be from first principles, or ab initio, if it starts directly at the level of established science and does not make assumptions such as empirical model and parameter fitting”.

A stylised lithium atom. Image: SVG by Indolences, with recoloring by Rainer Klute, based on Stylised Lithium Atom.png by Halfdan. CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=1675352

For example, in computational chemistry (my Ph.D. research field), if a study claims to work from “first principles”, its computation should be based on quantum mechanics, because any chemical reaction can be viewed as a manifestation of the quantum world. Hence, “first principle” thinking in this research field means modeling the nuclei and electrons with the Schrödinger equation, propagating those interactions through the chemical system, and then evaluating the associated properties. In this “first principle” approach, we can see two stages:

  • (identification) one needs to identify a basic proposition or assumption first, e.g. the Schrödinger equation
  • (application) then, one needs to apply that proposition or assumption to the system to study the desired property, e.g. run simulations to see how the chemical reaction happens

A Data Science friendly version

Data scientists usually work in industries far away from fundamental physics research, where there is usually no “Schrödinger equation” to go back to, so some adjustments are needed in how to find a proper “basic assumption”. I think a good “basic assumption” should have three attributes:

First, a “basic assumption” should be basic/fundamental within its problem context. For example, if the problem context is “forecasting subscription revenue” for a B2B company, then each enterprise client’s behavior (e.g. subscription, upsell, churn) could be considered basic/fundamental; if the problem context is “optimizing the cost of building a rocket”, then understanding the key rocket components and what each is made of would be considered basic/fundamental. The assumption usually sits at a lower level than the problem context, so that once it is identified, it can be applied at the higher level to help solve the problem.

Second, a “basic assumption” needs to be mathematically quantifiable, even if it may not be the best description of reality. For example, suppose we want to study how consumers respond to various promotions in order to maximize user growth. There are two possible assumptions: 1) each consumer is rational; 2) each consumer is irrational. Based on economic research, we know consumers are irrational (link, link): their decisions are influenced by many factors and may not even be self-consistent. However, it is almost impossible to model an “irrational” consumer mathematically, so that assumption, even if correct, may not work as the “basic assumption”. Meanwhile, the “rational consumer” assumption, despite its many known issues, can be expressed in mathematical terms, so along with a few corrections to account for “irrational” behavior under different circumstances, it can serve as a better “basic assumption”.
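To make “mathematically quantifiable” concrete, here is a minimal sketch (my own illustration with hypothetical numbers, not taken from any cited research) of a “rational consumer”: a utility-maximizing decision rule, with a logistic noise term as one crude correction for “irrational” behavior.

```python
import numpy as np

def prob_redeem(discount, price, perceived_value, noise_scale=0.5):
    """P(consumer redeems a promotion) under a 'rational consumer' assumption.

    A perfectly rational consumer redeems exactly when the surplus
    (perceived value minus discounted price) is positive; the logistic
    smoothing is one simple correction for 'irrational' noise.
    """
    surplus = perceived_value - (price - discount)
    return 1.0 / (1.0 + np.exp(-surplus / noise_scale))

# Hypothetical numbers: a $10 product perceived to be worth $8.50
for discount in (1.0, 3.0):
    p = prob_redeem(discount, price=10.0, perceived_value=8.5)
    print(f"discount=${discount:.2f} -> P(redeem)={p:.2f}")
```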

Third, a “basic assumption” should be verifiable in the given context. Since almost no proposition or assumption holds true forever, we need to add one extra stage to the “first principle” thinking process in the Data Science world:

  • (validation) use data to support the chosen proposition or assumption, to verify whether it is valid under the given context and is not violated during the application stage

So, the three stages, “identification -> validation -> application”, constitute “first principle” thinking for problem-solving in the Data Science world.

Putting “first principle” thinking into action

To better illustrate the “first principle” thinking flow, I am going to share one Data Science project as a case study, with the disclaimer that what is stated here is a much-simplified version and could differ slightly from reality.

Context: back when I was working on the risk team at the biggest ride-hailing company (hint: the name starts with a “U”) in 2016, fraud went through the roof in some international markets. Due to fierce competition abroad, the company was offering big incentives to lure drivers to stay on the platform, and with lucrative incentives come fraudulent activities. A commonly known scheme: both the rider and the driver can be “ghosts” (created out of thin air). The “ghost” rider requests a trip intended only for the “ghost” driver, generating a “ghost” trip; after the “ghost” trip finishes, the “ghost” rider uses the “cash” option to pretend to pay the “ghost” driver, and the “ghost” driver later collects real incentive money from the platform. Fraudsters can use advanced GPS spoofing technology to enable this scheme (link), and it is a constant cat-and-mouse game to figure out which trips are GPS spoofed. In the following figure (credit), one can see a GPS-spoofed trip whose GPS altitude is above the ground, so the trip is literally happening in the air.

Credit: https://eng.uber.com/advanced-technologies-detecting-preventing-fraud-uber/. Fake trip GPS shown in red, normal trip GPS in green.

Here comes the problem: we only know how many fraudulent trips our existing detection technology catches, not how many more are still out there. This is a “we-don’t-know-the-unknown” situation, and one we wanted to address. So we came up with a methodology, which I will structure using the three stages of “first principle” thinking:

Step 1. Identification: to understand incentive fraud prevalence on the platform, one natural lower level is the individual trip request, i.e. whether a given trip request is fraudulent or not. To successfully start a fraudulent trip, ONLY a “ghost” driver should be dispatched for the “ghost” rider’s request: the two need to collude with each other. If by accident a normal driver is dispatched for the “ghost” rider’s request, the “ghost” rider will cancel the request for sure, 100% of the time. Such 100% cancellation behavior would not happen for normal riders, who are mostly agnostic to which exact driver they are matched with. The mathematical form of this “basic assumption” is

P(rg, dn) = 100%, P(rg, dg) = 0%

where P denotes the rider cancellation probability, “rg” a “ghost” rider, “dg” a “ghost” driver, and “dn” a normal driver.

A normal rider’s cancellation behavior would be fundamentally different: much more gradual. For simplicity, let’s assume the rider cancellation probability depends only on the rider-driver distance at dispatch time. For example, if the rider sees the driver close by (e.g. 100 meters away), the rider is much less likely to cancel the request; but if the dispatched driver is still far away (e.g. 5 miles), the rider is likely to cancel in the hope that there are closer drivers. In mathematical form:

P(rn, dn | distance=X) = P(rn, dg | distance=X)

P(rn, d | distance=X) ≤ P(rn, d | distance=X+1)

where “rn” denotes a normal rider and “d” any driver.
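These two assumptions are simple enough to write down as a toy model. Below is a minimal sketch in Python; the logistic curve for normal riders is a hypothetical stand-in for whatever market-specific curve the data actually shows.

```python
import numpy as np

def cancel_prob(rider, driver, distance_km):
    """Rider cancellation probability under the 'basic assumption'.

    rider/driver are 'ghost' or 'normal'. A ghost rider cancels with
    certainty unless matched with the colluding ghost driver; a normal
    rider's cancellation depends only on pickup distance, regardless
    of which driver is dispatched.
    """
    if rider == "ghost":
        return 0.0 if driver == "ghost" else 1.0
    # Hypothetical monotonically increasing curve for normal riders:
    # ~5% at 1 km, 50% at 4 km.
    return 1.0 / (1.0 + np.exp(-(distance_km - 4.0)))

print(cancel_prob("ghost", "normal", 1.0))   # 1.0: collusion broken, always cancels
print(cancel_prob("normal", "normal", 1.0))  # ~0.05
print(cancel_prob("normal", "ghost", 1.0))   # identical: driver-agnostic
```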

Step 2. Validation: we looked at historical trip request, dispatch, and cancellation data, and confirmed our understanding of normal riders’ cancellation behavior. Although the specific numbers vary across markets due to many factors (e.g. rider expectations), the basic principle holds.

A hypothetical curve showing rider cancellation behavior change w.r.t. rider-driver distance increase

We can also run experiments in non-fraudulent markets, where no driver incentives exist, to further confirm the assumption that a normal rider’s cancellation behavior is agnostic to the specific driver being dispatched. The key point is that such assumptions can be validated with data.
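As a sketch of what this validation could look like in code (the file name and columns are hypothetical; real trip logs will differ), one can bucket historical dispatches by pickup distance and check that the cancellation rate rises monotonically:

```python
import pandas as pd

# Hypothetical schema: one row per dispatch, with the rider-driver
# distance at dispatch time and whether the rider canceled (0/1).
trips = pd.read_csv("dispatches.csv")  # columns: pickup_km, canceled

curve = (
    trips.assign(bucket=pd.cut(trips["pickup_km"], bins=[0, 1, 2, 4, 8, 50]))
    .groupby("bucket", observed=True)["canceled"]
    .agg(["mean", "count"])
)
print(curve)

# Sanity check for the 'basic assumption': the cancellation rate
# should not decrease as pickup distance grows.
assert curve["mean"].is_monotonic_increasing
```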

Step 3. Application: having gone through the identification and validation stages, we can apply the assumption by introducing a treatment in the dispatch system: randomly swap out the top driver to be dispatched when other drivers are also close to the rider. This way, even when the top driver is not dispatched, the rider experience is not dramatically impacted. Under this treatment, we can infer how many fraudulent trips are out there!
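A minimal sketch of what such a treatment could look like in the dispatch logic (the function name, probe rate, and distance threshold are all hypothetical, not the actual production system):

```python
import random

def dispatch_with_probe(candidates, probe_rate=0.01, close_km=0.5):
    """Treatment sketch: for a small random fraction of requests, swap the
    top-ranked driver for another candidate who is nearly as close.

    `candidates` is a list of (driver_id, distance_km) tuples sorted by
    dispatch rank; the top candidate is normally dispatched.
    """
    top, rest = candidates[0], candidates[1:]
    nearby = [c for c in rest if c[1] - top[1] <= close_km]
    if nearby and random.random() < probe_rate:
        return random.choice(nearby)  # probe: break potential collusion
    return top  # default: dispatch the top-ranked driver
```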

Let’s assume the normal rider cancellation rate is 5% without the treatment, and that for normal riders the treatment would raise it to an estimated 6% due to the change in dispatch distance; meanwhile, the actually observed cancellation rate under the treatment is 10%. If x is the fraudulent request share in the market, then:

With no treatment, colluding pairs are matched as planned and essentially never cancel, so the observed baseline rate, 0% * x + 5% * (1 - x) ≈ 5%, tells us almost nothing about x.

The experiment shows that (with treatment): 100% * x + 6% * (1 - x) = 10%

Then we can solve: x = (10 - 6) / (100 - 6) ≈ 4.26%
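The same back-of-the-envelope inference in a few lines of Python, using the illustrative numbers above:

```python
# Solve 100% * x + 6% * (1 - x) = 10% for x, the fraudulent request share.
p_ghost = 1.00     # ghost riders always cancel under the treatment
p_normal = 0.06    # estimated normal-rider cancellation rate under the treatment
p_observed = 0.10  # actually observed cancellation rate under the treatment

x = (p_observed - p_normal) / (p_ghost - p_normal)
print(f"Estimated fraudulent request share: {x:.2%}")  # ~4.26%
```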

Of course, such an approach would only be applied to a small random fraction (e.g. 1%) of trip requests in heavily incentivized areas, to probe how much collusion exists and estimate incentive fraud prevalence; the numbers used here are for demonstration purposes only. The proposed treatment also has limitations in which specific fraudulent behaviors it can measure, but it serves as a reasonable demonstration of how “first principle” thinking works.

A powerful hammer for the right problem

Now you know what “first principle” thinking is and how it can be used in the Data Science context. Congratulations!


Meanwhile, I would also like to remind you that “first principle” thinking should never be the only perspective you hold; under some circumstances, it can even cause more harm than benefit: drilling too deep for a lower-level principle may unnecessarily over-complicate the problem or lose track of the big picture. It is a powerful hammer, but not every problem is a nail. So don’t restrict yourself to any single mindset; stay open-minded and focus on solving the problem. That is the most effective route.

— — — — — — — — — — — — — —

The same content is also published on Medium and can be accessed here. If you enjoyed this article, please help spread the word by liking, sharing, and commenting.

Here are a few other articles you may be interested in:


  • Horizontal Innovation in Data Science
  • Driving Data Science Initiative: a Simple Four-Stage Model
  • Problem Solving as Data Scientist: a Case Study
  • Learnings from Parenting: Clockwise
  • How to innovate in Data Science
  • Building my 2020 reading list with a simple Python script
  • Using Git in Data Science: the Solo Master
  • My First Data Science Project
  • One Suggestion for Uber's Airport Pickup Experience