Operational Research: important tools for data scientists

When you read about data science, the focus is mostly on predicting values (i.e., extrapolating or interpolating values of an unknown function). However, almost every important data science project at Agoda contains some large constrained optimization problem that needs to be solved in a big data environment. In academia, these types of problems are studied under the discipline of Operational Research (OR), so one would (naively?) expect to see OR topics mentioned in data-science-focused CVs; alas, among the hundreds of resumes I reviewed in the last year, only a handful mentioned optimization skills explicitly. Why? I'm not really sure, but here is a curriculum I pulled from the Galvanize data science program:

  • Week 1 - Exploratory Data Analysis and Software Engineering Best Practices
  • Week 2 - Statistical Inference, Bayesian Methods, A/B Testing, Multi-Armed Bandit
  • Week 3 - Regression, Regularization, Gradient Descent
  • Week 4 - Supervised Machine Learning: Classification, Validation, Ensemble Methods
  • Week 5 - Clustering, Topic Modeling (NMF, LDA), NLP
  • Week 6 - Network Analysis, Matrix Factorization, and Time Series
  • Week 7 - Hadoop, Hive, and MapReduce
  • Week 8 - Data Visualization with D3.js, Data Products, and Fraud Detection Case Study
  • Weeks 9-10 - Capstone Projects
  • Week 12 - Onsite Interviews

The focus is on machine learning, big data tools, and some statistics, with no OR or optimization. The situation is no different in other data science curricula (e.g., Insight and Metis).

My personal belief is that a well-rounded (senior) data scientist should have some OR skills; it is certainly something I often test for in my interviews. Following are my recommendations for useful OR tools (I've used them all over the years):

  1. Lagrange multipliers and the more general Karush–Kuhn–Tucker conditions (which are mostly theoretical but serve as a good foundation for understanding constrained optimization over smooth functions).
  2. Linear programming and the much harder (as in NP-hard) integer programming.
  3. Flow networks and the max-flow/min-cut algorithm (these can be reduced to linear programming but are still an important concept).
  4. Multi-objective optimization.
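To make item 3 concrete, here is a minimal sketch of max-flow/min-cut in plain Python. It is not from the original post: it uses the standard Edmonds-Karp method (shortest augmenting paths found by breadth-first search), and the function name `max_flow_min_cut` and the toy network are assumptions made purely for illustration.

```python
from collections import deque


def max_flow_min_cut(capacity, source, sink):
    """Edmonds-Karp sketch (hypothetical helper, for illustration only).

    capacity: dict of dicts, capacity[u][v] = capacity of edge u -> v.
    Returns (max flow value, set of nodes on the source side of a min cut).
    """
    # Residual graph: forward edges carry the capacity, reverse edges start at 0.
    residual = {source: {}, sink: {}}
    for u, edges in capacity.items():
        residual.setdefault(u, {})
        for v, c in edges.items():
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)

    def reachable_from_source():
        # BFS over edges with spare residual capacity; records each node's parent.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable_from_source()
        if sink not in parent:
            # No augmenting path is left, so the flow is maximal and the
            # reachable nodes form the source side of a minimum cut.
            return flow, set(parent)
        # Find the bottleneck capacity along the path source -> ... -> sink.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the bottleneck amount along the path and update reverse edges.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


# Toy network (made-up capacities).
capacity = {
    "s": {"a": 4, "b": 2},
    "a": {"b": 1, "t": 2},
    "b": {"t": 3},
    "t": {},
}
flow, source_side = max_flow_min_cut(capacity, "s", "t")
# Capacity of the cut: edges leaving the source side.
cut_capacity = sum(
    c
    for u in source_side
    for v, c in capacity.get(u, {}).items()
    if v not in source_side
)
print(flow, sorted(source_side), cut_capacity)  # 5 ['a', 's'] 5
```

The flow value equals the capacity of the cut, which is the max-flow/min-cut theorem in action. In practice one would more likely reach for an existing graph or LP library than hand-roll this.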

(re-posted from my new-ish blog; see link in my profile)

And this is why CPLEX is part of DSX, the IBM Data Science platform.

Peter Cacioppi

Customized Analytics for Supply Chain and Optimization

8 years ago

I think part of the problem is that there aren't sufficient open-source libraries focused on the needs of optimization models. Optimization nearly always requires very clean data that conforms to a specific schema. This is different from ML, whose algorithms tend to work on tables of arbitrary schema and whose techniques often incorporate strategies for basic data cleaning. I've built the ticdat package and deployed it on pypi to address this need.

Vince O'Neill

Analytics Leader | Bridging the divide between business decisions and analytics

8 years ago

Over the course of my career, the most impactful solutions I've created all included some element of optimization. Like you, I've been surprised that it hasn't really been part of the data science playbook.

Nasir Hameed Khan,

SAP |Supply Chain | Data Analytics Professional

8 years ago

Correction: the article is by Uri Weiss.

Nasir Hameed Khan,

SAP |Supply Chain | Data Analytics Professional

8 years ago

Hi Bill, you are right about the Analytics Edge course. Add to that the MIT supply chain courses related to CTL.SC2x, which use linear and mixed-integer linear programming for network design, SOP, and procurement. Also, the course CTL.SC0x has used MILPs and LPs. Indeed, this is a very good article by Prof. Watson.
