Operational Research: important tools for data scientists
When you read about data science, the focus is mostly on predicting values (i.e., extrapolating or interpolating the values of an unknown function). However, almost every important data science project at Agoda contains some large constrained optimization problem that needs to be solved in a big data environment. In academia, these types of problems are studied under the discipline of Operational Research (OR), so one would (naively?) expect to see OR topics mentioned in data-science-focused CVs; alas, among the hundreds of resumes I reviewed in the last year, only a handful mentioned optimization skills explicitly. Why? I'm not sure, really, but here is a curriculum I pulled from the Galvanize data science program:
- Week 1 - Exploratory Data Analysis and Software Engineering Best Practices
- Week 2 - Statistical Inference, Bayesian Methods, A/B Testing, Multi-Armed Bandit
- Week 3 - Regression, Regularization, Gradient Descent
- Week 4 - Supervised Machine Learning: Classification, Validation, Ensemble Methods
- Week 5 - Clustering, Topic Modeling (NMF, LDA), NLP
- Week 6 - Network Analysis, Matrix Factorization, and Time Series
- Week 7 - Hadoop, Hive, and MapReduce
- Week 8 - Data Visualization with D3.js, Data Products, and Fraud Detection Case Study
- Weeks 9-10 - Capstone Projects
- Week 12 - Onsite Interviews
The focus is on machine learning, big data tools, and some statistics; there is no OR/optimization. The situation is no different in other data science curricula (e.g., Insight and Metis).
My personal belief is that a well-rounded (senior) data scientist should have some OR skills; it is certainly something I often test for in my interviews. Following are my recommendations for useful OR tools (I've used them all throughout the years), with a short illustrative sketch for each after the list:
- Lagrange multipliers and the more general Karush–Kuhn–Tucker (KKT) conditions (mostly theoretical, but a good foundation for understanding constrained optimization over smooth functions).
- Linear programming and the much harder integer programming (whose decision version is NP-complete).
- Flow networks and max-flow/min-cut algorithms (these can be reduced to linear programming, but they remain an important concept in their own right).
- Multi-objective optimization.
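To make these concrete, here are short Python sketches, one per tool. They are illustrations under stated assumptions (SciPy and NetworkX installed; the toy problems are my own, not from the original post), not production recipes. First, Lagrange multipliers/KKT on the classic problem of maximizing x*y subject to x + y = 10: the Lagrangian L(x, y, lam) = x*y - lam*(x + y - 10) gives stationarity conditions x = lam and y = lam, so x = y = 5. SciPy's SLSQP solver, a sequential quadratic programming method that enforces the KKT conditions numerically, should recover the same point.

```python
# Sketch: maximize x*y subject to x + y = 10 (assumes SciPy is installed).
# The Lagrangian stationarity conditions give x = y = 5; SLSQP enforces
# the KKT conditions numerically and should recover the same point.
from scipy.optimize import minimize

objective = lambda v: -(v[0] * v[1])  # negate because scipy minimizes
constraints = [{"type": "eq", "fun": lambda v: v[0] + v[1] - 10}]

result = minimize(objective, x0=[1.0, 1.0], method="SLSQP",
                  constraints=constraints)
print(result.x)  # approximately [5. 5.]
```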
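For linear programming, a minimal sketch with scipy.optimize.linprog on a made-up LP: maximize 3x + 2y subject to x + y <= 4, x + 3y <= 6, x, y >= 0. An integer program has the same model shape, but you would hand it to a branch-and-bound solver (e.g., scipy.optimize.milp or PuLP) rather than a pure LP solver.

```python
# Sketch: a toy LP solved with scipy.optimize.linprog (assumes SciPy).
# maximize 3x + 2y  s.t.  x + y <= 4,  x + 3y <= 6,  x >= 0, y >= 0
from scipy.optimize import linprog

c = [-3, -2]                    # negate: linprog minimizes c @ x
A_ub = [[1, 1], [1, 3]]         # left-hand sides of the <= constraints
b_ub = [4, 6]                   # right-hand sides

res = linprog(c, A_ub=A_ub, b_ub=b_ub)  # default bounds are x >= 0
print(res.x, -res.fun)          # optimum at (4, 0) with objective 12
```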
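For flow networks, a sketch using NetworkX's maximum_flow and minimum_cut on a small hypothetical graph; the two values coincide, which is exactly the max-flow/min-cut theorem.

```python
# Sketch: max flow and min cut with NetworkX (assumes networkx installed)
# on a small made-up graph; the two values agree by max-flow/min-cut.
import networkx as nx

G = nx.DiGraph()
G.add_edge("s", "a", capacity=3)
G.add_edge("s", "b", capacity=2)
G.add_edge("a", "b", capacity=1)
G.add_edge("a", "t", capacity=2)
G.add_edge("b", "t", capacity=3)

flow_value, flow_per_edge = nx.maximum_flow(G, "s", "t")
cut_value, (source_side, sink_side) = nx.minimum_cut(G, "s", "t")
print(flow_value, cut_value)  # both 5
```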
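Finally, for multi-objective optimization, one common approach is weighted-sum scalarization, sketched here on two made-up conflicting objectives: sweeping the weight w between the objectives traces out points on the Pareto frontier.

```python
# Sketch: weighted-sum scalarization for two conflicting (made-up)
# objectives; sweeping the weight w traces points on the Pareto frontier.
import numpy as np
from scipy.optimize import minimize_scalar

f1 = lambda x: x ** 2          # best at x = 0
f2 = lambda x: (x - 2) ** 2    # best at x = 2

for w in np.linspace(0.0, 1.0, 5):
    res = minimize_scalar(lambda x, w=w: w * f1(x) + (1 - w) * f2(x))
    print(f"w={w:.2f}  x={res.x:.2f}  f1={f1(res.x):.2f}  f2={f2(res.x):.2f}")
```

Note that weighted sums only reach the convex part of the frontier; for non-convex frontiers, methods such as epsilon-constraint are the usual alternative.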
(re-posted from my new-ish blog; see link in my profile)
Comments:

- And this is why CPLEX is part of DSX, the IBM Data Science platform.
- [Customized Analytics for Supply Chain and Optimization] I think part of the problem is that there aren't sufficient open-source libraries focused on the needs of optimization models. Optimization nearly always requires very clean data that conforms to a specific schema. This is different from ML, whose algorithms tend to work on tables of arbitrary schema and whose techniques often incorporate strategies for basic data cleaning. I've built the ticdat package and deployed it on PyPI to address this need.
- [Analytics Leader | Bridging the divide between business decisions and analytics] Over the course of my career, the most impactful solutions I've created have all included some element of optimization. Like you, I've been surprised that it hasn't really been part of the data science playbook.
- [SAP | Supply Chain | Data Analytics Professional] Correction: the article is by Uri Weiss.
- [SAP | Supply Chain | Data Analytics Professional] Hi Bill, you are right about the Analytics Edge course; add to that the MIT supply chain courses related to CTL.SC2x, which use linear and mixed integer linear programming for network design, SOP, and procurement. Also, the course CTL.SC0x has used MILPs and LPs. Indeed, this is a very good article by Prof. Watson.