Data Science: Q&A
I was kindly asked by Prof. Roberto Zicari to answer a few questions on Data Science and Big Data for www.odbms.org. Let me know what you think of it; I am looking forward to your feedback in the comments below. Cheers, Natalino
Q1. Is domain knowledge necessary for a data scientist?
It’s not strictly necessary, but it does not hurt either. You can produce accurate models without having to understand the domain. However, some domain knowledge will speed up the process of selecting relevant features and will provide better context for knowledge discovery in the available datasets.
Q2. What should every data scientist know about machine learning?
First of all, the foundations: statistics, algebra, and calculus. Vector, matrix, and tensor math is absolutely a must. Let’s not forget that datasets, after all, can be handled as large matrices! Moving on to machine learning specifically: a good understanding of the role of bias and variance in predictive models; an understanding of the reasons for model and parameter regularization; model cross-validation techniques; data bootstrapping and bagging. I also believe that cost-based, iterative gradient optimization methods are a must, as they implement the “learning” for four very powerful classes of machine learning algorithms: GLMs, boosted trees, SVMs and kernel methods, and neural networks. Last but not least, an introduction to Bayesian statistics. I know it’s quite a list, but hey, data scientists are not made in a day.
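To make the “learning via cost-based gradient optimization” point concrete, here is a minimal sketch of gradient descent for logistic regression (a GLM) with L2 regularization. The toy dataset, learning rate, and regularization strength are illustrative assumptions, not part of the original answer.

```python
# A minimal sketch: gradient descent on the regularized log-loss of a
# logistic regression model. Hyperparameters here are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, lam=0.01, epochs=1000):
    """Learn weights by iteratively descending the regularized log-loss."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        p = sigmoid(X @ w)                    # predicted probabilities
        grad = X.T @ (p - y) / n + lam * w    # log-loss gradient + L2 term
        w -= lr * grad                        # gradient descent step
    return w

# Toy example: two Gaussian blobs, labels 0 and 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.repeat([0, 1], 50)
print("learned weights:", fit_logistic(X, y))
```

The same iterative recipe, with a different cost function and gradient, underlies boosted trees, SVMs, and neural networks.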
Q3. What are the most effective machine learning algorithms?
Regularized Generalized Linear Models, and their further generalization as Artificial Neural Networks (ANNs), as well as boosted trees and random forests. I am also very interested in dimensionality reduction and unsupervised machine learning algorithms, such as t-SNE, OPTICS, and TDA.
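As a small illustration of two of these families, the sketch below trains a random forest and computes a 2-D t-SNE embedding. The choice of scikit-learn and the digits dataset is my assumption; the answer names no specific library.

```python
# A hedged sketch: one supervised and one unsupervised algorithm
# from the families mentioned above, on an illustrative dataset.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import TSNE
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

# Random forest: an effective off-the-shelf supervised learner.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
print("CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())

# t-SNE: unsupervised dimensionality reduction to 2-D for visualization.
emb = TSNE(n_components=2, random_state=0).fit_transform(X)
print("embedding shape:", emb.shape)
```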
Q4. What is your experience with data blending?
Blending data from different domains and sources may increase the explanatory power of a model. However, it’s not always easy to determine beforehand whether this data will improve the model. Data blending provides more features, and they may or may not be correlated with what you wish to predict. It’s therefore very important to carefully validate the trained model on the augmented dataset, using cross-validation and other statistical methods such as variance analysis.
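A minimal sketch of that validation step, assuming scikit-learn and synthetic data: compare cross-validated scores before and after augmenting the dataset with blended features, and keep the blend only if the score holds up.

```python
# Hypothetical example: the "blended" features here are pure noise,
# so the augmented score should not improve over the base score.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 500
X_base = rng.normal(size=(n, 5))              # original features
y = (X_base[:, 0] + rng.normal(size=n) > 0).astype(int)
X_blend = rng.normal(size=(n, 3))             # blended-in external features
X_aug = np.hstack([X_base, X_blend])          # augmented dataset

model = LogisticRegression(max_iter=1000)
print("base :", cross_val_score(model, X_base, y, cv=5).mean())
print("blend:", cross_val_score(model, X_aug, y, cv=5).mean())
```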
Q5. Predictive Modeling: How can you perform accurate feature engineering/extraction?
Let’s tackle feature extraction and feature engineering separately. Extraction can be as simple as getting a number of fields from a database table, and as complicated as extracting information from a scanned paper document using OCR and image processing techniques. Feature extraction can easily be the hardest task in a given data science engagement.
Extracting the right features and raw data fields usually requires a good understanding of the organization, the processes, and the physical/digital data building blocks deployed in a given enterprise. It’s a task which should never be underestimated, as a predictive model is usually only as good as the data used to train it.
After extraction, there comes feature engineering. This step consists of a number of data transformations, oftentimes dictated by a combination of intuition, data exploration, and domain knowledge. Engineered features are usually added to the original samples’ features and provided as the input data to the model.
Before the renaissance of neural networks and hierarchical machine learning, feature engineering was essential, as the models were too shallow to properly transform the input data within the model itself. For instance, decision trees can only split the data space along the feature axes; therefore, to correctly classify donut-shaped classes you will need feature engineering to transform the space to polar coordinates.
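To illustrate the donut example, here is a small sketch (synthetic data, scikit-learn assumed): a shallow decision tree struggles on raw Cartesian coordinates, but separates the two rings with a single split once the polar radius is added as an engineered feature.

```python
# Two concentric noisy rings: axis-aligned splits on (x, y) do poorly,
# while the engineered radius feature makes the classes separable.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 400
theta = rng.uniform(0, 2 * np.pi, n)
r = np.where(np.arange(n) < n // 2, 1.0, 3.0) + rng.normal(0, 0.2, n)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = (np.arange(n) >= n // 2).astype(int)       # inner ring = 0, outer = 1

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
print("raw x,y    :", cross_val_score(tree, X, y, cv=5).mean())

radius = np.hypot(X[:, 0], X[:, 1]).reshape(-1, 1)   # engineered feature
X_polar = np.hstack([X, radius])
print("with radius:", cross_val_score(tree, X_polar, y, cv=5).mean())
```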
In recent years, however, models usually have multiple layers, as machine learning experts are deploying increasingly “deeper” models. Those models can usually “embed” feature engineering as part of the internal state representation of the data, rendering manual feature engineering less relevant. For some examples applied to text, check the section “Visualizing the predictions and the ‘neuron’ firings in the RNN” in The Unreasonable Effectiveness of Recurrent Neural Networks. These models are also usually referred to as “end-to-end” learning, although this definition is still vague and not unanimously accepted in the AI and Data Science communities.
So what about feature engineering today? Personally, I do believe that some feature engineering is still relevant to building good predictive systems, but it should not be overdone, as many features can now be learned by the model itself, especially in the audio, video, text, and speech domains.
I have answered 10 more questions. Interested? Keep reading on:
https://www.natalinobusa.com/2017/02/data-science-q-natalino-busa.html