Data Science: Q&A
Aerial tramway La Grave France - https://commons.wikimedia.org/wiki/File%3AAerial_tramway_La_Grave_France.jpg By NielsB (Own work) [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

I was kindly asked by Prof. Roberto Zicari to answer a few questions on Data Science and Big Data for www.odbms.org. Let me know what you think; I look forward to your feedback in the comments below. Cheers, Natalino

Q1. Is domain knowledge necessary for a data scientist?

It’s not strictly necessary, but it does not hurt either. You can produce accurate models without understanding the domain. However, some domain knowledge will speed up the process of selecting relevant features and will provide better context for knowledge discovery in the available datasets.

Q2. What should every data scientist know about machine learning?

First of all, the foundations: statistics, algebra and calculus. Vector, matrix and tensor math is an absolute must. Let’s not forget that datasets can, after all, be handled as large matrices! Moving on to machine learning specifically: a good understanding of the role of bias and variance in predictive models, the reasons for model and parameter regularization, model cross-validation techniques, and data bootstrapping and bagging. I also believe that cost-based, iterative gradient optimization methods are a must, as they implement the “learning” for four very powerful classes of machine learning algorithms: GLMs, boosted trees, SVMs and kernel methods, and neural networks. Last but not least, an introduction to Bayesian statistics. I know it’s quite a list, but hey, data scientists are not made in a day.
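The shrinking effect of regularization mentioned above can be seen in a few lines of code. Here is a minimal numpy sketch (synthetic data, illustrative parameters only, not any specific production recipe) comparing ordinary least squares with closed-form ridge regression: the L2 penalty trades a little bias for lower variance by shrinking the weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: 20 samples, 10 features, only 3 informative.
n, p = 20, 10
X = rng.normal(size=(n, p))
true_w = np.zeros(p)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + 0.1 * rng.normal(size=n)

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^-1 X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_ols = ridge_fit(X, y, 0.0)     # lam = 0: plain least squares, higher variance
w_ridge = ridge_fit(X, y, 10.0)  # L2 penalty shrinks the weights toward zero

print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

In a real project the penalty strength would be chosen by cross-validation rather than fixed by hand.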

Q3. What are the most effective machine learning algorithms?

Regularized Generalized Linear Models, their further generalization as Artificial Neural Networks (ANNs), and Boosted and Random Forests. I am also very interested in dimensionality reduction and unsupervised machine learning algorithms, such as t-SNE, OPTICS, and TDA.
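As a taste of dimensionality reduction, here is a minimal numpy sketch of PCA via the singular value decomposition (synthetic data and parameters are illustrative; t-SNE and OPTICS would need a library such as scikit-learn). Data that mostly lives on a 2-D plane inside a 5-D space is projected onto its top two principal axes with almost no loss of variance.

```python
import numpy as np

rng = np.random.default_rng(1)

# 200 points lying mostly on a 2-D plane embedded in 5-D space.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

# PCA via SVD: center the data, then project onto the top-k right
# singular vectors (the principal axes).
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
X_reduced = Xc @ Vt[:k].T

# Fraction of total variance captured by each principal component.
explained = (S ** 2) / (S ** 2).sum()
print(X_reduced.shape, explained[:k].sum())
```

Unlike t-SNE, PCA is linear, but it illustrates the same idea: find a low-dimensional representation that preserves most of the structure in the data.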

Q4. What is your experience with data blending?

Blending data from different domains and sources can increase the explanatory power of a model. However, it’s not always easy to determine beforehand whether the extra data will improve the model. Data blending provides more features, and they may or may not be correlated with what you wish to predict. It’s therefore very important to carefully validate the trained model on the augmented dataset, using cross-validation and other statistical methods such as variance analysis.
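One way to validate a blend, as described above, is to compare cross-validated error with and without the extra features. Below is a minimal numpy sketch (synthetic data; names and helper functions are my own, illustrative choices): a target depends partly on an external data source, and k-fold cross-validation shows whether blending that source in actually reduces prediction error.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 200
x_base = rng.normal(size=(n, 3))          # features from the primary source
z = rng.normal(size=(n, 1))               # feature from a blended-in source
y = x_base @ np.array([1.0, -2.0, 0.5]) + 1.5 * z[:, 0] + rng.normal(size=n)

def cv_mse(X, y, k=5):
    """k-fold cross-validated mean squared error of least squares."""
    idx = np.arange(len(y))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errs.append(np.mean((X[fold] @ w - y[fold]) ** 2))
    return float(np.mean(errs))

mse_base = cv_mse(x_base, y)
mse_blended = cv_mse(np.hstack([x_base, z]), y)
print(mse_base, mse_blended)
```

Here the blended feature carries genuine signal, so the cross-validated error drops; had it been uncorrelated noise, the comparison would have exposed that just as clearly.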

Q5. Predictive Modeling: How can you perform accurate feature engineering/extraction?

Let’s tackle feature extraction and feature engineering separately. Extraction can be as simple as getting a number of fields from a database table, or as complicated as extracting information from a scanned paper document using OCR and image processing techniques. Feature extraction can easily be the hardest task in a given data science engagement.

Extracting the right features and raw data fields usually requires a good understanding of the organization, the processes, and the physical/digital data building blocks deployed in a given enterprise. It’s a task which should never be underestimated, as the predictive model is usually only as good as the data used to train it.

After extraction, there comes feature engineering. This step consists of a number of data transformations, oftentimes dictated by a combination of intuition, data exploration, and domain knowledge. Engineered features are usually added to the original samples’ features and provided as the input data to the model.

Before the renaissance of neural networks and hierarchical machine learning, feature engineering was essential, as the models were too shallow to properly transform the input data within the model itself. For instance, decision trees can only split the data space along the features’ axes, so to correctly classify donut-shaped classes you need feature engineering to transform the space to polar coordinates.
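The donut example above can be sketched in a few lines of numpy (synthetic data; the radii and threshold are illustrative choices). No single axis-aligned split on x or y separates a central blob from a ring around it, but one threshold on the engineered polar feature, the radius, separates the classes perfectly.

```python
import numpy as np

rng = np.random.default_rng(3)

# "Donut" data: class 0 near the origin, class 1 on a ring around it.
n = 300
r = np.concatenate([rng.uniform(0.0, 1.0, n),    # inner blob radii
                    rng.uniform(2.0, 3.0, n)])   # outer ring radii
theta = rng.uniform(0.0, 2.0 * np.pi, 2 * n)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Feature engineering: radius = sqrt(x^2 + y^2), i.e. the polar coordinate.
radius = np.sqrt((X ** 2).sum(axis=1))

# A single axis-aligned split on the NEW feature classifies everything.
pred = (radius > 1.5).astype(float)
accuracy = (pred == y).mean()
print(accuracy)
```

A depth-1 decision tree on the raw (x, y) coordinates could never do this; on the engineered radius feature it becomes trivial.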

In recent years, however, models usually have multiple layers, as machine learning experts are deploying increasingly “deeper” models. These models can usually “embed” feature engineering as part of their internal representation of the data, rendering manual feature engineering less relevant. For some examples applied to text, see the section “Visualizing the predictions and the ‘neuron’ firings in the RNN” in The Unreasonable Effectiveness of Recurrent Neural Networks. Such models are also usually referred to as “end-to-end” learning, although this definition is still vague and not unanimously accepted in the AI and Data Science communities.

So what about feature engineering today? Personally, I believe that some feature engineering is still relevant for building good predictive systems, but it should not be overdone, as many features can now be learned by the model itself, especially in the audio, video, text, and speech domains.

I have answered 10 more questions. Interested? Keep reading on:

https://www.natalinobusa.com/2017/02/data-science-q-natalino-busa.html

