Data Science: Q&A
I was kindly asked by Prof. Roberto Zicari to answer a few questions on Data Science and Big Data for www.odbms.org. Let me know what you think of it; I am looking forward to your feedback in the comments below. Cheers, Natalino
Q1. Is domain knowledge necessary for a data scientist?
It’s not strictly necessary, but it does not hurt either. You can produce accurate models without having to understand the domain. However, some domain knowledge will speed up the process of selecting relevant features and will provide better context for knowledge discovery in the available datasets.
Q2. What should every data scientist know about machine learning?
First of all, the foundations: statistics, algebra, and calculus. Vector, matrix, and tensor math is absolutely a must. Let’s not forget that datasets, after all, can be handled as large matrices! Moving on to machine learning specifically: a good understanding of the role of bias and variance in predictive models; an understanding of the reasons for model and parameter regularization; model cross-validation techniques; data bootstrapping and bagging. I also believe that cost-based, iterative gradient optimization methods are a must, as they implement the “learning” for four very powerful classes of machine learning algorithms: GLMs, boosted trees, SVMs and kernel methods, and neural networks. Last but not least, an introduction to Bayesian statistics. I know it’s quite a list, but hey, data scientists are not made in a day.
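To make the “learning via cost-based gradient optimization” point concrete, here is a minimal sketch of gradient descent for logistic regression (a GLM) with L2 regularization. The toy dataset, learning rate, and regularization strength are illustrative assumptions, not part of the original answer.

```python
# A minimal sketch: gradient descent on the regularized log-loss of a
# logistic regression model. Hyperparameters here are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, lam=0.01, epochs=1000):
    """Learn weights by iteratively descending the regularized log-loss."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        p = sigmoid(X @ w)                    # predicted probabilities
        grad = X.T @ (p - y) / n + lam * w    # log-loss gradient + L2 term
        w -= lr * grad                        # gradient descent step
    return w

# Toy example: two Gaussian blobs, labels 0 and 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.repeat([0, 1], 50)
print("learned weights:", fit_logistic(X, y))
```

The same iterative recipe, with a different cost function and gradient, underlies boosted trees, SVMs, and neural networks.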
Q3. What are the most effective machine learning algorithms?
Regularized Generalized Linear Models, and their further generalization as Artificial Neural Networks (ANNs), as well as boosted trees and random forests. I am also very interested in dimensionality reduction and unsupervised machine learning algorithms, such as t-SNE, OPTICS, and TDA.
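As a small illustration of two of these families, the sketch below trains a random forest and computes a 2-D t-SNE embedding. The choice of scikit-learn and the digits dataset is my assumption; the answer names no specific library.

```python
# A hedged sketch: one supervised and one unsupervised algorithm
# from the families mentioned above, on an illustrative dataset.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import TSNE
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

# Random forest: an effective off-the-shelf supervised learner.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
print("CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())

# t-SNE: unsupervised dimensionality reduction to 2-D for visualization.
emb = TSNE(n_components=2, random_state=0).fit_transform(X)
print("embedding shape:", emb.shape)
```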
Q4. What is your experience with data blending?
Blending data from different domains and sources may increase the explanatory power of a model. However, it’s not always easy to determine beforehand whether this data will improve the model. Data blending provides more features, and they may or may not be correlated with what you wish to predict. It’s therefore very important to carefully validate the trained model on the augmented dataset, using cross-validation and other statistical methods such as variance analysis.
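A minimal sketch of that validation step, assuming scikit-learn and synthetic data: compare cross-validated scores before and after augmenting the dataset with blended features, and keep the blend only if the score holds up.

```python
# Hypothetical example: the "blended" features here are pure noise,
# so the augmented score should not improve over the base score.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 500
X_base = rng.normal(size=(n, 5))              # original features
y = (X_base[:, 0] + rng.normal(size=n) > 0).astype(int)
X_blend = rng.normal(size=(n, 3))             # blended-in external features
X_aug = np.hstack([X_base, X_blend])          # augmented dataset

model = LogisticRegression(max_iter=1000)
print("base :", cross_val_score(model, X_base, y, cv=5).mean())
print("blend:", cross_val_score(model, X_aug, y, cv=5).mean())
```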
Q5. Predictive Modeling: How can you perform accurate feature engineering/extraction?
Let’s tackle feature extraction and feature engineering separately. Extraction can be as simple as getting a number of fields from a database table, and as complicated as extracting information from a scanned paper document using OCR and image processing techniques. Feature extraction can easily be the hardest task in a given data science engagement.
Extracting the right features and raw data fields usually requires a good understanding of the organization, the processes, and the physical/digital data building blocks deployed in a given enterprise. It’s a task which should never be underestimated, as a predictive model is usually only as good as the data used to train it.
After extraction, there comes feature engineering. This step consists of a number of data transformations, oftentimes dictated by a combination of intuition, data exploration, and domain knowledge. Engineered features are usually added to the original samples’ features and provided as the input data to the model.
Before the renaissance of neural networks and hierarchical machine learning, feature engineering was essential, as the models were too shallow to properly transform the input data within the model itself. For instance, decision trees can only split the data space along the feature axes; therefore, to correctly classify donut-shaped classes you will need feature engineering to transform the space to polar coordinates.
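To illustrate the donut example, here is a small sketch (synthetic data, scikit-learn assumed): a shallow decision tree struggles on raw Cartesian coordinates, but separates the two rings with a single split once the polar radius is added as an engineered feature.

```python
# Two concentric noisy rings: axis-aligned splits on (x, y) do poorly,
# while the engineered radius feature makes the classes separable.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 400
theta = rng.uniform(0, 2 * np.pi, n)
r = np.where(np.arange(n) < n // 2, 1.0, 3.0) + rng.normal(0, 0.2, n)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = (np.arange(n) >= n // 2).astype(int)       # inner ring = 0, outer = 1

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
print("raw x,y    :", cross_val_score(tree, X, y, cv=5).mean())

radius = np.hypot(X[:, 0], X[:, 1]).reshape(-1, 1)   # engineered feature
X_polar = np.hstack([X, radius])
print("with radius:", cross_val_score(tree, X_polar, y, cv=5).mean())
```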
In recent years, however, models usually have multiple layers, as machine learning experts are deploying increasingly “deeper” models. Those models can usually “embed” feature engineering as part of the internal state representation of the data, rendering manual feature engineering less relevant. For some examples applied to text, check the section “Visualizing the predictions and the ‘neuron’ firings in the RNN” in The Unreasonable Effectiveness of Recurrent Neural Networks. These models are also usually referred to as “end-to-end” learning, although this definition is still vague and not unanimously accepted in the AI and Data Science communities.
So what about feature engineering today? Personally, I do believe that some feature engineering is still relevant to building good predictive systems, but it should not be overdone, as many features can now be learned by the model itself, especially in the audio, video, text, and speech domains.
I have answered 10 more questions. Interested? Keep reading on:
https://www.natalinobusa.com/2017/02/data-science-q-natalino-busa.html