The Tutorial I Wish I Had If I Was Starting Data Science From Scratch
Cidália Eusébio
Research Fellow @ London South Bank University | NHS Clinical Entrepreneur'24 | Nurse, Public Health Specialist, Data Science
Starting out in data science can be overwhelming due to the volume of information available, from books and online resources to various tools and technologies.
The key is to understand how to start and which tools are most helpful for your area of work.
When I began my journey, I, too, struggled to figure out what to focus on and how to build on my existing knowledge. The lack of a clear framework made it difficult to know where to start or what I truly needed to learn.
That's why I’ve created this tutorial, designed for those just stepping into the field, to help demystify data science. Here, you’ll explore what data science really is, why it’s important, and how it works in practice. I'll introduce you to some of the most useful tools and technologies that data scientists use daily and guide you on where to find datasets for practice.
There are plenty of opportunities to improve your learning for free, like attending bootcamps and workshops. This tutorial is a step-by-step guide to help you find your footing, see if you like it and build a strong foundation in data science, just like I wish I had when I first started.
What is Data Science?
Data science is often referred to as the “sexy” field of statistics because of its broad applications and exciting potential. But at its core, data science is an interdisciplinary field that merges statistics, computer science, and the specific domain expertise you may already have, whether that’s finance, healthcare, economics, or any other field.
The goal is to extract meaningful insights from data and use those insights to solve complex problems and drive informed decision-making.
The process involves several key steps: collecting, cleaning, analyzing, visualizing, and modeling data. These stages enable us to take raw data and transform it into actionable knowledge. While you may have worked with other tools like Stata, SPSS, or even Excel—tools that offer some analytical capabilities—data science tools go beyond what these packages can do. Data science gives you a much wider range of possibilities, allowing you to tackle deeper and more complex problems that traditional software solutions may struggle with.
Key Components:
Data Collection: Gathering data from various sources such as databases, APIs, or web scraping.
Data Cleaning: Handling missing values, removing duplicates, and correcting errors to prepare data for analysis.
Data Analysis: Using statistical methods to explore and understand data patterns and relationships.
Data Visualisation: Representing data through charts, graphs, and dashboards to communicate insights clearly.
Modelling: Building machine learning models to make predictions or classify information based on data.
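The components above can be sketched end to end in a few lines of Python. This is only a minimal illustration, assuming pandas and NumPy are installed. The patient table is made up for the example (it is not a real dataset), the column names are hypothetical, and a simple straight-line fit stands in for the modelling step. Visualisation is left out to keep the sketch short.

```python
import io

import numpy as np
import pandas as pd

# Made-up sample data for illustration only (not a real dataset).
RAW_CSV = """patient_id,age,systolic_bp
1,34,118
2,51,131
2,51,131
3,,125
4,67,146
5,45,128
"""

# 1. Collection: read the data (here from an in-memory string
#    instead of a database, API, or downloaded file).
raw = pd.read_csv(io.StringIO(RAW_CSV))

# 2. Cleaning: drop exact duplicate rows, then fill the missing age
#    with the median of the known ages.
clean = raw.drop_duplicates().copy()
clean["age"] = clean["age"].fillna(clean["age"].median())

# 3. Analysis: summary statistics and the correlation between columns.
summary = clean[["age", "systolic_bp"]].describe()
correlation = clean["age"].corr(clean["systolic_bp"])

# 4. Modelling: least-squares straight line predicting blood pressure from age.
slope, intercept = np.polyfit(clean["age"], clean["systolic_bp"], deg=1)

print(summary)
print(f"correlation: {correlation:.2f}")
print(f"systolic_bp ~ {slope:.2f} * age + {intercept:.2f}")
```

In a real project each step is usually much longer, but the order stays the same: get the data, clean it, explore it, and only then model it.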
The Role of a Data Scientist
Think of a data scientist as a storyteller. Their role is to uncover and communicate the story hidden within the data. They start by identifying the problem that needs solving and then work at the intersection of programming, statistics, and domain expertise. By doing this, they can analyze and make sense of complex data, develop predictive models and algorithms, and, most importantly, communicate the findings clearly to stakeholders.
It's not just about crunching numbers; it's about crafting a narrative that explains what the data is revealing. A data scientist translates these insights into actionable, data-driven solutions that address real business challenges.
This storytelling aspect is what makes data science so powerful—it turns raw data into knowledge that can drive strategic decision-making.
Real-World Applications and Impact
One of the biggest advantages of learning data science is its versatility. Whether you have a background in finance, economics, marketing, healthcare, social science, or any other field, data science gives you the tools to apply your domain knowledge in new and powerful ways. With data science, you can explore endless possibilities—whether it’s for personal projects like investing, understanding housing market trends, or for more professional purposes like developing business strategies based on solid data.
The applications of data science stretch across almost every industry. Here are a few examples:
Healthcare: Predicting patient outcomes, personalised medicine, and optimising hospital operations.
Finance: Fraud detection, risk management, and algorithmic trading.
Marketing: Customer segmentation, recommendation systems, and campaign optimization.
Social Science: Analysing social trends, sentiment analysis, and policy evaluation.
The beauty of this skill set is that it’s transferable—once you know how to use data science, you can switch between industries with ease.
So if you ever get tired of working in one sector, you can pivot to another and apply the same skills in a completely new context. It’s this flexibility that makes data science such an appealing and valuable field to master. Understanding these applications helps you see the tangible benefits of data science and motivates learning.
Overview of Tools and Technologies
To do data science, you'll need a few key components: software, programming languages, and libraries.
The software acts as your workspace, where you can write and run your code. Some popular options include Visual Studio Code, Anaconda, or even the integrated software on platforms like Kaggle, which provides a built-in environment for data science projects.
Once you have the software, the next step is to choose a programming language. In data science, the most commonly used languages are Python and R. SQL is often used for data extraction, but Python and R are the go-to languages for data manipulation, analysis, and modeling.
Within those programming languages, you'll use specific libraries that are designed to handle different tasks in data science. For Python, libraries like Pandas and NumPy are essential for data manipulation, allowing you to clean and organize your data. To visualize your data, libraries like Matplotlib and Seaborn are invaluable. When it comes to building predictive models and making predictions, libraries like Scikit-learn and TensorFlow come into play.
Just like in any project, you pick the right tools for the right task, and these libraries are what help you execute different steps in the data science process efficiently.
Programming Languages
Libraries and Frameworks
Opportunities for Training
When it comes to training in data science, there are two main approaches. One option is the free route, such as government-funded bootcamps in the UK, which provide hands-on project experience and teach technical skills from the ground up. These bootcamps are fantastic for beginners, but in my experience, they often miss the critical thinking element necessary for making sure you're asking the right questions, using the best techniques, and interpreting data accurately to achieve meaningful results. It’s not just about learning how to build models; it’s about understanding why you're building them and ensuring they’re both efficient and relevant to the problem you're solving.
To complement this, I highly recommend looking into specialized books, particularly those published by O'Reilly. Hands-On Machine Learning, for example, was extremely helpful for me in both my early and my more complex projects. It provides solid frameworks that guide you through key steps like cleaning, handling, and preprocessing data, ensuring the results you obtain are as accurate and reliable as possible.
For additional practice, platforms like Kaggle are another great free resource. Kaggle offers numerous real-world projects and publicly available datasets to help you build your portfolio. The UK also provides a wealth of publicly available data, such as population statistics, public health data, and healthcare data, which can be very useful for hands-on practice.
On the other hand, the paid route involves pursuing formal education like a Master's in Machine Learning, Artificial Intelligence, or Data Science, or even a PhD for those looking to specialize further. Both paths provide excellent opportunities to develop your skills and advance your career.
Lastly, don’t forget about networking. In cities like London, there are many events where you can meet people already working in the industry, expand your network, and find opportunities. For online courses, I also suggest Coursera and edX. One of the best courses I’ve taken is the Machine Learning course from DeepLearning.AI. These platforms offer a variety of courses, from data science and mathematics to more advanced, specialised topics. They’re affordable, and the quality of content makes them a great return on investment.
How to get started from scratch?
The key to learning data science is simply getting started. If you already have expertise in a specific field, such as economics, healthcare, marketing, music, or real estate, use that domain knowledge to your advantage. Look for free datasets related to your field and start working on projects. Since you are familiar with the industry, you'll likely already have an idea of the kinds of questions to ask, whether from articles you’ve read or from everyday problems you’ve encountered.
Basic knowledge of statistics is also very helpful, as it makes it easier to understand which tools to use and how to apply descriptive statistics to your data. But just like anything else, data science is a skill that can be learned with consistent practice, and there are plenty of resources and courses online to help. Tools like ChatGPT can also be helpful for quick learning and asking questions on the go as well as to help with coding, but remember to double-check the information you find using reliable books, expert websites, or specialized YouTube channels.
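To make the idea of descriptive statistics concrete, here is a small sketch using only Python's built-in statistics module, so nothing extra needs to be installed. The rent figures are made up for illustration and the variable names are my own.

```python
import statistics

# Made-up monthly rent figures (in pounds) for illustration only.
rents = [950, 1100, 1020, 1250, 990, 1100, 2400]

mean_rent = statistics.mean(rents)      # arithmetic average
median_rent = statistics.median(rents)  # middle value once sorted
mode_rent = statistics.mode(rents)      # most frequent value
stdev_rent = statistics.stdev(rents)    # sample standard deviation

print(f"mean:   {mean_rent:.2f}")
print(f"median: {median_rent}")
print(f"mode:   {mode_rent}")
print(f"stdev:  {stdev_rent:.2f}")
```

Notice that the single very high rent pulls the mean well above the median. Spotting that kind of difference is exactly where a little statistics knowledge helps you decide which summary to trust.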
Starting out is simple: you just need to take the first step. Below, you’ll find a few initial actions to take today, along with project ideas to help you get familiar with the process. You’ll learn how to install the right software, set up the necessary libraries, and access publicly available datasets for your projects.
Setting Up Your Computer for the Workshop
Programming Environment
Choose one of the following software options:
Anaconda:
Install Anaconda (https://www.anaconda.com/products/individual) to get Python/R and essential packages. Once installed, open Jupyter.
Jupyter Notebooks: Use Jupyter for an interactive coding experience within your browser. Anaconda already includes many common libraries; any library that is missing needs to be installed once before you can use it, for example from a terminal:
pip install numpy
Today, you can install libraries like numpy, pandas, and matplotlib. Always remember to import the libraries in each notebook you use, for example:
import numpy as np
You’re now ready to go! Use cheat sheets to help you navigate the code.
or
Visual Studio Code:
The setup is similar, but you need to install Python, Jupyter, and R separately. The steps follow the same pattern. Download it from https://code.visualstudio.com
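Whichever option you choose, a quick check like the one below can tell you whether the libraries are available before you start. It is a small sketch using only Python's standard library. The helper name check_libraries and the list of libraries are my own choices for illustration, so adjust them to your project.

```python
import importlib.util

# Libraries mentioned in this tutorial; edit the list to match your own project.
LIBRARIES = ["numpy", "pandas", "matplotlib"]


def check_libraries(names):
    """Return a dict mapping each library name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_libraries(LIBRARIES)
for name, installed in status.items():
    hint = "installed" if installed else f"missing, try: pip install {name}"
    print(f"{name}: {hint}")
```

Run it in a notebook cell or as a script. Anything reported as missing can be installed with pip and then checked again.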
Access to Publicly Available Data
Create a Kaggle Account and Explore Datasets
Titanic: Machine Learning from Disaster (https://www.kaggle.com/c/titanic): This classic beginner project will help you get started with classification models and predictive analytics. It’s useful across multiple fields and introduces the basics of machine learning.
Diabetes Health Indicators Dataset (https://www.kaggle.com/alexteboul/diabetes-health-indicators-dataset): Perfect for those in healthcare, this dataset allows you to explore risk factors for diabetes and practice classification techniques.
Retail Sales Forecasting (https://www.kaggle.com/competitions/demand-forecasting-kernels-only): For those interested in marketing and sales, this project will allow you to predict sales for a retail company, helping you practice time-series analysis.
Start a New Notebook: Once you select a dataset, click on "New Notebook" to start coding directly in the browser with Python or R. Kaggle Notebooks come pre-installed with the necessary libraries, making it easy to get started.
Learn from Others: Explore public notebooks shared by other users to see different approaches and participate in discussions to ask questions or share insights. This is a great way to see real-world problem-solving in action.
Document and Share: As you work through your analysis, document your findings. Once you’re satisfied, share your notebook publicly on Kaggle to showcase your work to potential employers and the broader data science community.
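As a rough idea of what the first cells of a new notebook might look like, here is a minimal sketch, assuming pandas is available (it is in Kaggle Notebooks). The rows below are made up and only mimic a few columns of the Titanic training file. They are not real passenger records, and in a real notebook you would read the dataset's own CSV file instead of this in-memory sample.

```python
import io

import pandas as pd

# Made-up rows that mimic a few columns of Kaggle's Titanic train.csv.
# They are NOT real passenger records; replace this with the real file in a notebook.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,0,3,female,
4,1,2,male,35
5,1,1,female,29
6,0,3,male,
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# First look: the top rows and the number of missing values per column.
print(df.head())
missing_per_column = df.isna().sum()
print(missing_per_column)

# A first question to ask of the data: how does the survival rate differ by sex?
survival_by_sex = df.groupby("Sex")["Survived"].mean()
print(survival_by_sex)
```

Looking at the first rows, counting missing values, and comparing one simple group average is a reasonable opening routine for almost any new dataset, not just this one.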
Now that you have a broad understanding of what data science is and how to get started, it’s time to take that first step—just start! I hope you find this tutorial useful, whether you're following it during our current bootcamp or if you’ve come across it on your own. Data science is a skill that grows with practice, and the journey becomes more rewarding the further you go.
Remember, consistency is the key. Keep exploring new concepts, learning from various resources, and building projects. With dedication and curiosity, you'll find yourself advancing quickly in the field of data science. Enjoy the process, and I hope this guide has provided the direction you need to kickstart your journey!
Photos by Pexels