SQL & Python Pandas: A Beginner's Tutorial Using the Titanic Dataset
Recently, I had a few discussions about how to get started with data analytics tools, and to address those questions, I’ve created this comprehensive tutorial for beginners. This guide covers the fundamentals of SQL and Python Pandas for data analysis, using the famous Titanic dataset to demonstrate common data operations in both tools. By the end of this tutorial, you'll have the knowledge to switch seamlessly between SQL and Python Pandas, taking advantage of their strengths in your data analysis projects.
Introduction
Let’s explore two essential tools for data analysis: SQL, the standard language for querying relational databases, and Python Pandas, a library for manipulating and analyzing data in memory.
This tutorial aims to bridge the gap between these two tools by demonstrating how common data operations can be performed in both SQL and Python Pandas using the Titanic dataset. Whether you're more comfortable with SQL or Python, this guide will help you harness the full potential of both.
Titanic dataset
The sinking of the Titanic on April 15, 1912, remains one of history's most tragic shipwrecks. Considered "unsinkable," the RMS Titanic collided with an iceberg during her maiden voyage, leading to the deaths of 1,502 of the 2,224 passengers and crew. Limited lifeboat availability and other factors meant that survival chances varied across different groups.
Download the Titanic dataset (it is freely available, for example from Kaggle's Titanic competition page or from several public mirrors). Save it as “titanic.csv” and place it in the project directory.
Titanic dataset explained: each row describes one passenger, with columns for PassengerId, Survived, Pclass (ticket class 1–3), Name, Sex, Age, SibSp (siblings/spouses aboard), Parch (parents/children aboard), Ticket, Fare, Cabin, and Embarked (port of embarkation: C, Q, or S).
Setting Up Your Environment
Before diving into data analysis, it's essential to set up your working environment.
A) Installing Python
To get started, ensure that Python is installed on your system.
python --version
You should see the installed Python version.
B) Setting Up a Virtual Environment
Using virtual environments is a best practice for managing dependencies and avoiding conflicts between projects.
Create a Virtual Environment:
python -m venv titanic_env
This command creates a new virtual environment named titanic_env.
Activate the Virtual Environment:
Windows:
titanic_env\Scripts\activate
macOS/Linux:
source titanic_env/bin/activate
Note: After activation, your terminal prompt will be prefixed with (titanic_env).
Deactivate the Virtual Environment:
deactivate
C) Installing Necessary Libraries
With your virtual environment activated, install the required Python libraries using pip.
Upgrade pip:
python -m pip install --upgrade pip
Install Pandas and Other Libraries:
pip install pandas numpy matplotlib seaborn
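As a quick sanity check (assuming the virtual environment is still active), you can confirm the libraries installed correctly by printing their versions:
python -c "import pandas, numpy; print(pandas.__version__, numpy.__version__)"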
Install SQLite (Optional): the sqlite3 module ships with Python's standard library, so nothing extra is required for this tutorial. If you prefer a graphical client, a tool such as DB Browser for SQLite works well.
Using SQLite with Python: Python's sqlite3 module lets you interact directly with SQLite databases from your Python scripts.
import sqlite3
# Connect to a database (or create it)
conn = sqlite3.connect('titanic.db')
cursor = conn.cursor()
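When you are finished working with the database, close the connection to release the database file:
# Close the connection when done
conn.close()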
Loading the Titanic Dataset
The Titanic dataset is a classic dataset in data science. It contains information about passengers aboard the Titanic, including whether they survived the tragic sinking. We'll use this dataset to perform our SQL and Pandas queries.
Using Python (Pandas)
Import Pandas:
import pandas as pd
Load the Dataset:
# Load the dataset from a CSV file
titanic_passengers = pd.read_csv('titanic.csv')
Notes: make sure titanic.csv is in the same directory as your script, or pass its full path to pd.read_csv().
Explore the Dataset:
# Display the first few rows
print(titanic_passengers.head())
# Get summary statistics
print(titanic_passengers.describe())
# Check for missing values
print(titanic_passengers.isnull().sum())
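Two more quick checks that are often useful before querying (a small optional sketch):
# Number of rows and columns
print(titanic_passengers.shape)
# Column names, data types, and non-null counts
titanic_passengers.info()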
Using SQL
Initialize SQLite Database:
If you haven't already, create a new SQLite database and a table to hold the Titanic data.
import sqlite3
import pandas as pd
# Connect to SQLite database
conn = sqlite3.connect('titanic.db')
cursor = conn.cursor()
# Create table
cursor.execute('''
CREATE TABLE IF NOT EXISTS titanic_passengers (
PassengerId INTEGER PRIMARY KEY,
Survived TEXT,
Pclass INTEGER,
Name TEXT,
Sex TEXT,
Age REAL,
SibSp INTEGER,
Parch INTEGER,
Ticket TEXT,
Fare REAL,
Cabin TEXT,
Embarked TEXT
)
''')
conn.commit()
Import Data into SQLite:
# Load data using Pandas
titanic_passengers = pd.read_csv('titanic.csv')
# Insert data into the SQLite table; if_exists='replace' overwrites the table (and its schema) if it already exists
titanic_passengers.to_sql('titanic_passengers', conn, if_exists='replace', index=False)
Verify Data Import:
# Execute a simple query
cursor.execute('SELECT * FROM titanic_passengers LIMIT 5;')
rows = cursor.fetchall()
for row in rows:
    print(row)
Output:
(343, 'No', 2, 'Collander, Mr. Erik Gustaf', 'male', 28.0, 0, 0, '248740', 13.0, None, 'S')
(76, 'No', 3, 'Moen, Mr. Sigurd Hansen', 'male', 25.0, 0, 0, '348123', 7.65, 'F G73', 'S')
(641, 'No', 3, 'Jensen, Mr. Hans Peder', 'male', 20.0, 0, 0, '350050', 7.8542, None, 'S')
(568, 'No', 3, 'Palsson, Mrs. Nils (Alma Cornelia Berglund)', 'female', 29.0, 0, 4, '349909', 21.075, None, 'S')
(672, 'No', 1, 'Davidson, Mr. Thornton', 'male', 31.0, 1, 0, 'F.C. 12750', 52.0, 'B71', 'S')
Alternatively, you can inspect the table in a SQLite client such as DB Browser for SQLite or the sqlite3 command-line shell.
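You can also verify the import from Pandas itself by reading the table back into a DataFrame (a quick sketch using pandas.read_sql_query with the existing connection):
# Read the first rows of the SQLite table back into a DataFrame
check_df = pd.read_sql_query('SELECT * FROM titanic_passengers LIMIT 5;', conn)
print(check_df)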
Basic Data Queries: SQL vs Pandas
In this section, we'll perform a series of common data operations using both SQL and Python Pandas. For each operation, we'll present the SQL statement followed by its Pandas equivalent.
1. Limiting Rows
Limiting the number of rows returned by a query is useful for previewing data or debugging.
SQL:
SELECT * FROM titanic_passengers LIMIT 5;
SELECT * FROM titanic_passengers LIMIT 20;
Python (Pandas):
# First 5 rows
query = titanic_passengers.head()
# First 20 rows
query01 = titanic_passengers.head(20)
Explanation: SQL's LIMIT n caps the number of rows returned, and Pandas' .head(n) does the same for a DataFrame; with no argument, .head() returns the first 5 rows.
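Pandas also offers other quick previews; for example (assuming the DataFrame loaded above):
# Last 5 rows
print(titanic_passengers.tail())
# 5 random rows
print(titanic_passengers.sample(5))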
2. Selecting Specific Columns
Selecting specific columns allows you to focus on relevant data attributes.
Example A
SQL:
SELECT Name, Pclass, Survived FROM titanic_passengers;
Python (Pandas):
query02 = titanic_passengers[['Name', 'Pclass', 'Survived']]
print(query02)
Example B
SQL:
SELECT PassengerId, Name, Age FROM titanic_passengers;
Python (Pandas):
query03 = titanic_passengers[['PassengerId', 'Name', 'Age']]
Explanation: listing columns after SELECT maps to passing a list of column names in double brackets, which returns a new DataFrame containing only those columns.
3. Handling Unique Values
Retrieving unique values from a column is essential for understanding categorical data.
SQL:
SELECT DISTINCT Pclass FROM titanic_passengers;
Execute in Python Environment:
import sqlite3
# Connect to SQLite database
conn = sqlite3.connect('titanic.db')
cursor = conn.cursor()
# Execute the DISTINCT query
cursor.execute('SELECT DISTINCT Pclass FROM titanic_passengers;')
rows = cursor.fetchall()
for row in rows:
    print(row)
Python (Pandas):
# Using drop_duplicates()
query04 = titanic_passengers['Pclass'].drop_duplicates().reset_index(drop=True)
# Optimized approach using unique()
query04_optimized = titanic_passengers['Pclass'].unique()
Explanation: SELECT DISTINCT removes duplicate values from the result; in Pandas, drop_duplicates() returns a Series without duplicates, while unique() returns a NumPy array of the distinct values and is usually faster.
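If you also want to know how often each class occurs, value_counts() combines the distinct values with their frequencies (a handy complement to unique()):
# Count of passengers per class
print(titanic_passengers['Pclass'].value_counts())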
4. Filtering Data Based on Criteria
Filtering data based on specific conditions helps in isolating subsets of data for analysis.
Example A: Fare < 8
SQL:
SELECT Name, Sex, Age, Pclass, Fare
FROM titanic_passengers
WHERE Fare < 8;
Python (Pandas):
query05 = titanic_passengers.loc[titanic_passengers['Fare'] < 8, ['Name', 'Sex', 'Age', 'Pclass', 'Fare']]
Example B: Fare > 250 and Embarked in ('S', 'C')
SQL:
SELECT Name, Age, Pclass, Fare, Embarked
FROM titanic_passengers
WHERE Fare > 250 AND (Embarked = 'S' OR Embarked = 'C');
Python (Pandas):
query06 = titanic_passengers.loc[
(titanic_passengers['Fare'] > 250) &
(titanic_passengers['Embarked'].isin(['S', 'C'])),
['Name', 'Age', 'Pclass', 'Fare', 'Embarked']
]
Explanation: SQL combines conditions with AND/OR, while Pandas uses & and | with each condition wrapped in parentheses; .isin() is the Pandas equivalent of SQL's IN (or an OR over several values).
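For readers who prefer SQL-like syntax inside Pandas, DataFrame.query() expresses the same filter as a string; a sketch of Example B:
query06_alt = titanic_passengers.query("Fare > 250 and Embarked in ['S', 'C']")[
    ['Name', 'Age', 'Pclass', 'Fare', 'Embarked']
]
print(query06_alt)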
5. Filtering with Substrings
Filtering rows based on substring matches within a column is useful for text-based queries.
SQL:
SELECT Name, Age, Pclass, SibSp, Parch
FROM titanic_passengers
WHERE Name LIKE '%Fortune%';
Python (Pandas):
query07 = titanic_passengers.loc[
titanic_passengers['Name'].str.contains('Fortune', case=False, na=False),
['Name', 'Age', 'Pclass', 'SibSp', 'Parch']
]
Explanation: LIKE '%Fortune%' matches any Name containing "Fortune"; .str.contains() does the same in Pandas, with case=False making the match case-insensitive and na=False treating missing names as non-matches.
6. Aggregate Functions
Aggregate functions perform calculations on a set of values to return a single value.
Example A: Maximum Fare
SQL:
SELECT MAX(Fare) AS max_fare FROM titanic_passengers;
Python (Pandas):
query08 = titanic_passengers['Fare'].max()
Example B: Count of Survived Passengers
SQL:
SELECT COUNT(PassengerId) AS survived_passengers
FROM titanic_passengers
WHERE Survived = 'Yes';
Python (Pandas):
query09 = titanic_passengers.loc[titanic_passengers['Survived'] == 'Yes', 'PassengerId'].count()
Example C: Average Fare by Passenger Class
SQL:
SELECT Pclass AS passenger_class, AVG(Fare) AS average_price
FROM titanic_passengers
GROUP BY Pclass;
Python (Pandas):
query10 = (
    titanic_passengers
    .groupby('Pclass')['Fare']
    .mean()
    .reset_index(name='average_price')
    .rename(columns={'Pclass': 'passenger_class'})
)
Explanation: aggregate functions such as MAX, COUNT, and AVG map directly to Pandas' .max(), .count(), and .mean(); GROUP BY corresponds to .groupby(), and column aliases to reset_index(name=...) or rename().
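GROUP BY can also compute several aggregates at once; in Pandas, .agg() accepts a list of function names (a minimal sketch):
# Mean, maximum, and count of Fare per passenger class
fare_summary = titanic_passengers.groupby('Pclass')['Fare'].agg(['mean', 'max', 'count'])
print(fare_summary)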
7. Sorting Data
Sorting data helps organize and rank information based on specific columns.
SQL:
SELECT Survived, Name, Age, Sex
FROM titanic_passengers
WHERE Survived = 'Yes'
ORDER BY Age DESC
LIMIT 10;
Python (Pandas):
query11 = titanic_passengers.loc[
titanic_passengers['Survived'] == 'Yes',
['Survived', 'Name', 'Age', 'Sex']
].sort_values(by='Age', ascending=False).head(10)
Explanation: ORDER BY Age DESC corresponds to .sort_values(by='Age', ascending=False), and LIMIT 10 to .head(10); the WHERE clause is expressed with .loc before sorting.
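When you only need the top n rows by a single column, .nlargest() is a shortcut for sorting followed by .head() (it also skips rows with missing Age automatically):
# Ten oldest survivors without an explicit sort
oldest_survivors = (
    titanic_passengers
    .loc[titanic_passengers['Survived'] == 'Yes', ['Survived', 'Name', 'Age', 'Sex']]
    .nlargest(10, 'Age')
)
print(oldest_survivors)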
8. Handling Missing Values
Handling missing data is crucial for accurate analysis.
Counting Missing Ages
SQL:
SELECT COUNT(PassengerId) AS no_age_passengers
FROM titanic_passengers
WHERE Age IS NULL;
Python (Pandas):
# Method 1: Using isnull() and sum()
query12 = titanic_passengers['Age'].isnull().sum()
# Method 2: Using notnull() and len()
query12a = len(titanic_passengers) - titanic_passengers['Age'].notnull().sum()
Identifying Columns with Missing Values
Python (Pandas):
missing_values = titanic_passengers.isnull().sum()
print(missing_values)
Output:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
Explanation: SQL represents missing data as NULL and tests it with IS NULL; Pandas represents it as NaN and detects it with isnull()/notnull(). In this dataset, only Age, Cabin, and Embarked contain missing values.
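To put the missing counts in perspective, you can also express them as a percentage of all rows (a small sketch):
# Share of missing values per column, in percent
missing_pct = (titanic_passengers.isnull().mean() * 100).round(1)
print(missing_pct)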
Practice Queries
Now that we've covered basic queries, let's apply these concepts with more complex practice queries.
I. Count of Passengers Embarked from Each Port
Ports: C = Cherbourg, Q = Queenstown, S = Southampton.
SQL:
SELECT Embarked, COUNT(*) AS passenger_count
FROM titanic_passengers
WHERE Embarked IN ('C', 'S', 'Q')
GROUP BY Embarked
ORDER BY passenger_count DESC;
Python (Pandas):
practice_query01 = (
titanic_passengers
.loc[titanic_passengers['Embarked'].isin(['C', 'S', 'Q']), ['Embarked', 'PassengerId']]
.groupby('Embarked')['PassengerId']
.count()
.reset_index(name='passenger_count')
.sort_values('passenger_count', ascending=False)
)
print(practice_query01)
Explanation: grouping by Embarked and counting PassengerId reproduces COUNT(*) per port, and sort_values(..., ascending=False) mirrors ORDER BY passenger_count DESC.
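A shorter Pandas idiom for this particular question is value_counts(), which counts and sorts in one step (it silently drops the two rows with a missing Embarked value):
# Passenger count per embarkation port, sorted descending by default
print(titanic_passengers['Embarked'].value_counts())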
II. Average Age of Men & Women in Each Passenger Class
SQL:
SELECT Pclass, Sex, AVG(Age) AS average_age
FROM titanic_passengers
GROUP BY Pclass, Sex
ORDER BY Pclass ASC;
Python (Pandas):
practice_query02 = (
titanic_passengers
.groupby(['Pclass', 'Sex'])['Age']
.mean()
.reset_index(name='average_age')
.sort_values('Pclass')
)
print(practice_query02)
Explanation: grouping by both Pclass and Sex and averaging Age mirrors GROUP BY Pclass, Sex with AVG(Age); both AVG and .mean() ignore missing ages.
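The same result reads nicely as a two-dimensional table with pivot_table, using classes as rows and sexes as columns (a sketch):
# Average age by class (rows) and sex (columns)
age_pivot = titanic_passengers.pivot_table(values='Age', index='Pclass', columns='Sex', aggfunc='mean')
print(age_pivot)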
III. Top 20 Passengers Who Paid the Highest Fares
SQL:
SELECT Name, Fare
FROM titanic_passengers
ORDER BY Fare DESC
LIMIT 20;
Python (Pandas):
practice_query03 = (
titanic_passengers
.loc[:, ['Name', 'Fare']]
.sort_values('Fare', ascending=False)
.head(20)
)
print(practice_query03)
Explanation: selecting only Name and Fare, sorting by Fare in descending order, and taking the first 20 rows reproduces ORDER BY Fare DESC LIMIT 20.
IV. Average Age of 3rd Class Non-Surviving Men
SQL:
SELECT AVG(Age) AS average_age
FROM titanic_passengers
WHERE Sex = 'male' AND Survived = 'No' AND Pclass = 3;
Python (Pandas):
practice_query04 = (
titanic_passengers
.loc[
(titanic_passengers['Sex'].str.lower() == 'male') &
(titanic_passengers['Survived'].str.lower() == 'no') &
(titanic_passengers['Pclass'] == 3),
'Age'
]
.mean()
)
print(practice_query04)
Explanation: the three conditions are combined with &, matching the AND clauses in SQL; .str.lower() makes the text comparisons case-insensitive, and .mean() on the filtered Age column corresponds to AVG(Age).
V. Average Age of 1st Class Surviving Women
SQL:
SELECT AVG(Age) AS average_age
FROM titanic_passengers
WHERE Sex = 'female' AND Survived = 'Yes' AND Pclass = 1;
Python (Pandas):
practice_query05 = (
titanic_passengers
.loc[
(titanic_passengers['Sex'].str.lower() == 'female') &
(titanic_passengers['Survived'].str.lower() == 'yes') &
(titanic_passengers['Pclass'] == 1),
'Age'
]
.mean()
)
print(practice_query05)
Explanation: this is the same pattern as the previous query, with the filters switched to female passengers in first class who survived.
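Queries IV and V are really two cells of the same breakdown; a single groupby over Pclass, Sex, and Survived produces the whole table at once (a sketch, assuming the Yes/No Survived labels used throughout this tutorial):
# Average age for every combination of class, sex, and survival status;
# the values from queries IV and V appear as individual rows
avg_age_breakdown = (
    titanic_passengers
    .groupby(['Pclass', 'Sex', 'Survived'])['Age']
    .mean()
    .reset_index(name='average_age')
)
print(avg_age_breakdown)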
Best Practices
Adhering to best practices ensures that your data analysis is efficient, readable, and maintainable. Below are some best practices for both Python Pandas and SQL.
A) Python Pandas Best Practices
Import Libraries Efficiently:
import pandas as pd
import numpy as np
Use Meaningful Variable Names: prefer descriptive names such as first_class_passengers over generic ones like df1 or tmp.
Chain Methods for Readability:
result = (
titanic_passengers
.loc[titanic_passengers['Fare'] > 100]
.groupby('Pclass')['Fare']
.mean()
.reset_index()
)
Handle Missing Data Appropriately:
# Drop rows with missing Age
titanic_clean = titanic_passengers.dropna(subset=['Age'])
# Fill missing Age with the median age
titanic_passengers['Age'] = titanic_passengers['Age'].fillna(titanic_passengers['Age'].median())
Vectorize Operations:
# Prefer vectorized operations over row-by-row apply() or loops when creating a new column
titanic_passengers['AgeGroup'] = np.where(titanic_passengers['Age'] >= 18, 'Adult', 'Minor')
Use .copy() When Slicing DataFrames:
subset = titanic_passengers.loc[titanic_passengers['Pclass'] == 1].copy()
Leverage Built-in Functions:
# Get counts of each passenger class
class_counts = titanic_passengers['Pclass'].value_counts()
Comment and Document Your Code:
def calculate_average_fare(df):
    """
    Calculates the average fare per passenger class.

    Parameters:
        df (DataFrame): The Titanic passengers DataFrame.

    Returns:
        DataFrame: A DataFrame with passenger class and average fare.
    """
    return df.groupby('Pclass')['Fare'].mean().reset_index(name='average_fare')
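The documented helper can then be called like any other function:
average_fares = calculate_average_fare(titanic_passengers)
print(average_fares)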
B) SQL Best Practices
Use Aliases for Readability:
SELECT Pclass AS passenger_class, AVG(Fare) AS average_fare
FROM titanic_passengers
GROUP BY Pclass;
Format Queries for Clarity:
SELECT
Pclass AS passenger_class,
AVG(Fare) AS average_fare
FROM
titanic_passengers
GROUP BY
Pclass;
Avoid Using SELECT *:
SELECT Name, Age, Fare FROM titanic_passengers;
Use Proper Indexing:
CREATE INDEX idx_survived ON titanic_passengers(Survived);
Limit the Use of Subqueries:
-- Instead of a subquery
SELECT Name, Age
FROM titanic_passengers
WHERE Pclass = (SELECT MIN(Pclass) FROM titanic_passengers);
-- Use JOIN
SELECT tp.Name, tp.Age
FROM titanic_passengers tp
JOIN (SELECT MIN(Pclass) AS min_pclass FROM titanic_passengers) sub
ON tp.Pclass = sub.min_pclass;
Handle NULLs Appropriately:
SELECT Name, Age
FROM titanic_passengers
WHERE Age IS NOT NULL;
Use Transactions for Data Integrity:
BEGIN TRANSACTION;
-- Insert a new passenger
INSERT INTO titanic_passengers (PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked)
VALUES (892, 'No', 3, 'Kelly, Mr. James', 'male', 34.5, 0, 0, '330911', 7.8292, NULL, 'Q');
-- Update fare for 1st class passengers
UPDATE titanic_passengers
SET Fare = Fare * 1.1
WHERE Pclass = 1;
COMMIT;
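From Python, a sqlite3 connection can manage the same transaction automatically: using the connection as a context manager commits on success and rolls back if an exception occurs (a minimal sketch, reusing the conn object created earlier; it does not close the connection):
# Commit both statements together, or roll back if either fails
with conn:
    conn.execute("UPDATE titanic_passengers SET Fare = Fare * 1.1 WHERE Pclass = 1;")
    conn.execute("UPDATE titanic_passengers SET Fare = Fare * 0.9 WHERE Pclass = 3;")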
Comment Complex Queries:
-- Calculate the average fare per passenger class
SELECT Pclass, AVG(Fare) AS average_fare
FROM titanic_passengers
GROUP BY Pclass;
In this tutorial, we've delved into the fundamental data analysis capabilities of both SQL and Python Pandas using the Titanic dataset. By comparing equivalent operations in both tools, we've highlighted their respective strengths and demonstrated how they can complement each other in a data analyst's toolkit.
Key Takeaways: SQL shines when the data already lives in a database and for set-based filtering, grouping, and joining; Pandas shines for in-memory exploration, cleaning, and quick statistics; and the two combine naturally, since query results can be pulled straight into a DataFrame for further analysis.
Additional Resources
To further enhance your understanding of SQL and Python Pandas, good places to continue are the official Pandas documentation, the SQLite documentation, and the Kaggle Titanic competition page.
Happy data analyzing!
Neven Dujmovic, October 2024
#Python #PythonPandas #SQL #DataScience #TitanicDataset #DataAnalysis #LearnSQL #PythonProgramming #AnalyticsTools #DataManipulation