SQL & Python Pandas: A Beginner's Tutorial Using the Titanic Dataset

Recently, I had a few discussions about how to get started with data analytics tools, and to address those questions, I’ve created this comprehensive tutorial for beginners. This guide covers the fundamentals of SQL and Python Pandas for data analysis, using the famous Titanic dataset to demonstrate common data operations in both tools. By the end of this tutorial, you'll have the knowledge to switch seamlessly between SQL and Python Pandas, taking advantage of their strengths in your data analysis projects.

Introduction

Let’s explore two of the most widely used tools for data analysis:

  • SQL (Structured Query Language) is the standard language for interacting with relational databases. It's efficient for querying large datasets, performing aggregations, and managing data storage.
  • Python Pandas is a powerful library for data manipulation and analysis. It provides data structures like DataFrames, which make handling and analyzing data intuitive and flexible.

This tutorial aims to bridge the gap between these two tools by demonstrating how common data operations can be performed in both SQL and Python Pandas using the Titanic dataset. Whether you're more comfortable with SQL or Python, this guide will help you harness the full potential of both.

Titanic dataset

The sinking of the Titanic on April 15, 1912, remains one of history's most tragic shipwrecks. Considered "unsinkable," the RMS Titanic collided with an iceberg during her maiden voyage, leading to the deaths of 1,502 of the 2,224 passengers and crew. Limited lifeboat availability and other factors meant that survival chances varied across different groups.

Download the Titanic dataset at:

https://github.com/nevendujmovic/titanic/blob/main/titanic.csv

or search for the Titanic dataset on Kaggle:

https://www.kaggle.com/

Save it as “titanic.csv” and place it in the project directory.

Titanic dataset explained:

  • Columns: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked
  • Survived (whether the passenger survived; 'Yes'/'No' in this version of the dataset, commonly encoded elsewhere as 0 = No, 1 = Yes)
  • Pclass (ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd)
  • Age (age in years)
  • SibSp (number of siblings/spouses aboard the Titanic)
  • Parch (number of parents/children aboard the Titanic)
  • Ticket (ticket number)
  • Fare (passenger fare)
  • Cabin (cabin number)
  • Embarked (port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton)

Setting Up Your Environment

Before diving into data analysis, it's essential to set up your working environment.

A) Installing Python

To get started, ensure that Python is installed on your system.

  1. Download Python: Visit the official Python website and download the latest stable release compatible with your operating system (Windows, macOS, or Linux).
  2. Install Python: Run the downloaded installer. Important: During installation, check the box that says "Add Python to PATH". This allows you to use Python and pip from the command line. Follow the on-screen instructions to complete the installation.
  3. Verify Installation: Open your terminal or command prompt and run:

python --version        

You should see the installed Python version.

B) Setting Up a Virtual Environment

Using virtual environments is a best practice to manage dependencies and avoid project conflicts.

Create a Virtual Environment:

python -m venv titanic_env        

This command creates a new virtual environment named titanic_env.

Activate the Virtual Environment:

Windows:

titanic_env\Scripts\activate        

macOS/Linux:

source titanic_env/bin/activate        

Note: After activation, your terminal prompt will be prefixed with (titanic_env).

Deactivate the Virtual Environment:

deactivate        

C) Installing Necessary Libraries

With your virtual environment activated, install the required Python libraries using pip.

Upgrade pip:

python -m pip install --upgrade pip        

Install Pandas and Other Libraries:

pip install pandas numpy matplotlib seaborn        

  • Pandas: For data manipulation and analysis.
  • NumPy: For numerical operations.
  • Matplotlib & Seaborn: For data visualization.

Install SQLite (Optional): SQLite support ships with Python's standard library (the sqlite3 module), so no separate installation is required. If you would like a graphical interface for browsing the database, a tool such as DB Browser for SQLite works well.

Using SQLite with Python: Python's sqlite3 module lets you interact directly with SQLite databases from your Python scripts.

import sqlite3
# Connect to a database (or create it)
conn = sqlite3.connect('titanic.db')
cursor = conn.cursor()        
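
For short scripts you may want the connection cleaned up automatically. Here is a minimal sketch using contextlib.closing; in this tutorial we simply keep conn open across examples:

import sqlite3
from contextlib import closing

# closing() guarantees conn.close() runs even if a statement raises
with closing(sqlite3.connect('titanic.db')) as conn:
    cursor = conn.cursor()
    cursor.execute('SELECT sqlite_version();')
    print(cursor.fetchone())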

Loading the Titanic Dataset

The Titanic dataset is a classic dataset in data science. It contains information about passengers aboard the Titanic, including whether they survived the tragic sinking. We'll use this dataset to perform our SQL and Pandas queries.

Using Python (Pandas)

Import Pandas:

import pandas as pd        

Load the Dataset:

# Load the dataset from a CSV file
titanic_passengers = pd.read_csv('titanic.csv')        

Notes:

  • Ensure that the titanic.csv file is in the same directory as your Python script, or specify the full file path (see the pathlib sketch below).
  • If you're using Jupyter Notebook or another IDE, the working directory is typically the folder where the script resides.
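
If the CSV lives elsewhere, building an absolute path with pathlib avoids working-directory surprises. A minimal sketch, assuming the file sits next to the script:

import pandas as pd
from pathlib import Path

# Resolve titanic.csv relative to this script file, not the current working directory
# (__file__ is not defined inside Jupyter notebooks; there, rely on the notebook's folder)
csv_path = Path(__file__).resolve().parent / 'titanic.csv'
titanic_passengers = pd.read_csv(csv_path)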

Explore the Dataset:

# Display the first few rows
print(titanic_passengers.head())

# Get summary statistics
print(titanic_passengers.describe())

# Check for missing values
print(titanic_passengers.isnull().sum())        

Using SQL

Initialize SQLite Database:

If you haven't already, create a new SQLite database and a table to hold the Titanic data.

import sqlite3
import pandas as pd

# Connect to SQLite database
conn = sqlite3.connect('titanic.db')
cursor = conn.cursor()

# Create table
cursor.execute('''
    CREATE TABLE IF NOT EXISTS titanic_passengers (
        PassengerId INTEGER PRIMARY KEY,
        Survived TEXT,
        Pclass INTEGER,
        Name TEXT,
        Sex TEXT,
        Age REAL,
        SibSp INTEGER,
        Parch INTEGER,
        Ticket TEXT,
        Fare REAL,
        Cabin TEXT,
        Embarked TEXT
    )
''')
conn.commit()        

Import Data into SQLite:

# Load data using Pandas
titanic_passengers = pd.read_csv('titanic.csv')

# Insert data into SQLite table.
# Note: if_exists='replace' drops and recreates the table, so Pandas infers
# its own column types; the CREATE TABLE above documents the intended schema.
titanic_passengers.to_sql('titanic_passengers', conn, if_exists='replace', index=False)        

Verify Data Import:

# Execute a simple query
cursor.execute('SELECT * FROM titanic_passengers LIMIT 5;')
rows = cursor.fetchall()
for row in rows:
    print(row)        

Output:

(343, 'No', 2, 'Collander, Mr. Erik Gustaf', 'male', 28.0, 0, 0, '248740', 13.0, None, 'S')
(76, 'No', 3, 'Moen, Mr. Sigurd Hansen', 'male', 25.0, 0, 0, '348123', 7.65, 'F G73', 'S')
(641, 'No', 3, 'Jensen, Mr. Hans Peder', 'male', 20.0, 0, 0, '350050', 7.8542, None, 'S')
(568, 'No', 3, 'Palsson, Mrs. Nils (Alma Cornelia Berglund)', 'female', 29.0, 0, 4, '349909', 21.075, None, 'S')
(672, 'No', 1, 'Davidson, Mr. Thornton', 'male', 31.0, 1, 0, 'F.C. 12750', 52.0, 'B71', 'S')

Alternatively, use a SQLite client such as DB Browser for SQLite to inspect and query your database visually.
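
You can also pull query results straight into a Pandas DataFrame with pandas.read_sql_query, which is handy when moving between the two tools. A minimal sketch reusing the conn object from above:

import pandas as pd

# Run SQL against SQLite and receive the result as a DataFrame
preview = pd.read_sql_query('SELECT * FROM titanic_passengers LIMIT 5;', conn)
print(preview)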

Basic Data Queries: SQL vs Pandas

In this section, we'll perform a series of common data operations using both SQL and Python Pandas. For each operation, we'll present the SQL statement followed by its Pandas equivalent.

1. Limiting Rows

Limiting the number of rows returned by a query is useful for previewing data or debugging.

SQL:

SELECT * FROM titanic_passengers LIMIT 5;
SELECT * FROM titanic_passengers LIMIT 20;        

Python (Pandas):

# First 5 rows
query = titanic_passengers.head()

# First 20 rows
query01 = titanic_passengers.head(20)        

Explanation:

  • In SQL, LIMIT restricts the number of rows returned.
  • In Pandas, head() serves the same purpose, with an optional parameter to specify the number of rows.

2. Selecting Specific Columns

Selecting specific columns allows you to focus on relevant data attributes.

Example A

SQL:

SELECT Name, Pclass, Survived FROM titanic_passengers;        

Python (Pandas):

query02 = titanic_passengers[['Name', 'Pclass', 'Survived']]
print(query02)        

Example B

SQL:

SELECT PassengerId, Name, Age FROM titanic_passengers;        

Python (Pandas):

query03 = titanic_passengers[['PassengerId', 'Name', 'Age']]        

Explanation:

  • In SQL, specify the column names after SELECT.
  • In Pandas, pass a list of column names to the DataFrame indexing operator [].

3. Handling Unique Values

Retrieving unique values from a column is essential for understanding categorical data.

SQL:

SELECT DISTINCT Pclass FROM titanic_passengers;        

Execute in Python Environment:

import sqlite3

# Connect to SQLite database
conn = sqlite3.connect('titanic.db')
cursor = conn.cursor()

# Execute the DISTINCT query
cursor.execute('SELECT DISTINCT Pclass FROM titanic_passengers;')
rows = cursor.fetchall()
for row in rows:
    print(row)        

Python (Pandas):

# Using drop_duplicates()
query04 = titanic_passengers['Pclass'].drop_duplicates().reset_index(drop=True)

# Optimized approach using unique()
query04_optimized = titanic_passengers['Pclass'].unique()        

Explanation:

  • In SQL, DISTINCT eliminates duplicate values.
  • In Pandas, drop_duplicates() or unique() achieves the same result. unique() is more efficient for retrieving unique values as a NumPy array.
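
If you only need the number of distinct values rather than the values themselves, nunique() is a convenient shortcut. A small sketch (NaN is excluded by default):

# Count distinct ticket classes
n_classes = titanic_passengers['Pclass'].nunique()
print(n_classes)  # 3 for this dataset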


4. Filtering Data Based on Criteria

Filtering data based on specific conditions helps in isolating subsets of data for analysis.

Example A: Fare < 8

SQL:

SELECT Name, Sex, Age, Pclass, Fare
FROM titanic_passengers
WHERE Fare < 8;        

Python (Pandas):

query05 = titanic_passengers.loc[
    titanic_passengers['Fare'] < 8,
    ['Name', 'Sex', 'Age', 'Pclass', 'Fare']
]        

Example B: Fare > 250 and Embarked in ('S', 'C')

SQL:

SELECT Name, Age, Pclass, Fare, Embarked
FROM titanic_passengers
WHERE Fare > 250 AND (Embarked = 'S' OR Embarked = 'C');        

Python (Pandas):

query06 = titanic_passengers.loc[
    (titanic_passengers['Fare'] > 250) &
    (titanic_passengers['Embarked'].isin(['S', 'C'])),
    ['Name', 'Age', 'Pclass', 'Fare', 'Embarked']
]        

Explanation:

  • In SQL, the WHERE clause specifies the filtering conditions.
  • In Pandas, boolean indexing is used within loc[] to filter rows, and the second argument specifies the columns to select.
  • Using isin() enhances readability and efficiency when checking for multiple values.
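
As an alternative, DataFrame.query() expresses the same filter as a string, which some readers find closer to SQL. A sketch of the equivalent condition:

query06_alt = titanic_passengers.query("Fare > 250 and Embarked in ['S', 'C']")[
    ['Name', 'Age', 'Pclass', 'Fare', 'Embarked']
]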


5. Filtering with Substrings

Filtering rows based on substring matches within a column is useful for text-based queries.

SQL:

SELECT Name, Age, Pclass, SibSp, Parch
FROM titanic_passengers
WHERE Name LIKE '%Fortune%';        

Python (Pandas):

query07 = titanic_passengers.loc[
    titanic_passengers['Name'].str.contains('Fortune', case=False, na=False),
    ['Name', 'Age', 'Pclass', 'SibSp', 'Parch']
]        

Explanation:

  • In SQL, LIKE '%Fortune%' searches for the substring "Fortune" anywhere in the Name column.
  • In Pandas, str.contains() performs a similar operation. The parameters case=False make the search case-insensitive, and na=False ensures that NaN values are treated as False.
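
Note that str.contains() interprets its pattern as a regular expression by default. For a literal substring match, pass regex=False, as in this small sketch:

# Literal, case-insensitive match; NaN names are treated as non-matches
mask = titanic_passengers['Name'].str.contains('Fortune', case=False, na=False, regex=False)
query07_literal = titanic_passengers.loc[mask, ['Name', 'Age', 'Pclass', 'SibSp', 'Parch']]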


6. Aggregate Functions

Aggregate functions perform calculations on a set of values to return a single value.

Example A: Maximum Fare

SQL:

SELECT MAX(Fare) AS max_fare FROM titanic_passengers;        

Python (Pandas):

query08 = titanic_passengers['Fare'].max()        

Example B: Count of Survived Passengers

SQL:

SELECT COUNT(PassengerId) AS survived_passengers
FROM titanic_passengers
WHERE Survived = 'Yes';        

Python (Pandas):

query09 = titanic_passengers.loc[titanic_passengers['Survived'] == 'Yes', 'PassengerId'].count()        

Example C: Average Fare by Passenger Class

SQL:

SELECT Pclass AS passenger_class, AVG(Fare) AS average_price
FROM titanic_passengers
GROUP BY Pclass;        

Python (Pandas):

query10 = (
    titanic_passengers
    .groupby('Pclass')['Fare']
    .mean()
    .reset_index(name='average_price')
    .rename(columns={'Pclass': 'passenger_class'})
)        

Explanation:

  • In SQL, aggregate functions like MAX, COUNT, and AVG perform calculations over groups of rows defined by GROUP BY.
  • In Pandas, methods like max(), count(), and mean() are used after grouping data with groupby().
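
Pandas can also compute several aggregates in one pass with agg(), roughly mirroring a SQL SELECT that lists multiple aggregate functions. A minimal sketch:

# Minimum, mean, and maximum fare per ticket class in a single groupby
fare_stats = titanic_passengers.groupby('Pclass')['Fare'].agg(['min', 'mean', 'max'])
print(fare_stats)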


7. Sorting Data

Sorting data helps organize and rank information based on specific columns.

SQL:

SELECT Survived, Name, Age, Sex
FROM titanic_passengers
WHERE Survived = 'Yes'
ORDER BY Age DESC
LIMIT 10;        

Python (Pandas):

query11 = titanic_passengers.loc[
    titanic_passengers['Survived'] == 'Yes',
    ['Survived', 'Name', 'Age', 'Sex']
].sort_values(by='Age', ascending=False).head(10)        

Explanation:

  • In SQL, ORDER BY sorts the data, and LIMIT restricts the number of rows.
  • In Pandas, sort_values() sorts the DataFrame, and head() retrieves the top rows.
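
For a "top N by column" pattern, nlargest() is a concise alternative to sorting and then slicing. A sketch of the same result:

# Ten oldest survivors without an explicit sort_values() call
oldest_survivors = titanic_passengers.loc[
    titanic_passengers['Survived'] == 'Yes',
    ['Survived', 'Name', 'Age', 'Sex']
].nlargest(10, 'Age')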


8. Handling Missing Values

Handling missing data is crucial for accurate analysis.

Counting Missing Ages

SQL:

SELECT COUNT(PassengerId) AS no_age_passengers
FROM titanic_passengers
WHERE Age IS NULL;        

Python (Pandas):

# Method 1: Using isnull() and sum()
query12 = titanic_passengers['Age'].isnull().sum()

# Method 2: Using notnull() and len()
query12a = len(titanic_passengers) - titanic_passengers['Age'].notnull().sum()        

Identifying Columns with Missing Values

Python (Pandas):

missing_values = titanic_passengers.isnull().sum()
print(missing_values)        

Output:

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64        

Explanation:

  • In SQL, IS NULL checks for NULL values.
  • In Pandas, isnull() identifies NaN values. Summing these gives the count of missing entries.
  • The second method (query12a) calculates missing values by subtracting the count of non-null entries from the total length.
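
On the SQL side, COUNT(column) skips NULLs, so the same figure can be derived without a WHERE clause. A sketch:

SELECT COUNT(*) - COUNT(Age) AS no_age_passengers
FROM titanic_passengers;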


Practice Queries

Now that we've covered basic queries, let's apply these concepts with more complex practice queries.

I. Count of Passengers Embarked from Each Port

Ports:

  • C = Cherbourg
  • S = Southampton
  • Q = Queenstown

SQL:

SELECT Embarked, COUNT(*) AS passenger_count
FROM titanic_passengers
WHERE Embarked IN ('C', 'S', 'Q')
GROUP BY Embarked
ORDER BY passenger_count DESC;        

Python (Pandas):

practice_query01 = (
    titanic_passengers
    .loc[titanic_passengers['Embarked'].isin(['C', 'S', 'Q']), ['Embarked', 'PassengerId']]
    .groupby('Embarked')['PassengerId']
    .count()
    .reset_index(name='passenger_count')
    .sort_values('passenger_count', ascending=False)
)
print(practice_query01)        

Explanation:

  • The WHERE clause in SQL filters for specific embarkation points.
  • In Pandas, isin() filters the Embarked column for the desired values. The subsequent groupby() and count() operations aggregate the data similarly to SQL.
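
When you just need frequencies of a single column, value_counts() collapses the groupby/count/sort chain into one call. A sketch:

# Passengers per embarkation port, sorted descending by default (NaN excluded)
port_counts = titanic_passengers['Embarked'].value_counts()
print(port_counts)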


II. Average Age of Men & Women in Each Passenger Class

SQL:

SELECT Pclass, Sex, AVG(Age) AS average_age
FROM titanic_passengers
GROUP BY Pclass, Sex
ORDER BY Pclass ASC;        

Python (Pandas):

practice_query02 = (
    titanic_passengers
    .groupby(['Pclass', 'Sex'])['Age']
    .mean()
    .reset_index(name='average_age')
    .sort_values('Pclass')
)
print(practice_query02)        

Explanation:

  • Both SQL and Pandas perform a grouped aggregation to compute the average age for each combination of passenger class and sex.
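
The same grouped average also reads nicely as a two-way table via pivot_table(), with classes as rows and sexes as columns. A sketch:

# Average age cross-tabulated by class and sex
age_table = titanic_passengers.pivot_table(values='Age', index='Pclass', columns='Sex', aggfunc='mean')
print(age_table)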


III. Top 20 Passengers Who Paid the Highest Fares

SQL:

SELECT Name, Fare
FROM titanic_passengers
ORDER BY Fare DESC
LIMIT 20;        

Python (Pandas):

practice_query03 = (
    titanic_passengers
    .loc[:, ['Name', 'Fare']]
    .sort_values('Fare', ascending=False)
    .head(20)
)
print(practice_query03)        

Explanation:

  • Sorting the Fare column in descending order and retrieving the top 20 entries in both SQL and Pandas achieves the desired result.


IV. Average Age of 3rd Class Non-Surviving Men

SQL:

SELECT AVG(Age) AS average_age
FROM titanic_passengers
WHERE Sex = 'male' AND Survived = 'No' AND Pclass = 3;        

Python (Pandas):

practice_query04 = (
    titanic_passengers
    .loc[
        (titanic_passengers['Sex'].str.lower() == 'male') &
        (titanic_passengers['Survived'].str.lower() == 'no') &
        (titanic_passengers['Pclass'] == 3),
        'Age'
    ]
    .mean()
)
print(practice_query04)        

Explanation:

  • The WHERE clause in SQL filters based on multiple conditions.
  • In Pandas, multiple conditions are combined using the & operator, and str.lower() ensures case-insensitive matching for the Survived column.


V. Average Age of 1st Class Survived Women

SQL:

SELECT AVG(Age) AS average_age
FROM titanic_passengers
WHERE Sex = 'female' AND Survived = 'Yes' AND Pclass = 1;        

Python (Pandas):

practice_query05 = (
    titanic_passengers
    .loc[
        (titanic_passengers['Sex'].str.lower() == 'female') &
        (titanic_passengers['Survived'].str.lower() == 'yes') &
        (titanic_passengers['Pclass'] == 1),
        'Age'
    ]
    .mean()
)
print(practice_query05)        

Explanation:

  • Similar to the previous query, but with different filtering conditions to target 1st class women who survived.

Best Practices

Adhering to best practices ensures that your data analysis is efficient, readable, and maintainable. Below are some best practices for both Python Pandas and SQL.

A) Python Pandas Best Practices

Import Libraries Efficiently:

import pandas as pd
import numpy as np        

Use Meaningful Variable Names:

  • Avoid vague names like df1, temp.
  • Example: Use titanic_passengers instead of df.

Chain Methods for Readability:

result = (
    titanic_passengers
    .loc[titanic_passengers['Fare'] > 100]
    .groupby('Pclass')['Fare']
    .mean()
    .reset_index()
)        

Handle Missing Data Appropriately:

  • Use dropna() to remove missing values when necessary.
  • Use fillna() to impute missing values with meaningful statistics.

# Drop rows with missing Age
titanic_clean = titanic_passengers.dropna(subset=['Age'])

# Fill missing Age with the median age (assigning back avoids the
# chained-assignment pitfall of fillna(inplace=True) on a column selection)
titanic_passengers['Age'] = titanic_passengers['Age'].fillna(titanic_passengers['Age'].median())        

Vectorize Operations:

  • Avoid using loops for data manipulation; leverage Pandas' vectorized operations for speed and efficiency.

# Instead of looping (or a row-wise apply) to create a new column,
# use a vectorized expression.
# Note: passengers with a missing Age are labeled 'Minor' here;
# handle NaN first if that matters.
titanic_passengers['AgeGroup'] = np.where(titanic_passengers['Age'] >= 18, 'Adult', 'Minor')        

Use .copy() When Slicing DataFrames:

  • Prevent unintended modifications by creating copies of sliced DataFrames.

subset = titanic_passengers.loc[titanic_passengers['Pclass'] == 1].copy()        

Leverage Built-in Functions:

  • Use built-in Pandas functions like describe(), info(), and value_counts() for quick data insights.

# Get counts of each passenger class
class_counts = titanic_passengers['Pclass'].value_counts()        

Comment and Document Your Code:

  • Use comments to explain non-obvious parts of your code.
  • Consider using docstrings for functions.

def calculate_average_fare(df):
    """
    Calculates the average fare per passenger class.
    Parameters:
        df (DataFrame): The Titanic passengers DataFrame.
    Returns:
        DataFrame: A DataFrame with passenger class and average fare.
    """
    return df.groupby('Pclass')['Fare'].mean().reset_index(name='average_fare')        


B) SQL Best Practices

Use Aliases for Readability:

SELECT Pclass AS passenger_class, AVG(Fare) AS average_fare
FROM titanic_passengers
GROUP BY Pclass;        

Format Queries for Clarity:

  • Use uppercase for SQL keywords.
  • Indent clauses to enhance readability.

SELECT
    Pclass AS passenger_class,
    AVG(Fare) AS average_fare
FROM
    titanic_passengers
GROUP BY
    Pclass;        

Avoid Using SELECT *:

  • Specify only the columns you need to reduce data retrieval overhead.

SELECT Name, Age, Fare FROM titanic_passengers;        

Use Proper Indexing:

  • Index columns that are frequently used in WHERE, JOIN, and ORDER BY clauses to speed up queries.

CREATE INDEX idx_survived ON titanic_passengers(Survived);        

Limit the Use of Subqueries:

  • Where possible, use JOIN operations instead of subqueries for better performance.

-- Instead of a subquery
SELECT Name, Age
FROM titanic_passengers
WHERE Pclass = (SELECT MIN(Pclass) FROM titanic_passengers);

-- Use JOIN
SELECT tp.Name, tp.Age
FROM titanic_passengers tp
JOIN (SELECT MIN(Pclass) AS min_pclass FROM titanic_passengers) sub
    ON tp.Pclass = sub.min_pclass;        

Handle NULLs Appropriately:

  • Use IS NULL or IS NOT NULL to manage missing data.
  • Consider default values or data cleaning to handle NULL values.

SELECT Name, Age
FROM titanic_passengers
WHERE Age IS NOT NULL;        

Use Transactions for Data Integrity:

  • Ensure that a series of operations either all succeed or fail together to maintain data integrity.

BEGIN TRANSACTION;

-- Insert a new passenger
INSERT INTO titanic_passengers (PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked)
VALUES (892, 'No', 3, 'Kelly, Mr. James', 'male', 34.5, 0, 0, '330911', 7.8292, NULL, 'Q');

-- Update fare for 1st class passengers
UPDATE titanic_passengers
SET Fare = Fare * 1.1
WHERE Pclass = 1;

COMMIT;        

Comment Complex Queries:

  • Use comments to explain the purpose and logic of complex queries.

-- Calculate the average fare per passenger class
SELECT Pclass, AVG(Fare) AS average_fare
FROM titanic_passengers
GROUP BY Pclass;        


Conclusion

In this tutorial, we've delved into the fundamental data analysis capabilities of both SQL and Python Pandas using the Titanic dataset. By comparing equivalent operations in both tools, we've highlighted their respective strengths and demonstrated how they can complement each other in a data analyst's toolkit.

Key Takeaways:

  • SQL is unparalleled for querying and managing data within relational databases. Its declarative nature allows for efficient data retrieval and manipulation, especially with large datasets.
  • Python Pandas offers flexibility and a rich set of functions for in-memory data manipulation, making it ideal for exploratory data analysis, data cleaning, and more complex transformations.
  • Understanding both SQL and Pandas empowers you to handle data in various environments, from traditional databases to modern data science workflows.
  • Adhering to best practices in both SQL and Pandas ensures your data analysis is efficient, readable, and maintainable.


Additional Resources

To further enhance your understanding of SQL and Python Pandas, consider exploring the following resources:

SQL Resources:

  • SQL Tutorial by W3Schools: A comprehensive guide covering SQL basics to advanced topics.

https://www.w3schools.com/sql/

  • LeetCode SQL Problems: Practice SQL queries with real-world scenarios.

https://leetcode.com/discuss/general-discussion/1208129/list-of-free-leetcode-sql-questions

  • Books: "SQL in 10 Minutes, Sams Teach Yourself" by Ben Forta

https://www.amazon.co.uk/SQL-Minutes-Sams-Teach-Yourself/dp/0672336073

  • "Learning SQL" by Alan Beaulieu

https://www.amazon.co.uk/Learning-SQL-Alan-Beaulieu/dp/0596520832

Python Pandas Resources

  • Pandas Official Documentation: Detailed documentation with examples and usage guidelines.

https://pandas.pydata.org/docs/

  • Pandas Tutorial by DataCamp: Interactive tutorials to practice Pandas operations.

https://www.datacamp.com/tutorial/pandas

  • Books: "Python for Data Analysis" by Wes McKinney (Creator of Pandas)

https://www.amazon.co.uk/Python-Data-Analysis-Wes-Mckinney/dp/1491957662

  • "Pandas Cookbook" by Theodore Petrou

https://www.amazon.co.uk/Pandas-Cookbook-Scientific-Computing-Visualization/dp/1784393878


Combined SQL and Pandas Resources

  • Kaggle Datasets and Notebooks: Explore a variety of datasets and see how others utilize SQL and Pandas for data analysis.

https://www.kaggle.com/


Happy data analyzing!



Neven Dujmovic, October 2024



#Python #PythonPandas #SQL #DataScience #TitanicDataset #DataAnalysis #LearnSQL #PythonProgramming #AnalyticsTools #DataManipulation
