How to Create a Python Image Scraper: A Comprehensive Step-by-Step Tutorial
Jarrel Thomas
Email Security Analyst | Security+ | ISC2 CC | Incident Response // Making the digital world a safer place
Introduction
In the midst of my cybersecurity journey, like most people in the field, I find myself with my hands in many projects. One such project was a content moderation application I was building with a colleague. We needed a vast array of images to rigorously test the application's ability to filter and moderate content effectively. But rather than spending endless hours manually scouring the internet for images, I took the route every savvy IT professional would choose... I automated the process! This not only saved time but also added a powerful new tool to my arsenal. Here's how I did it—and how you can too.
In this guide, you'll learn how to build an image scraper using Python. This script will allow you to fetch images from a web search based on a query and save them to your local machine. We'll walk through the code, address common errors you might encounter, and explain the rationale behind the choices made in the script.
Requirements
Before starting, make sure you have the following installed:

- Python 3.x
- requests (for sending HTTP requests)
- beautifulsoup4 (for parsing HTML)
- Pillow (for working with images)

You can install the required libraries using pip:
pip install requests beautifulsoup4 pillow
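If you want to confirm the installs worked, a quick sanity check (my own habit, not part of the original setup) is to import all three packages in one go:

python -c "import requests, bs4, PIL; print('All set!')"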
Step 1: Setting Up the Script
The script begins by importing all the necessary libraries:
import requests
from bs4 import BeautifulSoup
import os
from urllib.parse import urljoin
from PIL import Image
from io import BytesIO
import mimetypes
Why These Libraries?

- requests: sends the HTTP requests to the search engine and to each image URL.
- bs4 (BeautifulSoup): parses the returned HTML so we can find the <img> tags.
- os: creates directories and manages file paths on your machine.
- urllib.parse.urljoin: turns relative image URLs into absolute ones.
- PIL (Pillow): opens the downloaded bytes and identifies each image's format.
- io.BytesIO: wraps raw bytes in a file-like object that Pillow can open.
- mimetypes: maps HTTP content types to file extensions as a fallback.
Step 2: Fetching Image URLs
Next, we need to create a function that will fetch image URLs based on a search query:
def fetch_images(query, num_images=500):
    # Build the Bing image-search URL, replacing spaces with '+'
    url = f"https://www.bing.com/images/search?q={query.replace(' ', '+')}"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    img_urls = []
    for img_tag in soup.find_all("img"):
        img_url = img_tag.get("src")
        # Skip missing src attributes and inline base64 "data:" URIs
        if img_url and not img_url.startswith("data:"):
            # Convert relative URLs to absolute ones
            img_url = urljoin(url, img_url)
            img_urls.append(img_url)
        # Stop once we've collected enough URLs
        if len(img_urls) >= num_images:
            break
    save_images(img_urls, query)
Breakdown
The "fetch_images" function is designed to automate the process of searching and downloading images from a search engine (Bing in this case) based on a given query. The function starts by constructing a search URL, where the query is formatted for a web search by replacing spaces with +. It then sends an HTTP GET request to retrieve the HTML content of the search results page, where then the HTML is parsed using the BeautifulSoup libary.
Within the HTML, the function locates all image tags (<img>) and extracts their src attributes, which contain the image URLs. It filters out missing sources and inline base64 URLs (those starting with "data:"), and converts any relative URLs into absolute ones using the urljoin method. The valid image URLs are then stored in a list until the desired number of images is reached.
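To make the urljoin step concrete, here's a quick illustration (the thumbnail paths are made up for the example):

from urllib.parse import urljoin

base = "https://www.bing.com/images/search?q=puppies"

# A relative src becomes an absolute URL on the same host
print(urljoin(base, "/th?id=OIP.example"))
# -> https://www.bing.com/th?id=OIP.example

# An already-absolute src is returned unchanged
print(urljoin(base, "https://tse1.mm.bing.net/th?id=OIP.example"))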
Finally, the function calls the "save_images" function, which handles downloading and saving the images to your local file system. This process efficiently gathers the specified number of images for the given query, making it ideal for tasks requiring bulk image collection.
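One caveat worth knowing: search engines sometimes serve stripped-down HTML (or block the request outright) when they see the default requests user agent. If that happens, a possible workaround is sending a browser-like User-Agent header; the header string below is just an example value, not something the original script requires:

import requests

url = "https://www.bing.com/images/search?q=puppies"
headers = {
    # Example browser-like user agent; any modern browser string works
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    )
}
response = requests.get(url, headers=headers, timeout=10)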
Step 3: Saving the Images
The following function saves the images to your local machine:
def save_images(img_urls, query):
    # Create a per-query folder (e.g., images/puppies) if it doesn't exist
    directory = f"images/{query}"
    if not os.path.exists(directory):
        os.makedirs(directory)
    for i, img_url in enumerate(img_urls):
        try:
            # Download the raw image bytes and open them with Pillow
            img_data = requests.get(img_url).content
            img = Image.open(BytesIO(img_data))
            img_format = img.format.lower()
            # Fall back to the HTTP content type if the format is unexpected
            if img_format not in ['jpeg', 'png', 'gif']:
                content_type = requests.head(img_url).headers.get("content-type")
                img_format = mimetypes.guess_extension(content_type.split(";")[0]).lstrip(".")
            img.save(f"{directory}/{query}_{i}.{img_format}")
            print(f"Saved image {i+1} as {query}_{i}.{img_format}")
        except Exception as e:
            # Keep going even if one image fails to download or save
            print(f"Could not save {img_url}: {e}")
Breakdown
The "save_images" function is responsible for organizing and saving the images downloaded from the web. It first checks if a directory exists for the specified query, if one does not exists it creates one. This ensures that all images related to a particular query are stored in a dedicated folder, keeping your files well-organized.
For each image URL in the list, the function downloads the image data and opens it using the Python Imaging Library (PIL). The format of the image is identified and converted to lowercase for consistency. If the image format is unrecognized, the function attempts to determine it using the content type from the image's HTTP headers. Once the format is confirmed, the image is saved to the designated directory with a unique filename.
The function also includes error handling, where any issues encountered during the download or saving process are caught and reported. This ensures that the script continues running even if some images can't be saved, providing robust and reliable performance when dealing with large sets of images.
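If you want slightly sturdier behavior, two small tweaks I'd suggest (my own additions, not part of the script above) are letting os.makedirs tolerate an existing folder and adding a timeout so one slow host can't hang the whole run. The image URL below is just a placeholder:

import os
import requests

# exist_ok=True removes the need for a separate os.path.exists check
os.makedirs("images/puppies", exist_ok=True)

# A timeout stops a single unresponsive server from stalling the loop
img_data = requests.get("https://example.com/image.jpg", timeout=10).content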
Step 4: Running the Script
To run the script, use the following block:
if __name__ == "__main__":
    query = "puppies"
    fetch_images(query, num_images=500)
Breakdown
The if __name__ == "__main__": block serves as the entry point for the script, ensuring that the code within it is only executed when the script is run directly, not when it's imported as a module. Inside this block, a specific query (in this case, "puppies") is defined, and the fetch_images function is called with this query to start the process of gathering and saving images. This structure keeps the script flexible; you can easily change the query or adjust the number of images to download without altering the core functions. It provides a clear starting point for users to customize the script for their own needs.
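If you'd rather pass the query from the command line than edit the file each time, a minimal sketch using argparse (my own variation, assuming it lives in the same file as fetch_images) could look like this:

import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Download images for a search query")
    parser.add_argument("query", help="search term, e.g. puppies")
    parser.add_argument("--num-images", type=int, default=500, help="number of images to fetch")
    args = parser.parse_args()
    # Hand the parsed arguments to the existing scraper function
    fetch_images(args.query, num_images=args.num_images)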
Common Errors and Troubleshooting
As with any programming project, there were a number of issues and bugs along the way in developing this script. Here are some of the most common errors I encountered and how I resolved them:
Error: ModuleNotFoundError: No module named 'bs4'
Fix: BeautifulSoup isn't installed. Run pip install beautifulsoup4 (note that the package name differs from the import name, bs4).

Error: ModuleNotFoundError: No module named 'requests'
Fix: run pip install requests. If you're using a virtual environment, make sure it's activated before running the script.

Error: ModuleNotFoundError: No module named 'PIL'
Fix: the PIL import is provided by the Pillow package, so run pip install pillow.

Error: requests.exceptions.ConnectTimeout: HTTPSConnectionPool
Fix: a request couldn't reach the host in time. Check your connection, pass a timeout to requests.get, and consider retrying failed requests, as in the sketch below.
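For that last timeout error, here's a minimal sketch of retry handling using requests' Session and urllib3's Retry helper (the retry counts and timeout values are just example numbers, not something the original script used):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry transient failures up to 3 times with exponential backoff
retries = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))
session.mount("http://", HTTPAdapter(max_retries=retries))

# timeout=(connect, read) keeps one slow host from stalling the whole run
response = session.get("https://www.bing.com/images/search?q=puppies", timeout=(5, 15))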
Final Thoughts
This image scraper is a great hands-on project for any beginner coder looking for something to do over the weekend. It also serves as a practical tool that combines web scraping, data handling, and file management in Python, skills that carry over to future projects. By following this guide, I hope you've gained some valuable insight into the libraries used and strengthened your own Python knowledge.
Feel free to modify and expand the script to suit your needs. Happy Coding!