How to Create a Python Image Scraper: A Comprehensive Step-by-Step Tutorial
Jarrel Thomas
Email Security Analyst | Security+ | ISC2 CC | Incident Response // Making the digital world a safer place
Introduction
In the midst of my cybersecurity journey, like most people in the field, I find myself with my hands in many projects. One such project was a content moderation application I was building with a colleague. We needed a vast array of images to rigorously test the application's ability to filter and moderate content effectively. But rather than spending endless hours manually scouring the internet for images, I took the route every savvy IT professional would choose... I automated the process! This not only saved time but also added a powerful new tool to my arsenal. Here's how I did it—and how you can too.
In this guide, you'll learn how to build an image scraper using Python. This script will allow you to fetch images from a web search based on a query and save them to your local machine. We'll walk through the code, address common errors you might encounter, and explain the rationale behind the choices made in the script.
Requirements
Before starting, make sure you have the following installed:

- Python 3.x
- requests (for sending HTTP requests)
- beautifulsoup4 (for parsing HTML)
- Pillow (for working with images)

You can install the required libraries using pip:
pip install requests beautifulsoup4 pillow
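If you want to confirm the installs worked, a quick sanity check (my own habit, not part of the original setup) is to import all three packages in one go:

python -c "import requests, bs4, PIL; print('All set!')"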
Step 1: Setting Up the Script
The script begins by importing all the necessary libraries:
import requests
from bs4 import BeautifulSoup
import os
from urllib.parse import urljoin
from PIL import Image
from io import BytesIO
import mimetypes
Why These Libraries?

- requests: sends the HTTP requests to the search engine and to each image URL.
- bs4 (BeautifulSoup): parses the returned HTML so we can find the <img> tags.
- os: creates directories and manages file paths on your machine.
- urllib.parse.urljoin: turns relative image URLs into absolute ones.
- PIL (Pillow): opens the downloaded bytes and identifies each image's format.
- io.BytesIO: wraps raw bytes in a file-like object that Pillow can open.
- mimetypes: maps HTTP content types to file extensions as a fallback.
Step 2: Fetching Image URLs
Next, we need to create a function that will fetch image URLs based on a search query:
def fetch_images(query, num_images=500):
    # Build the Bing image-search URL, replacing spaces with '+'
    url = f"https://www.bing.com/images/search?q={query.replace(' ', '+')}"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    img_urls = []
    for img_tag in soup.find_all("img"):
        img_url = img_tag.get("src")
        # Skip missing src attributes and inline base64 "data:" URIs
        if img_url and not img_url.startswith("data:"):
            # Convert relative URLs to absolute ones
            img_url = urljoin(url, img_url)
            img_urls.append(img_url)
        # Stop once we've collected enough URLs
        if len(img_urls) >= num_images:
            break
    save_images(img_urls, query)
Breakdown
The "fetch_images" function is designed to automate the process of searching and downloading images from a search engine (Bing in this case) based on a given query. The function starts by constructing a search URL, where the query is formatted for a web search by replacing spaces with +. It then sends an HTTP GET request to retrieve the HTML content of the search results page, where then the HTML is parsed using the BeautifulSoup libary.
Within the HTML, the function locates all image tags (<img>) and extracts their src attributes, which contain the image URLs. It filters out missing sources and inline base64 URLs (those starting with "data:"), and converts any relative URLs into absolute ones using the urljoin method. The valid image URLs are then stored in a list until the desired number of images is reached.
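To make the urljoin step concrete, here's a quick illustration (the thumbnail paths are made up for the example):

from urllib.parse import urljoin

base = "https://www.bing.com/images/search?q=puppies"

# A relative src becomes an absolute URL on the same host
print(urljoin(base, "/th?id=OIP.example"))
# -> https://www.bing.com/th?id=OIP.example

# An already-absolute src is returned unchanged
print(urljoin(base, "https://tse1.mm.bing.net/th?id=OIP.example"))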
Finally, the function calls the "save_images" function, which handles downloading and saving the images to your local file system. This process efficiently gathers the specified number of images for the given query, making it ideal for tasks requiring bulk image collection.
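One caveat worth knowing: search engines sometimes serve stripped-down HTML (or block the request outright) when they see the default requests user agent. If that happens, a possible workaround is sending a browser-like User-Agent header; the header string below is just an example value, not something the original script requires:

import requests

url = "https://www.bing.com/images/search?q=puppies"
headers = {
    # Example browser-like user agent; any modern browser string works
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    )
}
response = requests.get(url, headers=headers, timeout=10)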
Step 3: Saving the Images
The following function saves the images to your local machine:
def save_images(img_urls, query):
    # Create a per-query folder (e.g., images/puppies) if it doesn't exist
    directory = f"images/{query}"
    if not os.path.exists(directory):
        os.makedirs(directory)
    for i, img_url in enumerate(img_urls):
        try:
            # Download the raw image bytes and open them with Pillow
            img_data = requests.get(img_url).content
            img = Image.open(BytesIO(img_data))
            img_format = img.format.lower()
            # Fall back to the HTTP content type if the format is unexpected
            if img_format not in ['jpeg', 'png', 'gif']:
                content_type = requests.head(img_url).headers.get("content-type")
                img_format = mimetypes.guess_extension(content_type.split(";")[0]).lstrip(".")
            img.save(f"{directory}/{query}_{i}.{img_format}")
            print(f"Saved image {i+1} as {query}_{i}.{img_format}")
        except Exception as e:
            # Keep going even if one image fails to download or save
            print(f"Could not save {img_url}: {e}")
Breakdown
The "save_images" function is responsible for organizing and saving the images downloaded from the web. It first checks if a directory exists for the specified query, if one does not exists it creates one. This ensures that all images related to a particular query are stored in a dedicated folder, keeping your files well-organized.
For each image URL in the list, the function downloads the image data and opens it using the Python Imaging Library (PIL). The format of the image is identified and converted to lowercase for consistency. If the image format is unrecognized, the function attempts to determine it using the content type from the image's HTTP headers. Once the format is confirmed, the image is saved to the designated directory with a unique filename.
The function also includes error handling, where any issues encountered during the download or saving process are caught and reported. This ensures that the script continues running even if some images can't be saved, providing robust and reliable performance when dealing with large sets of images.
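If you want slightly sturdier behavior, two small tweaks I'd suggest (my own additions, not part of the script above) are letting os.makedirs tolerate an existing folder and adding a timeout so one slow host can't hang the whole run. The image URL below is just a placeholder:

import os
import requests

# exist_ok=True removes the need for a separate os.path.exists check
os.makedirs("images/puppies", exist_ok=True)

# A timeout stops a single unresponsive server from stalling the loop
img_data = requests.get("https://example.com/image.jpg", timeout=10).content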
Step 4: Running the Script
To run the script, use the following block:
if __name__ == "__main__":
    query = "puppies"
    fetch_images(query, num_images=500)
Breakdown
The if __name__ == "__main__": block serves as the entry point for the script, ensuring that the code within it is only executed when the script is run directly, not when it's imported as a module. Inside this block, a specific query (in this case, "puppies") is defined, and the fetch_images function is called with this query to start the process of gathering and saving images. This structure keeps the script flexible; you can easily change the query or adjust the number of images to download without altering the core functions. It provides a clear starting point for users to customize the script for their own needs.
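If you'd rather pass the query from the command line than edit the file each time, a minimal sketch using argparse (my own variation, assuming it lives in the same file as fetch_images) could look like this:

import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Download images for a search query")
    parser.add_argument("query", help="search term, e.g. puppies")
    parser.add_argument("--num-images", type=int, default=500, help="number of images to fetch")
    args = parser.parse_args()
    # Hand the parsed arguments to the existing scraper function
    fetch_images(args.query, num_images=args.num_images)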
Common Errors and Troubleshooting
As with any programming project, there were a number of issues and bugs along the way in developing this script. Here are some of the most common errors I encountered and how I resolved them:
Error: ModuleNotFoundError: No module named 'bs4'
Fix: BeautifulSoup isn't installed. Run pip install beautifulsoup4 (note that the package name differs from the import name, bs4).

Error: ModuleNotFoundError: No module named 'requests'
Fix: run pip install requests. If you're using a virtual environment, make sure it's activated before running the script.

Error: ModuleNotFoundError: No module named 'PIL'
Fix: the PIL import is provided by the Pillow package, so run pip install pillow.

Error: requests.exceptions.ConnectTimeout: HTTPSConnectionPool
Fix: a request couldn't reach the host in time. Check your connection, pass a timeout to requests.get, and consider retrying failed requests, as in the sketch below.
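For that last timeout error, here's a minimal sketch of retry handling using requests' Session and urllib3's Retry helper (the retry counts and timeout values are just example numbers, not something the original script used):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry transient failures up to 3 times with exponential backoff
retries = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))
session.mount("http://", HTTPAdapter(max_retries=retries))

# timeout=(connect, read) keeps one slow host from stalling the whole run
response = session.get("https://www.bing.com/images/search?q=puppies", timeout=(5, 15))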
Final Thoughts
This image scraper is a great hands-on project for any beginner coder looking for something to do over the weekend. It also serves as a practical tool that combines web scraping, data handling, and file management in Python, skills that carry over to future projects. By following this guide, I hope you've gained some valuable insight into the libraries used and strengthened your own Python knowledge.
Feel free to modify and expand the script to suit your needs. Happy Coding!