Web scraping project using Docker, Selenium, proxies, login, JavaScript, regex, and Python.
Create a web scraping project using Docker, Selenium, proxies, login, JavaScript, regex, and Python. Here is an example of how you could set up such a project.
Note: This is only a sample, and you should consider the terms of use of the website you want to scrape and the legal aspects of web scraping.
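One possible project layout, using the three files shown in this article (the directory name is arbitrary):

web-scraper/
├── Dockerfile
├── requirements.txt
└── scraper.py

Here is an example of how some of the components might be implemented, starting with the scraper itself: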
# scraper.py
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from bs4 import BeautifulSoup
import re
import time

# Set up proxy (replace with your proxy server's host:port)
proxy = webdriver.Proxy()
proxy.http_proxy = "proxy_server:port"

# Add the proxy and optional performance logging to the Chrome capabilities
capabilities = DesiredCapabilities.CHROME.copy()
capabilities['goog:loggingPrefs'] = {'performance': 'ALL'}
proxy.add_to_capabilities(capabilities)

# Set a custom user agent through Chrome options
user_agent = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36")
options = Options()
options.add_argument(f"--user-agent={user_agent}")

# Create webdriver
driver = webdriver.Chrome(desired_capabilities=capabilities, options=options)

# Log in to website
driver.get("https://example.com/login")
username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")
username.send_keys("your_username")
password.send_keys("your_password")
driver.find_element_by_id("login_button").click()

# Navigate to the page to scrape
driver.get("https://example.com/scrape_me")

# Wait for JavaScript to render (an explicit WebDriverWait is more robust)
time.sleep(5)

# Use BeautifulSoup to parse the rendered HTML and extract desired fields
soup = BeautifulSoup(driver.page_source, 'html.parser')
fields = soup.find_all("div", class_="field")
for field in fields:
    field_text = field.text
    match = re.search(r"field_name: (\w+)", field_text)
    if match:
        field_name = match.group(1)
        print(field_name)

# Close webdriver
driver.quit()
You could also encapsulate this code in a function and call it in a loop to scrape multiple pages, as shown below.
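A minimal sketch of that pattern, reusing the driver and imports from above (the scrape_page helper and the URL list here are hypothetical):

def scrape_page(driver, url):
    # Load the page and give client-side JavaScript time to render
    driver.get(url)
    time.sleep(5)
    # Parse the rendered HTML and collect the matching field names
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    results = []
    for field in soup.find_all("div", class_="field"):
        match = re.search(r"field_name: (\w+)", field.text)
        if match:
            results.append(match.group(1))
    return results

# Hypothetical list of pages behind the same login session
urls = ["https://example.com/scrape_me?page=1",
        "https://example.com/scrape_me?page=2"]
for url in urls:
    print(scrape_page(driver, url))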
Make sure to install the necessary packages and adapt the code to your specific needs.
Please note that using a proxy, a dynamic user agent, and dynamic headers can help you avoid being detected as a scraper, but it is not a 100% guarantee: websites can still track and block you even if you are using a proxy and changing the user agent.
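To reduce that risk further, one common approach is to rotate through pools of proxies and user agents, picking a fresh combination for each browser session. A minimal sketch, assuming you maintain your own pools (the values below are placeholders):

import random

# Hypothetical pools -- replace with your own proxy servers and user agents
PROXIES = ["proxy1.example.com:8080", "proxy2.example.com:8080"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
]

def build_driver():
    # Pick a random proxy and user agent for this session
    options = Options()
    options.add_argument(f"--proxy-server={random.choice(PROXIES)}")
    options.add_argument(f"--user-agent={random.choice(USER_AGENTS)}")
    return webdriver.Chrome(options=options)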
Here is an example of a Dockerfile that can be used to build a Docker image for a web scraping project using Selenium and Python:
# Use an official Python runtime as the base image
FROM python:3.9
# Set the working directory
WORKDIR /app
# Copy the requirements file into the container
COPY requirements.txt .
# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt
# Copy the application code into the container
COPY . .
# Run the command to start the scraping script
CMD ["python", "scraper.py"]
This Dockerfile starts from a Python 3.9 base image, sets the working directory to /app, copies the requirements file and installs the listed packages, copies the application code into the container, and runs the scraping script with the command python scraper.py.
You will need a requirements.txt file in the same directory as the Dockerfile, listing all the packages your script needs, for example:
beautifulsoup4==4.9.3
requests==2.24.0
selenium==3.141.0
You can build the image with the command docker build -t <image_name> . and then run the container with the command docker run -it <image_name>.
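For example, from the directory containing the Dockerfile (my-scraper is just a placeholder image name):

docker build -t my-scraper .
docker run -it my-scraper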
Please note that you need Docker installed on your system to use this file.
Docker installation differs by operating system (Windows, macOS, Linux). On Windows and macOS the usual route is Docker Desktop; on Linux the process depends on the distribution you're using, and the official documentation at docs.docker.com covers each platform.
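For example, on Debian- or Ubuntu-based systems, one common route is the distribution's own package (check docs.docker.com for the currently recommended method for your distribution):

sudo apt-get update
sudo apt-get install docker.io
sudo systemctl enable --now docker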
Please note that you need a stable internet connection to pull the base images, and in some cases an account on hub.docker.com (for example, to pull private images or to raise the anonymous pull rate limits).