Web scraping project using Docker, Selenium, proxies, login, JavaScript, regex, and Python.
Create a web scraping project using Docker, Selenium, proxies, login, JavaScript, regex, and Python. Here is an example of how you could set up such a project.
Note: This is only a sample, and you should consider the terms of use of the website you want to scrape and the legal aspects of web scraping.
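One possible project layout, using the three files shown in this article (the directory name is arbitrary):

web-scraper/
├── Dockerfile
├── requirements.txt
└── scraper.py

Here is an example of how some of the components might be implemented, starting with the scraper itself: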
# scraper.py
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from bs4 import BeautifulSoup
import re
import time

# Set up proxy (replace with your proxy server's host:port)
proxy = webdriver.Proxy()
proxy.http_proxy = "proxy_server:port"

# Add the proxy and optional performance logging to the Chrome capabilities
capabilities = DesiredCapabilities.CHROME.copy()
capabilities['goog:loggingPrefs'] = {'performance': 'ALL'}
proxy.add_to_capabilities(capabilities)

# Set a custom user agent through Chrome options
user_agent = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36")
options = Options()
options.add_argument(f"--user-agent={user_agent}")

# Create webdriver
driver = webdriver.Chrome(desired_capabilities=capabilities, options=options)

# Log in to website
driver.get("https://example.com/login")
username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")
username.send_keys("your_username")
password.send_keys("your_password")
driver.find_element_by_id("login_button").click()

# Navigate to the page to scrape
driver.get("https://example.com/scrape_me")

# Wait for JavaScript to render (an explicit WebDriverWait is more robust)
time.sleep(5)

# Use BeautifulSoup to parse the rendered HTML and extract desired fields
soup = BeautifulSoup(driver.page_source, 'html.parser')
fields = soup.find_all("div", class_="field")
for field in fields:
    field_text = field.text
    match = re.search(r"field_name: (\w+)", field_text)
    if match:
        field_name = match.group(1)
        print(field_name)

# Close webdriver
driver.quit()
You could also encapsulate this code in a function and call it in a loop to scrape multiple pages, as shown below.
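A minimal sketch of that pattern, reusing the driver and imports from above (the scrape_page helper and the URL list here are hypothetical):

def scrape_page(driver, url):
    # Load the page and give client-side JavaScript time to render
    driver.get(url)
    time.sleep(5)
    # Parse the rendered HTML and collect the matching field names
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    results = []
    for field in soup.find_all("div", class_="field"):
        match = re.search(r"field_name: (\w+)", field.text)
        if match:
            results.append(match.group(1))
    return results

# Hypothetical list of pages behind the same login session
urls = ["https://example.com/scrape_me?page=1",
        "https://example.com/scrape_me?page=2"]
for url in urls:
    print(scrape_page(driver, url))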
Make sure to install the necessary packages and adapt the code to your specific needs.
Please note that using a proxy, a dynamic user agent, and dynamic headers can help you avoid being detected as a scraper, but it is not a 100% guarantee: websites can still track and block you even if you are using a proxy and changing the user agent.
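To reduce that risk further, one common approach is to rotate through pools of proxies and user agents, picking a fresh combination for each browser session. A minimal sketch, assuming you maintain your own pools (the values below are placeholders):

import random

# Hypothetical pools -- replace with your own proxy servers and user agents
PROXIES = ["proxy1.example.com:8080", "proxy2.example.com:8080"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
]

def build_driver():
    # Pick a random proxy and user agent for this session
    options = Options()
    options.add_argument(f"--proxy-server={random.choice(PROXIES)}")
    options.add_argument(f"--user-agent={random.choice(USER_AGENTS)}")
    return webdriver.Chrome(options=options)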
Here is an example of a Dockerfile that can be used to build a Docker image for a web scraping project using Selenium and Python:
# Use an official Python runtime as the base image
FROM python:3.9
# Set the working directory
WORKDIR /app
# Copy the requirements file into the container
COPY requirements.txt .
# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt
# Copy the application code into the container
COPY . .
# Run the command to start the scraping script
CMD ["python", "scraper.py"]
This Dockerfile starts from a Python 3.9 base image, sets the working directory to /app, copies the requirements file and installs the listed packages, copies the application code into the container, and runs the scraping script with the command python scraper.py.
You will need a requirements.txt file in the same directory as the Dockerfile, listing all the packages your script needs, for example:
beautifulsoup4==4.9.3
requests==2.24.0
selenium==3.141.0
You can build the image with the command docker build -t <image_name> . and then run the container with the command docker run -it <image_name>.
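For example, from the directory containing the Dockerfile (my-scraper is just a placeholder image name):

docker build -t my-scraper .
docker run -it my-scraper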
Please note that you need Docker installed on your system to use this file.
Docker installation differs by operating system (Windows, macOS, Linux). On Windows and macOS the usual route is Docker Desktop; on Linux the process depends on the distribution you're using, and the official documentation at docs.docker.com covers each platform.
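For example, on Debian- or Ubuntu-based systems, one common route is the distribution's own package (check docs.docker.com for the currently recommended method for your distribution):

sudo apt-get update
sudo apt-get install docker.io
sudo systemctl enable --now docker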
Please note that you need a stable internet connection to pull the base images, and in some cases an account on hub.docker.com (for example, to pull private images or to raise the anonymous pull rate limits).