
Web scraping in Python

Mansoor Ahmed

BSc. at University of Engineering and Technology, Lahore

Published: January 11, 2022

Introduction

Web scraping in Python is the process of extracting information from websites. This can be done by hand, but it is usually faster, better organized, and less error-prone to automate the task.

Through web scraping, we can take non-tabular or poorly structured data from websites and convert it into a structured, usable format, such as a CSV file or spreadsheet.
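
As a minimal sketch of this conversion, the snippet below turns a few already-extracted records into CSV text with Python's built-in csv module. The helper name records_to_csv, the field names, and the sample records are made up for illustration and do not come from any real site.

```python
import csv
import io


def records_to_csv(records, fieldnames):
    """Return a list of dicts as CSV text with a header row."""
    # newline="" lets the csv module control the line endings itself.
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical records, as they might look after being pulled out of a page.
profiles = [
    {"name": "Aphrodite", "hometown": "Mount Olympus"},
    {"name": "Poseidon", "hometown": "Sea"},
]

csv_text = records_to_csv(profiles, ["name", "hometown"])
print(csv_text)
```

To save the result as a file, the same text could be written out with open("profiles.csv", "w", newline=""), where the file name is again only an example.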

Similarly, web scraping can help us archive data and track changes to data online. In this article, we will look at the web scraping technique in depth.

Description

Web scraping is used in many fields to gather data that is not readily available in other formats. It is a valuable tool even for a casual programmer. For example, we might want to check the latest homework assignments on our university page and have them emailed to us. Web scraping targets particular information on the pages visited.

Methods of Web Scraping

There are two methods of extracting data from websites.

The Manual extraction method
The automated extraction method

The Manual extraction method

In this method, we manually copy and paste the site content. Although tedious and time-consuming, it is an effective way to scrape data from sites that have strong anti-scraping measures such as bot detection.

The automated extraction method

Automated web scraping is the automation of the data extraction process from websites. It is carried out with web scraping software known as web scrapers, which automatically load and extract data from websites based on user needs. A scraper may be custom-built to work for one site, or it may be configurable to work with any website.
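
To illustrate what a configurable scraper might look like, here is a minimal sketch built on html.parser from Python's standard library, a module this article does not otherwise use. The class name TagTextCollector and the sample HTML are assumptions made for the example. The tag to collect is passed in as a parameter, so the same class can be pointed at different pages.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag        # tag name to look for, e.g. "h2"
        self._depth = 0       # how many matching tags are currently open
        self.results = []     # one string per matching element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._depth += 1
            self.results.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        # Only keep text that appears inside the tag we care about.
        if self._depth > 0:
            self.results[-1] += data


# Hypothetical HTML standing in for a downloaded page.
sample_html = "<h1>News</h1><h2>First item</h2><p>Details</p><h2>Second item</h2>"

collector = TagTextCollector("h2")
collector.feed(sample_html)
print(collector.results)
```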

Web Scraping Tools

Web scraping tools are developed specifically for extracting data from the internet. They are also known as web harvesting tools or data extraction tools. These tools are useful for anyone who needs to gather particular data from websites, because they deliver structured data extracted from a number of websites. Below are some of the most common web scraping tools:

io
Scrapinghub
io
Parsehub
io

Python Web Scraper

A suitable package for web scraping, urllib, is part of Python's standard library.
It contains tools for working with URLs.
The urllib.request module contains a function called urlopen().
It can be used to open a URL within a program.
Type the following to import urlopen() in IDLE's interactive window.

>>> from urllib.request import urlopen

The web page that we'll open is at the following URL:

>>> url = "https://olympus.realpython.org/profiles/aphrodite"

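We can pass url to urlopen() to open the web page. Calling .read() on the returned object gives the page's bytes, and decoding them as UTF-8 gives the HTML as a string, which the following examples store in a variable named html.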
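
The open, read, and decode steps can be wrapped in a small helper. This is only a sketch: the function name fetch_html and the default_encoding parameter are assumptions, and the helper prefers the character set announced in the response headers when there is one.

```python
from urllib.request import urlopen


def fetch_html(url, default_encoding="utf-8"):
    """Open url and return the response body decoded to a string."""
    with urlopen(url) as page:
        # Use the charset declared in the response headers if present.
        encoding = page.headers.get_content_charset() or default_encoding
        html_bytes = page.read()
    return html_bytes.decode(encoding)


# Example call (needs network access, so it is left commented out):
# html = fetch_html("https://olympus.realpython.org/profiles/aphrodite")
```

The with statement closes the connection as soon as the body has been read, even if decoding fails afterwards.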


Text extraction from HTML by using String Methods

We can use string methods to extract information from a web page's HTML.
For example, we can use .find() to search through the text of the HTML for the <title> tags and extract the title of the web page.
We can use a string slice to extract the title if we know the index of the first character of the title and the index of the first character of the closing </title> tag.
Because .find() returns the index of the first occurrence of a substring, we can get the index of the opening <title> tag by passing the string "<title>" to .find().

>>> title_index = html.find("<title>")
>>> title_index

We do not want the index of the <title> tag itself, however. We want the index of the title text.
We can add the length of the string "<title>" to title_index to get the index of the first letter in the title.

>>> start_index = title_index + len("<title>")
>>> start_index

Next, get the index of the closing </title> tag by passing the string "</title>" to .find().

>>> end_index = html.find("</title>")
>>> end_index

Lastly, we can extract the title by slicing the html string:

>>> title = html[start_index:end_index]
>>> title
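
The steps above can be collected into one function. This is a minimal sketch: the name extract_title and the sample HTML are assumptions for illustration, and the checks for -1 are an addition, since .find() returns -1 when the substring is not present.

```python
def extract_title(html):
    """Return the text between <title> and </title>, or None if a tag is missing."""
    open_tag = "<title>"
    close_tag = "</title>"

    title_index = html.find(open_tag)
    if title_index == -1:
        return None

    # Skip past the opening tag to reach the first character of the title.
    start_index = title_index + len(open_tag)

    # Search for the closing tag only after the opening tag.
    end_index = html.find(close_tag, start_index)
    if end_index == -1:
        return None

    return html[start_index:end_index]


# Hypothetical HTML standing in for a downloaded page.
sample_html = "<html><head><title>Sample Page</title></head><body></body></html>"
print(extract_title(sample_html))  # prints: Sample Page
```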

Below is another profile page with some messier HTML that we can scrape:

>>> url = "https://olympus.realpython.org/profiles/poseidon"

Extract the title from this new URL using the same method as before.

>>> url = "https://olympus.realpython.org/profiles/poseidon"
>>> page = urlopen(url)
>>> html = page.read().decode("utf-8")
>>> start_index = html.find("<title>") + len("<title>")
>>> end_index = html.find("</title>")
>>> title = html[start_index:end_index]
>>> title
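
With messy HTML, exact string matching can fail quietly. If the opening tag is written with extra whitespace or in capitals, for example <TITLE >, then .find("<title>") returns -1 and the slice starts at the wrong position instead of raising an error. One possible way to tolerate such variations is a regular expression. The sketch below is an assumption about how that could be done, not the method used above, and the function name and sample string are made up.

```python
import re

# Allow anything except ">" after the tag name, ignore case,
# and let the title span several lines.
TITLE_PATTERN = re.compile(
    r"<title[^>]*>(.*?)</title[^>]*>",
    re.IGNORECASE | re.DOTALL,
)


def extract_title_tolerant(html):
    """Return the title text even if the tags contain extra characters."""
    match = TITLE_PATTERN.search(html)
    if match is None:
        return None
    return match.group(1).strip()


# Hypothetical messy HTML, written to mimic sloppy markup.
messy_html = "<html>\n<head>\n<TITLE >Profile: Poseidon</title  / >\n</head>"

print(messy_html.find("<title>"))          # prints: -1
print(extract_title_tolerant(messy_html))  # prints: Profile: Poseidon
```

Regular expressions are still fragile on real-world pages, so for anything beyond a quick experiment a dedicated HTML parser is usually the safer choice.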

Conclusion

In this article, we learned about web scraping and its methods, and how to request a web page with Python's built-in urllib library.
Writing automated web scraping programs is fun.
The Internet has no shortage of content that can lead to all kinds of exciting projects.
We must check a website's terms of use before we start scraping it.

For more details visit: https://www.technologiesinindustry4.com/2022/01/web-scraping-in-python.html
