Web scraping in Python

Introduction

Web scraping in Python is the process of extracting information from websites. This can be done by hand, but automating the task is usually faster, better organized, and less error-prone.

Through web scraping, we can collect non-tabular or poorly structured data from websites and convert it into a structured, usable format such as a CSV file or spreadsheet.
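
As a rough illustration of turning loosely structured text into CSV, the sketch below parses "Label: value" lines into a record and serializes it. The sample text and the helper names text_to_record and records_to_csv are assumptions made for this example.

```python
import csv
import io

# Hypothetical, loosely structured text as it might appear on a profile page.
raw_text = """Name: Aphrodite
Favorite animal: Dove
Favorite color: Red
Hometown: Mount Olympus"""


def text_to_record(text):
    """Turn 'Label: value' lines into a dictionary."""
    record = {}
    for line in text.splitlines():
        label, separator, value = line.partition(":")
        if separator:  # skip lines that contain no colon
            record[label.strip()] = value.strip()
    return record


def records_to_csv(records):
    """Serialize a list of dictionaries to CSV text, one row per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


record = text_to_record(raw_text)
csv_text = records_to_csv([record])
print(csv_text)
```

The same two helpers could be applied to many pages in a loop, giving one CSV row per page.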

Similarly, web scraping can help us archive data and track changes to data online. In this article, we will look at the web scraping technique in depth.

Description

Web scraping is used in many fields to gather data that is not readily available in other formats. It is a valuable tool even for a casual programmer. For example, we might want to check our university page for the latest homework assignments and have them emailed to us. Web scraping targets specific information on the pages visited.

Methods of Web Scraping

There are two methods of extracting data from websites.

  • The Manual extraction method
  • The automated extraction method

The Manual extraction method

In this method, we manually copy and paste the site content. Although it is boring, time-consuming, and tedious, it is an effective way to scrape data from sites that have strong anti-scraping measures such as bot detection.

The automated extraction method

Automated web scraping is the automation of the process of extracting data from websites. It is carried out with web scraping software known as web scrapers. They automatically load and extract data from websites based on the user’s needs. A scraper may be custom-built to work with a single site or configured to work with any website.

Web Scraping Tools


Web scraping tools are developed specifically for extracting data from the internet. They are also known as web harvesting tools or data extraction tools. These tools are useful for anyone who needs to gather particular data from websites, because they extract data from a number of websites and present it to the user in structured form. Below are some of the most popular web scraping tools:

  • io
  • Scrapinghub
  • io
  • Parsehub
  • io

Python Web Scraper

  • Python’s standard library includes urllib, a package suitable for web scraping.
  • It contains tools for working with URLs.
  • The urllib.request module contains a function called urlopen().
  • It can be used to open a URL from within a program.
  • Type the following in IDLE’s interactive window to import urlopen().

>>> from urllib.request import urlopen        

  • The web page that we’ll open is at the following URL:

>>> url = "https://olympus.realpython.org/profiles/aphrodite"        

  • We can pass url to urlopen() to open the web page.

>>> page = urlopen(url)

  • urlopen() returns an HTTPResponse object:

>>> page
<http.client.HTTPResponse object at 0x105fef820>
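
The call to urlopen() can fail, for instance when the server cannot be reached or answers with an error status. The sketch below is one possible way to guard the call. The helper name open_page and the timeout value are assumptions made for illustration.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def open_page(url, timeout=10):
    """Return the response object for url, or None if the request fails.

    open_page is a hypothetical helper; the 10-second timeout is an
    arbitrary illustrative choice.
    """
    try:
        return urlopen(url, timeout=timeout)
    except HTTPError as error:
        # The server answered, but with an error status such as 404.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(f"Server returned HTTP {error.code} for {url}")
    except URLError as error:
        # The request never completed (bad host name, no connection, ...).
        print(f"Could not open {url}: {error.reason}")
    return None
```

A caller can then test the result with "if page is not None:" before reading from it.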

  • First, use the HTTPResponse object’s .read() method to extract the HTML from the page.
  • This returns a sequence of bytes.
  • Then use .decode() to decode the bytes to a string using UTF-8.

>>> html_bytes = page.read()
>>> html = html_bytes.decode("utf-8")
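
Outside the interactive window, the open, read, and decode steps can be collected in one function. The following is a minimal sketch: fetch_html and its default_encoding parameter are hypothetical names, and falling back to UTF-8 when the response declares no charset is an assumption.

```python
from urllib.request import urlopen


def fetch_html(url, default_encoding="utf-8"):
    """Open url, read the response body, and decode it to a string.

    fetch_html is a hypothetical helper, not part of urllib.
    """
    with urlopen(url) as page:  # the with block closes the response
        html_bytes = page.read()
        # Prefer the charset declared in the Content-Type header, if any.
        encoding = page.headers.get_content_charset() or default_encoding
    return html_bytes.decode(encoding)
```

Calling fetch_html("https://olympus.realpython.org/profiles/aphrodite") should then return the same HTML string as the step-by-step version, provided the page is reachable.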

  • We can now print the HTML to see the contents of the web page.

>>> print(html)
<html>
<head>
<title>Profile: Aphrodite</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/aphrodite.gif" />
<h2>Name: Aphrodite</h2>
<br><br>
Favorite animal: Dove
<br><br>
Favorite color: Red
<br><br>
Hometown: Mount Olympus
</center>
</body>
</html>        

  • Once we have the HTML as text, we can extract information from it in several different ways.

Text extraction from HTML by using String Methods

  • We can use string methods to extract information from a web page’s HTML.
  • For example, we can use .find() to search the text of the HTML for the <title> tags and extract the title of the web page.
  • If we know the index of the first character of the title and the index of the first character of the closing </title> tag, we can use a string slice to extract the title.
  • .find() returns the index of the first occurrence of a substring.
  • So we can get the index of the opening <title> tag by passing the string "<title>" to .find().

>>> title_index = html.find("<title>")
>>> title_index
14

  • We want the index of the title itself, not the index of the <title> tag.
  • We can add the length of the string "<title>" to title_index to get the index of the first letter in the title.

>>> start_index = title_index + len("<title>")
>>> start_index
21

  • Next, get the index of the closing </title> tag by passing the string "</title>" to .find().

>>> end_index = html.find("</title>")
>>> end_index
39

  • Lastly, we can extract the title by slicing the html string.

>>> title = html[start_index:end_index]
>>> title
'Profile: Aphrodite'
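
The find-and-slice steps can also be wrapped in a small function and tried offline. This is only a sketch: extract_title is a hypothetical helper name, and sample_html is a shortened copy of the page printed earlier, stored as a string literal so that no network access is needed.

```python
def extract_title(html):
    """Return the text between <title> and </title> using str.find()."""
    start_index = html.find("<title>") + len("<title>")
    end_index = html.find("</title>")
    return html[start_index:end_index]


# A shortened copy of the Aphrodite page, stored as a string literal.
sample_html = (
    "<html>\n"
    "<head>\n"
    "<title>Profile: Aphrodite</title>\n"
    "</head>\n"
    "</html>"
)

print(sample_html.find("<title>"))   # 14
print(sample_html.find("</title>"))  # 39
print(extract_title(sample_html))    # Profile: Aphrodite
```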

  • Below is another profile page with somewhat messier HTML that we can scrape:

>>> url = "https://olympus.realpython.org/profiles/poseidon"

  • Extract the title from this new URL using the same method as before.

>>> url = "https://olympus.realpython.org/profiles/poseidon"
>>> page = urlopen(url)
>>> html = page.read().decode("utf-8")
>>> start_index = html.find("<title>") + len("<title>")
>>> end_index = html.find("</title>")
>>> title = html[start_index:end_index]
>>> title
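
String methods match text exactly, so even small irregularities in the HTML can break the slicing approach. The sketch below uses a made-up snippet (not the actual contents of the Poseidon page) with a stray space inside the opening tag, and a hypothetical helper find_title that checks the result of .find() before slicing.

```python
# Hypothetical messy HTML: note the space in "<title >".
messy_html = "<html>\n<head>\n<title >Profile: Poseidon</title>\n</head>\n</html>"

# The naive approach: .find() returns -1 when the substring is absent,
# so start_index silently becomes -1 + 7 = 6.
start_index = messy_html.find("<title>") + len("<title>")
end_index = messy_html.find("</title>")
naive_title = messy_html[start_index:end_index]
print(repr(naive_title))  # '\n<head>\n<title >Profile: Poseidon'


def find_title(html):
    """Return the title text, or None if either exact tag is missing."""
    tag_index = html.find("<title>")
    end_index = html.find("</title>")
    if tag_index == -1 or end_index == -1:
        return None
    return html[tag_index + len("<title>"):end_index]


print(find_title(messy_html))  # None
```

Returning None makes the failure explicit instead of silently producing a wrong title.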

Conclusion

  • In this article, we learned about web scraping and its methods, and how to request a web page with Python’s built-in urllib library.
  • It is fun to write automated web scraping programs.
  • The Internet has no shortage of content that can lead to all kinds of exciting projects.
  • We must check a website’s terms of use before we start scraping it.

For more details, visit: https://www.technologiesinindustry4.com/2022/01/web-scraping-in-python.html
