Web Scraping using Regular Expression and GAMBAS

Vinode Singh Ujlain

CTO/Head (R&D) , Systems Engineering , Software , IoT,OT & Cybersecurity,Ind 4.0, M2M , NPD, Electronics / Mechatronics, A&D | IIT Kharagpur, IIM Ahmadabad & Defense Services Staff College. Passionate Pythoneer

Published: November 13, 2018

A web page delivered to a browser is plain HTML markup, which means that every web page is a structured document. Some definitions of web scraping are:

(a) Web scraping is the practice of using a computer program to sift through a web page and gather the data in a useful format while at the same time preserving the structure of the data.

(b) Web scraping (also termed screen scraping, web data extraction, web harvesting, etc.) is a technique for extracting large amounts of data from websites, whereby the data is extracted and saved to a local file on your computer or to a database in table (spreadsheet) format.

Is Web scraping legal?

Websites often allow third-party scraping. For example, most websites give Google express or implied permission to index their web pages. Scraping with such permission is clearly legal. Unauthorized scraping, on the other hand, may fall under a variety of laws, including contract, copyright and trespass.

Data scraping in itself may not be illegal, but the data (factual or otherwise) may be subject to copyright. Using it without the owner's permission, especially if you are selling it, could lead to legal action. However, as a general rule, if the information is public on the internet, it is legal to crawl it.

How to use GAMBAS for Web scraping?

I am not certain how other tools handle web scraping. I did, however, find regular expressions (regex) to be a raw, powerful and potent way to iterate through a website and extract links, images and other data. The key is that the person who coded the website is often a lazy programmer who reuses one template across the whole site to deliver HTML. The crawling scheme therefore needs a bit of regular-expression writing and trial and error until it fetches what you need. Once that is done, scraping a website and generating a database (or a CSV, which is much easier) takes a mere few seconds. I was developing a desktop web radio application (using GAMBAS on Linux) and needed a database of URLs that stream radio/audio. A bit of web scraping helped.

Sample HTML String

<a href='https://www.radioau_try.net/#kiis-101-1' title='KIIS_#101.1'><img class='cover' src='https://cdn.webrad_try.io/images/logos/radioau-net/kiis-101-1.png' alt='KIIS 101.1' height='66' width='96'></a>


Step 1
Get the text between <a> and </a> : <a.*?</a>
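
Step 1 can be tried outside GAMBAS as well. The sketch below uses Python purely as a stand-in (the article's own code is GAMBAS and is not shown), and the variable names are hypothetical. It compares a greedy `.*` with a lazy `.*?` when two copies of the sample anchor sit on the same line.

```python
import re

# The article's sample anchor; two copies on one line simulate a dense page.
ANCHOR = (
    "<a href='https://www.radioau_try.net/#kiis-101-1' title='KIIS_#101.1'>"
    "<img class='cover' "
    "src='https://cdn.webrad_try.io/images/logos/radioau-net/kiis-101-1.png' "
    "alt='KIIS 101.1' height='66' width='96'></a>"
)
page = ANCHOR + ANCHOR

# Greedy: .* runs to the LAST "/a>" on the line, so both anchors fuse
# into a single match.
greedy = re.findall(r"<a.*/a>", page)

# Lazy: .*? stops at the FIRST "</a>", giving one match per anchor.
# \b keeps "<a" from also matching tags such as <abbr>, and DOTALL lets
# an anchor span several lines.
lazy = re.findall(r"<a\b.*?</a>", page, flags=re.DOTALL)

print(len(greedy), len(lazy))  # 1 2
```

When every anchor sits on its own line the two forms behave the same, because `.` does not cross a newline by default.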


Step 2
Get the HREF part : (href|HREF)\s*=\s*('|")\s*((http|HTTP)(s|S)?://[a-zA-Z0-9./#_-]*\s*)('|")
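
A point worth checking in Step 2 is the character class: inside `[...]`, a `-` between two characters denotes a range. The sketch below (Python as a stand-in for GAMBAS, with hypothetical names) shows that a class written as `[a-zA-Z0-9.-/#-\_]` accepts more than intended, then extracts the link with a tighter variant.

```python
import re

SAMPLE = "<a href='https://www.radioau_try.net/#kiis-101-1' title='KIIS_#101.1'>"

# "#-\_" is read as the range "#" through "_", which also covers
# characters such as "<" that were never meant to be part of a URL.
loose = re.compile(r"[a-zA-Z0-9.-/#-\_]")
loose_accepts_angle = loose.fullmatch("<") is not None

# Tighter variant: "-" is placed last so it is a literal hyphen, and \1
# requires the closing quote to be the same character as the opening one.
HREF_RE = re.compile(
    r"""href\s*=\s*(['"])\s*(https?://[A-Za-z0-9./#_-]*)\s*\1""",
    re.IGNORECASE,
)

match = HREF_RE.search(SAMPLE)
href = match.group(2) if match else None
print(loose_accepts_angle)  # True
print(href)                 # https://www.radioau_try.net/#kiis-101-1
```

The loose class still finds the sample link because the regex engine backtracks to the closing quote; the tighter class simply makes the intent explicit.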


Step 3
Get the Title : title\s*=\s*('|")[a-zA-Z0-9 ._#-]*('|")
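
The title step is the easiest one to get subtly wrong, because `\w` matches word characters rather than whitespace, and the title in the sample string contains `_` and `#`. The sketch below (Python as a stand-in, hypothetical names) checks a `\w`-based pattern against the sample and then tries an adjusted one. The adjusted pattern is an assumption about what the step is meant to capture, not the author's GAMBAS code.

```python
import re

SAMPLE = "<a href='https://www.radioau_try.net/#kiis-101-1' title='KIIS_#101.1'>"

# \w-based pattern: the value class has no "_" or "#", so matching stops
# at "KIIS" and never reaches the closing quote of the sample title.
w_based = re.compile(r"""title\w{0,}=\w{0,}('|")\w[a-zA-Z0-9 .-]{0,}('|")""")
w_based_hit = w_based.search(SAMPLE)

# Adjusted: \s* tolerates spaces around "=", the value class allows "_"
# and "#", and \1 forces matching opening and closing quotes.
TITLE_RE = re.compile(r"""title\s*=\s*(['"])([A-Za-z0-9 ._#-]*)\1""", re.IGNORECASE)

match = TITLE_RE.search(SAMPLE)
title = match.group(2) if match else None
print(w_based_hit)  # None
print(title)        # KIIS_#101.1
```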


Step 4
Get the Image Source : (src|SRC)\s*=\s*('|")\s*(http|HTTP)(s|S)?://[a-zA-Z0-9./#_-]*\s*('|")
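
Step 4 has the same shape as Step 2 with `src` in place of `href`, so one small helper can cover every attribute of interest. The sketch below is a Python stand-in, not the GAMBAS code. The helper `attr` and the `record` dictionary are hypothetical names introduced for illustration.

```python
import re

ANCHOR = (
    "<a href='https://www.radioau_try.net/#kiis-101-1' title='KIIS_#101.1'>"
    "<img class='cover' "
    "src='https://cdn.webrad_try.io/images/logos/radioau-net/kiis-101-1.png' "
    "alt='KIIS 101.1' height='66' width='96'></a>"
)


def attr(name, text):
    """Return the quoted value of the first `name=...` attribute, or None."""
    # (['"]) captures the opening quote; \1 demands the same closing quote.
    # The lazy (.*?) stops at that quote, and the \s* around it trims padding.
    pattern = r"\b" + re.escape(name) + r"""\s*=\s*(['"])\s*(.*?)\s*\1"""
    match = re.search(pattern, text, re.IGNORECASE | re.DOTALL)
    return match.group(2) if match else None


# One record per anchor block: link, station title and logo URL.
record = {name: attr(name, ANCHOR) for name in ("href", "title", "src")}
print(record["src"])
# https://cdn.webrad_try.io/images/logos/radioau-net/kiis-101-1.png
```

Matching "anything up to the closing quote" is more permissive than a URL-specific character class. That suits a templated site, but it will also pick up relative or malformed values, so a validity check may be wanted before the value is stored.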

I am not going into the detailed GAMBAS code for iterating over the URLs (keep diving until all URLs are visited); this shows a sample CSV grab of the data that I could fetch using the above scheme.
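
For readers who want a feel for the "keep diving until all URLs are visited" loop, here is a minimal sketch in Python. It is not the GAMBAS code. The function names `crawl` and `to_csv`, the `fetch` callback, the `max_pages` limit and the `example.test` pages are all assumptions made for illustration. The page fetcher is passed in as a function, so the example runs without touching the network.

```python
import csv
import io
import re
from collections import deque
from urllib.parse import urldefrag

ANCHOR_RE = re.compile(r"<a\b.*?</a>", re.IGNORECASE | re.DOTALL)
HREF_RE = re.compile(r"""href\s*=\s*(['"])\s*(https?://[^'"\s]*)\s*\1""", re.IGNORECASE)
TITLE_RE = re.compile(r"""title\s*=\s*(['"])(.*?)\1""", re.IGNORECASE)


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first walk that visits each URL once; returns {url: title}."""
    visited = set()
    queue = deque([start_url])
    found = {}
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        for block in ANCHOR_RE.findall(fetch(url)):
            href = HREF_RE.search(block)
            if href is None:
                continue
            # Drop "#fragment" so one page is not queued under many names.
            target = urldefrag(href.group(2)).url
            title = TITLE_RE.search(block)
            found.setdefault(target, title.group(2) if title else "")
            queue.append(target)
    return found


def to_csv(found):
    """Render the collected links as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["title", "url"])
    for url, title in found.items():
        writer.writerow([title, url])
    return buf.getvalue()


# Hypothetical three-page site standing in for real HTTP responses.
PAGES = {
    "https://example.test/": "<a href='https://example.test/rock' title='Rock'>Rock</a>"
                             "<a href='https://example.test/jazz' title='Jazz'>Jazz</a>",
    "https://example.test/rock": "<a href='https://example.test/' title='Home'>Home</a>",
    "https://example.test/jazz": "<a href='https://example.test/rock' title='Rock'>Rock</a>",
}
found = crawl("https://example.test/", lambda url: PAGES.get(url, ""))
print(to_csv(found))
```

The `visited` set is what stops the crawl from looping when pages link back to each other, and `max_pages` is a safety limit for sites larger than expected.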

This is the finished WebRadio application for which I had to web-scrape. I will look to talk about it in another post.

Note: I code because I love doing so. It is the ultimate creative thrill to write working code, to experiment with a new language, or to try out an idea that has lingered for a while. Regular expressions and web radio are topics I had wanted to write about for some time.

Anshul U.

Sr. Principal Specialist (Data Science & Generative AI) | Researcher | Speaker | Inventor

6 years ago

Regular expressions, XPath and OXPath are really good for web scraping.


