Web Scraping using Regular Expression and GAMBAS


A web page, when delivered to a browser, is plain HTML markup, which means that each web page is a structured document. Some definitions of web scraping are:

(a) Web scraping is the practice of using a computer program to sift through a web page and gather the data in a useful format while at the same time preserving the structure of the data.

(b) Web scraping (also termed screen scraping, web data extraction, web harvesting etc.) is a technique for extracting large amounts of data from websites, whereby the data is extracted and saved to a local file on your computer or to a database in table (spreadsheet) format.

Is Web scraping legal?

Websites often allow third-party scraping. For example, most websites give Google express or implied permission to index their web pages. Scraping with such permission is generally legal, but a variety of laws may apply to unauthorized scraping, including contract, copyright and trespass law.

Data scraping in itself may not be illegal, but the data (factual or otherwise) may be subject to copyright. Using it without the owner's permission, especially if you are selling it, could lead to legal action. As a rough rule of thumb, crawling information that is publicly available on the internet is usually acceptable, subject to the site's terms and the laws mentioned above.

How to use GAMBAS for web scraping?

I am not certain how other tools handle web scraping. I have, however, found regular expressions (regex) to be a raw, powerful method for iterating through a website and extracting links, images and other data. The key here is that the coder of a website is often a lazy programmer and uses one template across the site to deliver HTML. The scheme for crawling a web page therefore needs a few regular expressions, refined by trial and error until they fetch what you need. Once this is done, scraping a website and generating a database (or a CSV file, which is much easier) takes mere seconds. I was developing a desktop application for web radio (using GAMBAS on Linux) and needed a database of URLs that stream radio/audio. A bit of web scraping helped.

Sample HTML String

<a href='https://www.radioau_try.net/#kiis-101-1' title='KIIS_#101.1'><img class='cover' src='https://cdn.webrad_try.io/images/logos/radioau-net/kiis-101-1.png' alt='KIIS 101.1' height='66' width='96'></a>


Step 1
Get the text from <a> to </a> (non-greedy, so that each link on a line is matched separately) : <a\b.*?</a>
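
As a rough illustration of Step 1, here is a minimal sketch in Python (not the GAMBAS code used for the application). The names SAMPLE, PAGE and ANCHOR_RE are made up for this example, and the page is simply the sample string repeated twice. It compares a greedy pattern with the non-greedy variant `<a\b.*?</a>`.

```python
import re

# The sample string from this post, split over several literals for readability.
SAMPLE = (
    "<a href='https://www.radioau_try.net/#kiis-101-1' title='KIIS_#101.1'>"
    "<img class='cover' src='https://cdn.webrad_try.io/images/logos/radioau-net/kiis-101-1.png' "
    "alt='KIIS 101.1' height='66' width='96'></a>"
)

# Hypothetical page: two links on the same line.
PAGE = SAMPLE + SAMPLE

# Greedy: '.*' runs to the LAST '/a>' on the line, so both links merge into one match.
greedy_count = len(re.findall(r"<a.*/a>", PAGE))

# Non-greedy: '.*?' stops at the FIRST '</a>'. DOTALL lets a link span several lines.
ANCHOR_RE = re.compile(r"<a\b.*?</a>", re.IGNORECASE | re.DOTALL)
anchors = ANCHOR_RE.findall(PAGE)

print(greedy_count, len(anchors))
```

Each element of `anchors` is one complete link, which can then be passed to the attribute patterns of the later steps.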


Step 2
Get the HREF part : (href|HREF)\s*=\s*('|")\s*((http|HTTP)(s|S)?://[a-zA-Z0-9./#_-]*)\s*('|")
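
A small Python sketch of Step 2 follows (again, not the original GAMBAS code). The helper name extract_href and the example.com link are assumptions for illustration. The character class is written as `[a-zA-Z0-9./#_-]` with the hyphen last, so it is read as a literal hyphen and not as a range, and a backreference makes the closing quote match the opening one.

```python
import re

# Opening tag of the sample link from this post.
ANCHOR = "<a href='https://www.radioau_try.net/#kiis-101-1' title='KIIS_#101.1'>"

# group(1) = the quote character, group(2) = the URL; \1 requires the same closing quote.
HREF_RE = re.compile(
    r"""href\s*=\s*(['"])\s*(https?://[a-zA-Z0-9./#_-]*)\s*\1""",
    re.IGNORECASE,
)

def extract_href(fragment):
    """Return the first http(s) URL found in an href attribute, or None."""
    match = HREF_RE.search(fragment)
    return match.group(2) if match else None

print(extract_href(ANCHOR))
print(extract_href('<a HREF = "http://example.com/a">'))  # hypothetical link
```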


Step 3
Get the title : title\s*=\s*('|")[a-zA-Z0-9 ._#-]*('|")
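
Step 3 can be sketched the same way in Python. The helper name extract_title is hypothetical. The sketch assumes that titles contain only letters, digits, spaces and the characters `.`, `_`, `#` and `-`, which covers the sample title 'KIIS_#101.1'. Titles with other characters would need a wider character class.

```python
import re

# The sample string from this post.
SAMPLE = (
    "<a href='https://www.radioau_try.net/#kiis-101-1' title='KIIS_#101.1'>"
    "<img class='cover' src='https://cdn.webrad_try.io/images/logos/radioau-net/kiis-101-1.png' "
    "alt='KIIS 101.1' height='66' width='96'></a>"
)

# \s* (whitespace) around '=', then a quoted run of the allowed title characters.
TITLE_RE = re.compile(r"""title\s*=\s*(['"])([a-zA-Z0-9 ._#-]*)\1""", re.IGNORECASE)

def extract_title(fragment):
    """Return the value of the first title attribute, or None if there is none."""
    match = TITLE_RE.search(fragment)
    return match.group(2) if match else None

print(extract_title(SAMPLE))
```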


Step 4
Get the image source : (src|SRC)\s*=\s*('|")\s*((http|HTTP)(s|S)?://[a-zA-Z0-9./#_-]*)\s*('|")
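
Step 4 mirrors the HREF step. A minimal Python sketch, with extract_src as a made-up helper name, might look like this:

```python
import re

# The sample string from this post.
SAMPLE = (
    "<a href='https://www.radioau_try.net/#kiis-101-1' title='KIIS_#101.1'>"
    "<img class='cover' src='https://cdn.webrad_try.io/images/logos/radioau-net/kiis-101-1.png' "
    "alt='KIIS 101.1' height='66' width='96'></a>"
)

# Same shape as the HREF pattern, keyed on 'src'; group(2) holds the image URL.
SRC_RE = re.compile(
    r"""src\s*=\s*(['"])\s*(https?://[a-zA-Z0-9./#_-]*)\s*\1""",
    re.IGNORECASE,
)

def extract_src(fragment):
    """Return the first http(s) image URL found in a src attribute, or None."""
    match = SRC_RE.search(fragment)
    return match.group(2) if match else None

print(extract_src(SAMPLE))
```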

I will not go into the detailed GAMBAS code for iterating over the URLs (keep diving until all URLs are visited). Using the above scheme, I could fetch the data and save it as a CSV file.
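
For readers who want to see the overall shape, here is a hedged Python sketch of such a crawl. It is not the GAMBAS code used for the application. The functions crawl, first and to_csv, the max_pages limit, the CSV column names and the two-page FAKE_SITE are all assumptions made up for this example. The fetch function is passed in as a parameter so that the sketch runs offline.

```python
import csv
import io
import re
from collections import deque

ANCHOR_RE = re.compile(r"<a\b.*?</a>", re.IGNORECASE | re.DOTALL)
HREF_RE = re.compile(r"""href\s*=\s*(['"])\s*(https?://[a-zA-Z0-9./#_-]*)\s*\1""", re.IGNORECASE)
TITLE_RE = re.compile(r"""title\s*=\s*(['"])([a-zA-Z0-9 ._#-]*)\1""", re.IGNORECASE)
SRC_RE = re.compile(r"""src\s*=\s*(['"])\s*(https?://[a-zA-Z0-9./#_-]*)\s*\1""", re.IGNORECASE)

def first(pattern, text):
    """Return group 2 of the first match, or an empty string."""
    match = pattern.search(text)
    return match.group(2) if match else ""

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns one (href, title, image) row per link found."""
    seen = {start_url}
    queue = deque([start_url])
    rows = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        for anchor in ANCHOR_RE.findall(fetch(url)):
            href = first(HREF_RE, anchor)
            if not href:
                continue
            rows.append((href, first(TITLE_RE, anchor), first(SRC_RE, anchor)))
            if href not in seen:  # queue each URL only once
                seen.add(href)
                queue.append(href)
    return rows

def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["href", "title", "image"])
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical two-page site held in memory.
FAKE_SITE = {
    "https://example.test/": "<a href='https://example.test/au' title='Australia'>AU</a>",
    "https://example.test/au": (
        "<a href='https://www.radioau_try.net/#kiis-101-1' title='KIIS_#101.1'>"
        "<img src='https://cdn.webrad_try.io/images/logos/radioau-net/kiis-101-1.png'></a>"
        "<a href='https://example.test/' title='Home'>Home</a>"
    ),
}

rows = crawl("https://example.test/", lambda url: FAKE_SITE.get(url, ""))
print(to_csv(rows))
```

In a real run, `fetch` would be replaced by a function that downloads the page, and a crawl like this should be limited to the site being scraped.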

The finished WebRadio application is what I had to web-scrape for. I will look to talk about it in another post.

Note : I code because I love doing so. It is the ultimate creative thrill to write working code, to experiment with a new language, or to try out an idea that has lingered for a while. Regular expressions and web radio are what I had wanted to write about for a while.

Anshul U.

Sr. Principal Specialist (Data Science & Generative AI) | Researcher | Speaker | Inventor

6 years ago

Regular expressions, XPath and OXPath are really good for web scraping.
