Scrape Your Way to Data
In this article, we will get an insight into the core process of every 'meme data startup' out there: the web scraper.
More precisely, the Python scraping framework Scrapy and the HTML parser Beautiful Soup, along with XPath, an expression language for selecting elements in HTML documents.
Scraping and Selling
Scraping is an automated process of gathering information from the Internet.
Why automated?
Think of how repetitive it is to search for an item in your favorite webstore: all that clicking, scrolling, reading, redirection, etc. This is not a viable process for selling data at an industrial scale to your meme data startup clients. And if you are an engineer at heart, you know how painful repetitive manual work is, so we write a web scraper to ingest all that data automatically.
Be warned though: a scraper needs constant maintenance, given how fluid the web is. That is why, eventually, you want to move your scrappy meme scraping startup to data vendors and data APIs.
Eyeballing a Website
What's the best place to get some meme material to sell? gettyimages.com, searched for memes.
We want to create a database of meme images, so we need every meme image and its URL out there. Let's analyze the page at the HTML level.
Looking at that markup, we see that the picture tag encapsulates exactly what we want, and from the img tag inside it we get the source URL of the image to grab and save in our DB.
Scaffolding Scraping
Create an environment (always create an environment when working with Python):
conda create -n gettyscrapy python=3.8 -y
conda activate gettyscrapy
Install the required packages:
conda install scrapy beautifulsoup4 pysqlite3 -y
Quickstart a scraping project:
scrapy startproject the_scraper_projectname
This will generate a Scrapy project in this format:
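The generated layout typically looks like this (the folder names follow whatever name you passed to startproject):

the_scraper_projectname/
    scrapy.cfg                # deploy/config entry point
    the_scraper_projectname/
        __init__.py
        items.py              # item definitions
        middlewares.py        # spider and downloader middlewares
        pipelines.py          # item pipelines (we will use this below)
        settings.py           # project settings
        spiders/
            __init__.py       # generated spiders land in this folder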
And finally, quickstart a spider to crawl the site:
scrapy genspider -t crawl gettyimages 'https://www.gettyimages.com/search/2/image?family=creative&phrase=meme'
Which will generate this code:
from scrapy.spiders import CrawlSpider


class GettyImagesSpider(CrawlSpider):
    name = "gettyimages"
    start_urls = ['https://www.gettyimages.com/search/2/image?family=creative&phrase=meme']

    def __init__(self, topic, *args, **kwargs):
        """Constructor"""
        super().__init__(*args, **kwargs)  # keep CrawlSpider's own setup intact

    def parse(self, response):
        """Image processor"""
Exciting, isn't it? We now have a script that can browse for us, run with this command:
scrapy crawl gettyimages
But WAIT: before running a scraper, understand some rules of engagement, so you are a good scraper and not a data-stealing villain.
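At a minimum: honor the site's robots.txt, throttle your request rate, and identify your bot with an honest user agent. In Scrapy these are plain settings; a minimal sketch for settings.py (the delay value and the contact URL are illustrative assumptions, not Getty Images' actual limits):

# settings.py -- polite-crawling defaults (values are illustrative)
ROBOTSTXT_OBEY = True          # respect the site's robots.txt rules
DOWNLOAD_DELAY = 2             # wait ~2s between requests to the same domain
AUTOTHROTTLE_ENABLED = True    # back off automatically when the site slows down
USER_AGENT = "gettyscrapy (+https://example.com/contact)"  # placeholder contact URL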
Now you can run the spider; for the moment it will just output the raw HTML to the console.
Ingesting Scraped Crumbs
Back to our website analysis, this is the HTML we will work with:
<figure class="MosaicAsset-module__figure___qJh1Q" style="background-color:#da916d">
<picture>
<source srcset="https://media.gettyimages.com/photos/infidelity-concept-unfaithful-womanizer-guy-turning-around-amazed-at-picture-id1318934935?k=20&m=1318934935&s=612x612&w=0&h=YN6dO3HPwcUfkO1nQ9l_dApD-bl84_JouXd_-7jQWe8=">
<img class="MosaicAsset-module__thumb___yvFP5" src="https://media.gettyimages.com/photos/infidelity-concept-unfaithful-womanizer-guy-turning-around-amazed-at-picture-id1318934935?k=20&m=1318934935&s=612x612&w=0&h=YN6dO3HPwcUfkO1nQ9l_dApD-bl84_JouXd_-7jQWe8=" alt="infidelity concept. unfaithful womanizer guy turning around amazed at another woman while walking with his girlfriend on street - meme stock pictures, royalty-free photos & images" width="612" height="408">
</picture>
<figcaption>infidelity concept. unfaithful womanizer guy turning around amazed at another woman while walking with his girlfriend on street - meme stock pictures, royalty-free photos & images</figcaption>
</figure>
Let's parse the page's HTML to get the image URLs. We will do this with XPath, using the Scrapy response's built-in selectors (Beautiful Soup itself does not understand XPath):
xpath_sel = response.xpath("//picture").xpath("img/@src")
XPath will select all the picture tags anywhere in the document, which is signalled by the leading '//' characters. From there, it looks into every img tag and composes a list of URLs from the src attribute, selected with @src.
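For comparison, since Beautiful Soup is in our toolbox but has no XPath support, here is a rough sketch of the same extraction using its find_all API (the helper name extract_image_urls is made up for illustration):

from bs4 import BeautifulSoup


def extract_image_urls(html):
    """Collect the src of every <img> nested inside a <picture> tag."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        img["src"]
        for picture in soup.find_all("picture")
        for img in picture.find_all("img")
        if img.has_attr("src")
    ]


# inside a Scrapy callback this could be called as:
# urls = extract_image_urls(response.text)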
Next, create a dictionary for each URL and yield it to the collector running this spider:
urls = xpath_sel.getall()
if urls is not None and len(urls) != 0:
    for url in urls:
        yield {'url': url}
The crawler will pipeline the items collected from the spider into the SQLite database we prepared, appending a row for each:
import sqlite3


class SpiderPipeline(object):

    def open_spider(self, spider):
        # called when the spider is opened
        self.con = sqlite3.connect('urls.db')  # create/open the DB
        self.cur = self.con.cursor()
        self.cur.execute(
            '''DROP TABLE IF EXISTS urls''')  # drop the table if it already exists
        self.cur.execute('''CREATE TABLE urls (url)''')  # create a table
        self.con.commit()

    def close_spider(self, spider):
        # called when the spider is closed
        self.con.close()

    def process_item(self, item, spider):
        # called for each item yielded by the spider
        # parameterised query avoids SQL injection from scraped content
        self.cur.execute("INSERT INTO urls (url) VALUES (?)", (item['url'],))
        self.con.commit()
        return item
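One detail the snippet above glosses over: a pipeline only runs if it is registered in settings.py. A minimal sketch, assuming the class lives in the project's pipelines.py (the number is the execution order, 0-1000, lower runs first):

# settings.py
ITEM_PIPELINES = {
    "the_scraper_projectname.pipelines.SpiderPipeline": 300,
}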
The item here is a simple dictionary, though it can be declared and processed through Scrapy's Item class:
import scrapy


class Item(scrapy.Item):
    pass  # just a passthrough with no change
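If you want Scrapy to validate the keys you set on an item, you can declare its fields explicitly; a small sketch (the class name MemeImageItem is made up for illustration):

import scrapy


class MemeImageItem(scrapy.Item):
    # declared fields make typos in item keys raise a KeyError instead of passing silently
    url = scrapy.Field()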
Can I have Some More Scraps?
The Getty Images search results span multiple pages, 52 at the time of writing this article. We need a way to move to the next page.
Observe the URL: https://www.gettyimages.com/photos/meme?assettype=image&phrase=meme&sort=mostpopular&license=rf%2Crm&page=2.
There is a query-string parameter, page, that defines which page we are looking at. Let's grab it programmatically:
# requires `from urllib.parse import urlparse, parse_qs` at the top of the spider
parsed_url = urlparse(response.request.url)
captured_value = parse_qs(parsed_url.query)
page = 1 if captured_value is None or 'page' not in captured_value else int(
    captured_value['page'][0]) + 1
With the standard library's urllib.parse, we can wrap the URL in a structured object and parse its query string using parse_qs. parse_qs returns a dictionary mapping each parameter name to a list of its values.
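As a quick illustration of what parse_qs produces for a URL like the one above (the expected output is shown as comments):

from urllib.parse import urlparse, parse_qs

url = "https://www.gettyimages.com/photos/meme?assettype=image&phrase=meme&page=2"
query = parse_qs(urlparse(url).query)
# query == {'assettype': ['image'], 'phrase': ['meme'], 'page': ['2']}
print(query['page'][0])  # -> '2'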
From here, we can signal Scrapy's crawler to take us to the next page:
# requires `import re` and `from scrapy import Request` at the top of the spider
abs_url = re.sub(r'page=\d+', f'page={page}', response.request.url)
max_pages = response.xpath(
    "//span[@class = 'PaginationRow-module__lastPage___k9Pq7']/text()"
).get()
if max_pages is not None and len(max_pages) > 0 and int(max_pages) >= page:
    yield Request(
        url=abs_url,
        callback=self.parse
    )
Yielding a Request object tells Scrapy's crawler to run the spider on the new page at the given URL. Note that we also scrape the maximum number of pages, so we don't send our crawler to pages that are beyond our scope or don't exist. The HTML we are looking at is this:
<section class="PaginationRow-module__container___LxZJN">
<input type="text" maxlength="999" class="PaginationRow-module__input___VqORp" autocomplete="off" name="page" value="2"> of <span class="PaginationRow-module__lastPage___k9Pq7">52</span>
</section>
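Putting the pieces together, here is a rough sketch of the whole spider with everything in one parse callback: the image extraction, the page arithmetic, and the follow-up request. It is a consolidation of the snippets above, not the exact code from the repository; the page arithmetic is tightened slightly so the first page, which carries no page parameter, also advances.

import re
from urllib.parse import urlparse, parse_qs

from scrapy import Request
from scrapy.spiders import CrawlSpider


class GettyImagesSpider(CrawlSpider):
    name = "gettyimages"
    start_urls = ['https://www.gettyimages.com/search/2/image?family=creative&phrase=meme']

    def parse(self, response):
        # 1. yield one item per image URL found on this page
        for url in response.xpath("//picture").xpath("img/@src").getall():
            yield {'url': url}

        # 2. figure out which page we are on and which one comes next
        query = parse_qs(urlparse(response.request.url).query)
        current_page = int(query['page'][0]) if 'page' in query else 1
        next_page = current_page + 1

        # 3. follow the next page only while it is within the pagination bounds
        max_pages = response.xpath(
            "//span[@class = 'PaginationRow-module__lastPage___k9Pq7']/text()"
        ).get()
        if max_pages and next_page <= int(max_pages):
            if 'page=' in response.request.url:
                next_url = re.sub(r'page=\d+', f'page={next_page}', response.request.url)
            else:
                next_url = response.request.url + f'&page={next_page}'
            yield Request(url=next_url, callback=self.parse)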
Difficulties with Scraping
Scraping is a lot of effort and maintenance. As noted earlier, sites change their markup all the time, and the auto-generated class names on Getty Images (like MosaicAsset-module__figure___qJh1Q) are a hint that your selectors can break with any redesign.
Conclusion
In this article, we automated the browsing of a site using Python's Scrapy, BeautifulSoup, and XPath.
Scraping is used to collect data, in this case for our fictional meme startup, and to process it further down the line. Use the code in this article to build your own scraper, or to prepare before applying to one of those startups: it's a common interview question.
When scraping, always be fair and respectful to the sites you are crawling through, and follow the guidelines in robots.txt!
References
Github
This Article by Adam Darmanin is licensed under CC BY-NC-SA 4.0