Scrape Your Way to Data

In this article, we will get an insight into the core process of every 'meme data startup' out there - the Web Scraper.

More precisely, the Python web scraping framework Scrapy, the HTML parser Beautiful Soup, and XPath, an expression language for selecting elements in HTML documents.

Scraping and Selling

Scraping is an automated process of gathering information from the Internet.

Why automated?

Think of how repetitive it is to search for an item in your favorite webstore: all that clicking, scrolling, reading, redirection and so on. That is not a viable process when selling data at an industry level to your meme data startup clients. And, if you are an engineer at heart, you know how bad repetitive manual work is - so we write a web scraper to ingest all that data automatically.

Be warned though - a scraper needs constant maintenance, given how fluid the internet is. That is why, eventually, you will want to move your scrappy meme scraping startup over to data vendors and data APIs.

Eyeballing a Website

What's the best place to get some meme material to sell? gettyimages.com, searched for memes.

We want to create a database of meme images, so we need every meme image and its URL out there. Let's analyze the page at an HTML level:

[Screenshot: the search result page's HTML markup in the browser's developer tools]

Looking at that markup, we see that the <picture> tag encapsulates exactly what we want, and from the <img> tag inside it we have the source of the image to grab and save in our DB (the full snippet is reproduced further down).
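
As a quick sanity check before wiring up Scrapy, here is a minimal Beautiful Soup sketch of that idea; the page.html file is a hypothetical local copy of the search results page:

from bs4 import BeautifulSoup

# page.html is assumed to be a saved copy of the search results page
with open("page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# every <picture> tag wraps an <img> whose src attribute is the image URL
urls = [img["src"] for img in soup.select("picture img") if img.has_attr("src")]
print(urls[:5])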

Scaffolding Scraping

Create an environment (always create an environment when working with python):

conda create -n gettyscrapy python=3.8 -y
conda activate gettyscrapy

Install the required packages:


conda install scrapy beautifulsoup4 pysqlite3 -y        

Quickstart a scraping project:

scrapy startproject the_scraper_projectname        

This will generate a Scrapy project with this layout:

  • the_main_dir/
    • the_scraper_projectname/
      • settings.py - parameters for scrapers and runners, e.g. concurrency, bot name, whether to respect robots.txt guidelines, etc. (see the sketch after this list).
      • items.py - the structured data coming out of a scraper.
      • middlewares.py - additional functionality provided by Scrapy, hooked into various stages of the request/response lifecycle.
      • pipelines.py - gluing everything together and persisting the data somewhere.
      • spiders/ - where our scrapers are located:
        • spider1.py
    • scrapy.cfg - holds important variables for a Scrapy deployment, e.g. project name and configuration file path.
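
For reference, here are a few of the settings.py knobs mentioned above, with illustrative values:

# settings.py (excerpt) - illustrative values only
BOT_NAME = "the_scraper_projectname"

ROBOTSTXT_OBEY = True       # respect the site's robots.txt guidelines
CONCURRENT_REQUESTS = 8     # how many requests run in parallel
DOWNLOAD_DELAY = 1.0        # seconds to wait between requests to the same site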

And finally, quickstart a spider to crawl the site:

scrapy genspider -t crawl gettyimages 'https://www.gettyimages.com/search/2/image?family=creative&phrase=meme'        

Which will generate a spider along these lines:

from scrapy.spiders import CrawlSpider


class GettyImagesSpider(CrawlSpider):
    name = "gettyimages"
    start_urls = ['https://www.gettyimages.com/search/2/image?family=creative&phrase=meme']

    def __init__(self, *args, **kwargs):
        """Constructor."""
        super().__init__(*args, **kwargs)

    def parse(self, response):
        """Image processor - extracts image URLs from the response."""

Exciting, isn't it? We now have a script that can browse for you with this command:

scrapy crawl gettyimages        

But WAIT - before running a scraper, understand some rules of engagement, so you are a good scraper and not a data-stealing villain:

  • Don't be a burden on the site's infrastructure - i.e. don't DDoS the site.
  • Respect the robots.txt - follow the delays and guidelines defined there.
  • Don't violate copyright - the images we are scraping are royalty free (not that it ever stopped any 'high' growth startup).
  • Don't breach GDPR (unless you are a startup outside of that jurisdiction - then good luck to those whose data was breached).

Now you can run the spider; for now it will just output the HTML in the console.

Ingesting Scraped Crumbs

Back to our website analysis, this is the HTML we will work with:

<figure class="MosaicAsset-module__figure___qJh1Q" style="background-color:#da916d">
  <picture>
    <source srcset="https://media.gettyimages.com/photos/infidelity-concept-unfaithful-womanizer-guy-turning-around-amazed-at-picture-id1318934935?k=20&amp;m=1318934935&amp;s=612x612&amp;w=0&amp;h=YN6dO3HPwcUfkO1nQ9l_dApD-bl84_JouXd_-7jQWe8=">
    <img class="MosaicAsset-module__thumb___yvFP5" src="https://media.gettyimages.com/photos/infidelity-concept-unfaithful-womanizer-guy-turning-around-amazed-at-picture-id1318934935?k=20&amp;m=1318934935&amp;s=612x612&amp;w=0&amp;h=YN6dO3HPwcUfkO1nQ9l_dApD-bl84_JouXd_-7jQWe8=" alt="infidelity concept. unfaithful womanizer guy turning around amazed at another woman while walking with his girlfriend on street - meme stock pictures, royalty-free photos &amp; images" width="612" height="408">
  </picture>
    <figcaption>infidelity concept. unfaithful womanizer guy turning around amazed at another woman while walking with his girlfriend on street - meme stock pictures, royalty-free photos &amp; images</figcaption>
</figure>        

Now let's parse the page's HTML to get the image URLs; inside the spider this is done with XPath via the response selector:

xpath_sel = response.xpath("//picture").xpath("img/@src")

XPath will select all the picture tags from the root of the document, which is signalled by the characters '//'. From there, it looks into every img tag and composes a list of URLs from the src attribute, selected with @src.
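
To see what this selector returns, you can try the same expression outside the spider with Scrapy's standalone Selector. A minimal sketch on a trimmed copy of the markup above (the URL is shortened for readability):

from scrapy.selector import Selector

# a trimmed version of the <figure> markup shown above
sample_html = """
<figure>
  <picture>
    <source srcset="https://media.gettyimages.com/photos/example-id123?s=612x612">
    <img src="https://media.gettyimages.com/photos/example-id123?s=612x612">
  </picture>
</figure>
"""

sel = Selector(text=sample_html)

# same expression as in the spider: every <picture> tag, then each <img>'s src attribute
urls = sel.xpath("//picture").xpath("img/@src").getall()
print(urls)  # ['https://media.gettyimages.com/photos/example-id123?s=612x612']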

Next, create a dictionary and yield it to the collector running this spider:

urls = xpath_sel.getall()
if urls is not None and len(urls) != 0:
    for url in urls:
        yield {'url': url}

The crawler will pipeline the items it collected from the spiders into an SQLite database, which we prepare in the pipeline below, appending a row per item:

import sqlite3


class SpiderPipeline(object):
    def open_spider(self, spider):
        # called when the spider is opened
        self.con = sqlite3.connect('urls.db')  # create/open the DB file
        self.cur = self.con.cursor()
        self.cur.execute(
            '''DROP TABLE IF EXISTS urls''')  # drop the table if it already exists
        self.cur.execute('''CREATE TABLE urls (url)''')  # create the table
        self.con.commit()

    def close_spider(self, spider):
        # called when the spider is closed
        self.con.close()

    def process_item(self, item, spider):
        # called for each item yielded by the spiders
        # parameterised query, so quotes in a URL can't break the statement
        self.cur.execute("INSERT INTO urls (url) VALUES (?)", (item['url'],))
        self.con.commit()
        return item

The item here is a simple dictionary, though it can be declared and processed through an Item class:

import scrapy

class UrlItem(scrapy.Item):
    url = scrapy.Field()  # just a passthrough with no extra processing
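
For the pipeline above to receive items at all, it has to be registered in settings.py. A minimal sketch, assuming SpiderPipeline lives in the project's pipelines module:

# settings.py (excerpt)
ITEM_PIPELINES = {
    'the_scraper_projectname.pipelines.SpiderPipeline': 300,  # lower numbers run earlier
}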

Can I have Some More Scraps?

[Image: Mark Lester in the 1968 adaptation of Oliver Twist]

The Getty Images search results span multiple pages - 52 at the time of writing this article. We need a method to get to the next page.

Observe the URL: https://www.gettyimages.com/photos/meme?assettype=image&phrase=meme&sort=mostpopular&license=rf%2Crm&page=2.

There is a query string parameter, page, that defines which page of results we are looking at. Let's grab it programmatically:

from urllib.parse import urlparse, parse_qs

parsed_url = urlparse(response.request.url)
captured_value = parse_qs(parsed_url.query)
# next page number: current page + 1, defaulting to 1 when no page parameter is present
page = 1 if captured_value is None or 'page' not in captured_value else int(
    captured_value['page'][0]) + 1

With the standard library's urllib.parse, we wrap the URL in a structured object and parse its query string using parse_qs, which returns a dict mapping each parameter to a list of its values.
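
A quick illustration of what those two calls return for the URL above:

from urllib.parse import urlparse, parse_qs

url = ("https://www.gettyimages.com/photos/meme"
       "?assettype=image&phrase=meme&sort=mostpopular&license=rf%2Crm&page=2")

parsed_url = urlparse(url)
print(parsed_url.query)
# assettype=image&phrase=meme&sort=mostpopular&license=rf%2Crm&page=2

print(parse_qs(parsed_url.query))
# {'assettype': ['image'], 'phrase': ['meme'], 'sort': ['mostpopular'],
#  'license': ['rf,rm'], 'page': ['2']}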

From here, we can signal to scrapy's crawler to take us to another page:

import re  # at the top of the spider module
from scrapy import Request

abs_url = re.sub(r'page=\d+', f'page={page}', response.request.url)
max_pages = response.xpath(
    "//span[@class='PaginationRow-module__lastPage___k9Pq7']/text()"
).get()

if max_pages is not None and len(max_pages) > 0 and int(max_pages) >= page:
    yield Request(
        url=abs_url,
        callback=self.parse
    )

Yielding a Request object will run the spider on the page at the new URL. Note that we also scrape the maximum number of pages, so we don't send our crawler to pages that are beyond our scope or don't exist. The HTML we are looking at is this:

<section class="PaginationRow-module__container___LxZJN">
  <input type="text" maxlength="999" class="PaginationRow-module__input___VqORp" autocomplete="off" name="page" value="2"> of <span class="PaginationRow-module__lastPage___k9Pq7">52</span>
</section>        
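
As a side note, since we generated the spider from the crawl template, pagination can also be expressed declaratively with CrawlSpider rules instead of hand-building the next URL. A sketch of that alternative, assuming the result pages expose pagination as plain links (the class name, spider name and allow pattern here are illustrative):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class GettyImagesRuleSpider(CrawlSpider):
    name = "gettyimages_rules"
    allowed_domains = ["gettyimages.com"]
    start_urls = ['https://www.gettyimages.com/search/2/image?family=creative&phrase=meme']

    # follow any link whose query string contains a page parameter and
    # parse each result page with parse_page (CrawlSpider reserves parse for itself)
    rules = (
        Rule(LinkExtractor(allow=r'page=\d+'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        for url in response.xpath("//picture").xpath("img/@src").getall():
            yield {'url': url}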

Difficulties with Scraping

  • The Web grows organically, changes constantly and is a hot mess. Your scraper needs to be updated often - or switch to APIs if the site offers them.
  • Each website - and sometimes each page within it - is unique and will need custom code to scrape.
  • Durability - given that sites change often and without warning, your scraper has to fail gracefully, and often (see the errback sketch below).

Scraping is a lot of effort and maintenance.
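
One way to fail gracefully in Scrapy is to attach an errback to every Request, so network errors and broken pages get logged instead of killing the whole crawl. A minimal sketch; the on_error name and the logging choice are just one option:

import scrapy


class GettyImagesSpider(scrapy.Spider):
    name = "gettyimages"
    start_urls = ['https://www.gettyimages.com/search/2/image?family=creative&phrase=meme']

    def start_requests(self):
        for url in self.start_urls:
            # errback is invoked on failures such as DNS errors, timeouts and non-200 responses
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        for url in response.xpath("//picture").xpath("img/@src").getall():
            yield {'url': url}

    def on_error(self, failure):
        # log the failure and keep crawling instead of dying on the first broken page
        self.logger.error("Request failed: %s", repr(failure))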

Conclusion

In this article, we automated the browsing of a site using Python's Scrapy and Beautiful Soup, together with XPath.

Scraping is used to collect data - in this case for our fictional meme startup - and to process it further down the line. Use the code in this article to build your own scraper, or when applying to one of these startups - it's a common interview question.

When scraping, always be fair and respectful to the sites you are crawling, and follow the guidelines in their robots.txt!

References

Github

The article is available on GitHub and its sample scraper can be forked from this other repo.

This article by Adam Darmanin is licensed under CC BY-NC-SA 4.0.
