Web scraping and alternative data for financial markets
Pierluigi Vinciguerra
Co Founder and CTO at Databoutique.com | Writing on The Web Scraping Club
In many posts we have seen how to scrape the web under different circumstances, such as Cloudflare-protected websites or mobile apps. Still, we haven't looked much into which sectors benefit the most from the scraped data.
Some of them are fairly obvious: every e-commerce company selling online wants to understand what its competitors are selling, and the same applies to delivery apps and many other markets.
But there's one sector that is always hungry for data, since a new and reliable dataset can translate into millions of dollars in benefits: finance. With my current company, RE-Analytics, we have gained (and are still gaining) experience on the ground in this industry, so in this post and in the next The Lab (scheduled for the 27th of April on The Web Scraping Club) we'll take a deep dive into the world of Alternative Data.
What do we mean by Alternative Data?
In the financial sector, data is key for investors making decisions about their investment strategies.
As the world's economy becomes increasingly digitalized, new sources of information are emerging. Traditional financial data sources, such as financial statements, balance sheets, and historical market data, keep their importance, but sources from the digital world are becoming more and more interesting to the financial industry: these are called alternative data. By the common definition, alternative data is information about a particular company, published by sources outside that company, which can provide unique and timely insights into investment opportunities.
What types of alternative data exist?
Since alternative data is data that doesn't come from inside the company, the definition covers a wide range of possibilities.
As we can see from this slide from alternativedata.org, in addition to web data there are satellite images, credit card transactions, sentiment data, and more.
A 2010 article helps us understand how satellite images can be used for financial purposes: a company called Remote Sensing Metrics LLC monitored 100 Walmart parking lots as a representative sample and counted, month by month, the cars parked outside, estimating quarterly revenues from the flow of shoppers.
As you can easily understand, you won't get the figure down to the dollar but, with a proper model, as soon as the quarter ends you can have a revenue estimate for a publicly traded stock months before the official numbers are available to everyone.
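The idea of "a proper model" can be as simple as a linear regression of reported revenues on observed car counts. Here's a minimal sketch; all figures below are invented for illustration, and a real model would be fit on actual historical (car count, reported revenue) pairs.

```python
# Sketch: estimating quarterly revenue from parking-lot car counts.
# All numbers are made up for illustration purposes.

def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b*x, in pure Python."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    b = cov / var
    a = mean_y - b * mean_x
    return a, b

# Hypothetical history: average cars counted per lot vs. reported revenue ($B)
car_counts = [1180, 1250, 1100, 1320, 1210]
revenues = [114.0, 120.5, 108.2, 127.0, 116.8]

a, b = fit_linear(car_counts, revenues)

# As soon as the quarter ends, plug in the freshly counted cars
latest_count = 1290
estimate = a + b * latest_count
print(f"Estimated quarterly revenue: ${estimate:.1f}B")
```

The edge comes purely from timing: the car counts are observable weeks or months before the company files its official numbers.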
Why is web scraping important in the alternative data landscape?
As we have seen from the slide before, alternative data providers can be divided into two groups: those who own, or rework others', proprietary data (in the satellite image example above, the firm acquired the images from providers and tailored its data product into a Walmart revenue estimator), and those who extract data from public sources and derive insights from it (the web data category, but also sentiment).
An example of the second category is a company that, starting from online reviews, extracts customer sentiment and assesses whether the targeted company is losing its grip on its customers. Or, by scraping e-commerce data, we can see whether a brand is discounting much more heavily than its direct competitors, which can be a sign of product or sales issues.
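The discount comparison can be sketched in a few lines: given scraped product records, compute the share of each brand's assortment currently on sale. The records and field names below are invented examples, not a real schema.

```python
# Sketch: comparing how heavily brands are discounted in a scraped
# e-commerce snapshot. Records and field names are invented examples.
from collections import defaultdict

products = [
    {"brand": "BrandA", "full_price": 100.0, "current_price": 70.0},
    {"brand": "BrandA", "full_price": 80.0, "current_price": 65.0},
    {"brand": "BrandB", "full_price": 120.0, "current_price": 120.0},
    {"brand": "BrandB", "full_price": 60.0, "current_price": 45.0},
]

on_sale = defaultdict(int)
totals = defaultdict(int)
for p in products:
    totals[p["brand"]] += 1
    if p["current_price"] < p["full_price"]:
        on_sale[p["brand"]] += 1

for brand in sorted(totals):
    share = on_sale[brand] / totals[brand]
    print(f"{brand}: {share:.0%} of assortment on sale")
```

A brand with a persistently higher on-sale share than its peers is the kind of signal a fundamental analyst might dig into.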
What should you watch out for with web-scraped data in the financial industry?
Financial markets are strictly regulated to avoid fraud and so-called insider trading
For this reason, if you want to sell data to hedge funds and investors, be prepared for a lot of paperwork.
A good article by Zyte explains in detail what you need to prove to funds to demonstrate that you collected data properly. As stated in the article:
Generally speaking, the risks associated with alternative data can be broken into four categories:
Let's briefly summarize the risks related to each of the four categories.
Exclusivity & Insider Trading
As we said before, insider trading is the practice of trading stocks based on information that is not publicly available. This means that data behind a paywall is generally off limits, since it is available only to paying users of the target website, not to everyone. Likewise, if a scraper needs to log in to the target website to get some data, that raises red flags, and you must be sure you're not breaking any ToS by doing so.
Privacy violations
This applies when scraping personal data from the web, since in the last few years privacy regulations, such as the GDPR in Europe and the CCPA in California, have become much stricter about how personal data can be collected and processed.
For this reason, scraping personal data is generally a no-go in every project, unless you can anonymize it.
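One common anonymization technique is salted hashing of identifiers, so records can still be linked without storing the raw value. This is a minimal sketch, not compliance advice; the salt name and policy are assumptions, and whether hashing counts as sufficient anonymization under a given regulation needs legal review.

```python
# Sketch: pseudonymizing a personal identifier with a salted hash.
# SALT is a hypothetical project secret that must be kept out of the dataset.
import hashlib

SALT = b"project-specific-secret"

def anonymize(email: str) -> str:
    """Map an email to a stable, non-reversible 64-char hex token."""
    return hashlib.sha256(SALT + email.strip().lower().encode()).hexdigest()

# The same person always maps to the same token, but the raw email is gone
print(anonymize("jane.doe@example.com")[:12])
```

The normalization (strip and lowercase) ensures the same address scraped in different forms yields the same token.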
Copyright Infringement
Scraping and reselling copyright-protected content, such as photos and articles, is a bad idea in general, and in this context it's a definite no.
Data acquisitions
Funds will certainly want to verify that the whole data acquisition process was executed as fairly as possible and did not cause any harm to the target websites.
For this reason, the Investment Data Standards Organization released a checklist with best practices to follow for web scraping.
You can find all the points in the linked file; as you can see, the guidelines are very strict, to avoid any possible problem for both the data provider and the fund itself.
Key features for web-scraped data in the financial industry
Given all this information and these premises, what features does the financial industry require before considering a dataset interesting?
Well, first of all, we need to understand that each fund has its own strategy for studying the markets, and this shapes what it is looking for.
Oversimplifying, we can divide funds into two macro-categories, quantitative and fundamental, keeping in mind that many funds sit somewhere between these two poles, mixing the two strategies.
Basically, fundamental investors study the economics, business model, and risk factors of a single company, in a sort of bottom-up approach. Quants build complex machine learning models fueled by large amounts of data, trying to spot correlations between the ingested data and the stock market, in a top-down approach.
As you can imagine, the two approaches require different types of data. For fundamental investors, a dataset can be very specific to a single stock ticker, but it should say something valuable about the business model behind that stock. For quants, since the data feeds the training of ML models, it should have some history (typically a few years; how many depends on the market you're covering and their model's needs) and cover many stocks, otherwise adding your data to the model for only a few names could be inefficient.
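The quant requirements above can be turned into a quick sanity check before pitching a dataset: does it have enough history and enough ticker breadth? The thresholds below (3 years, 50 tickers) are illustrative assumptions, not an industry standard.

```python
# Sketch: a coverage check on a dataset before pitching it to a quant fund.
# MIN_YEARS and MIN_TICKERS are illustrative thresholds, not a standard.
from datetime import date

MIN_YEARS = 3
MIN_TICKERS = 50

def coverage_ok(first_obs: date, last_obs: date, tickers: set) -> bool:
    """True if the dataset spans enough history and enough stocks."""
    years = (last_obs - first_obs).days / 365.25
    return years >= MIN_YEARS and len(tickers) >= MIN_TICKERS

# Deep history but only a handful of tickers fails the breadth requirement
few_tickers = {f"TICK{i}" for i in range(10)}
print(coverage_ok(date(2018, 1, 1), date(2023, 4, 1), few_tickers))  # False
```

A fundamental investor would happily take that same 10-ticker dataset if it said something unique about those businesses; the breadth constraint is specific to model-driven strategies.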
Final remarks
With this article, I wanted to introduce the Alternative Data landscape, since it's one of the growing sectors in the data world where web scraping plays a significant role.
It's a complex environment with strict rules on data sourcing, and a provider must also prove its accountability, so it's very difficult for a freelancer to enter, while it can be easier for established companies.
Depending on the investors, you may face different requirements on the timeframe and specs of the data, but if you can build a great data product, it could be a rewarding field to enter.
In the next The Lab episode, we'll try to build a dataset for financial investors as a fun experiment.