Web scraping and alternative data for financial markets
Pierluigi Vinciguerra
Co Founder and CTO at Databoutique.com | Writing on The Web Scraping Club
In many posts we have seen how to scrape the web under different circumstances, such as Cloudflare-protected websites or mobile apps. Still, we haven't looked much into which sectors benefit the most from the scraped data.
Some of them are fairly obvious: every e-commerce company selling online wants to understand what its competitors are selling, and the same applies to delivery apps and many other markets.
But there's one sector that is always hungry for data, since a new and reliable dataset can translate into millions of dollars in benefits: finance. With my current company, RE-Analytics, we have gained (and are still gaining) experience on the ground in this industry, so in this post and in the next The Lab (scheduled for the 27th of April on The Web Scraping Club) we'll take a deep dive into the world of Alternative Data.
What do we mean by Alternative Data?
In the financial sector, data is key for investors making decisions about their investment strategies.
As the world's economy becomes increasingly digitalized, new sources of information are emerging. Traditional financial data sources, such as financial statements, balance sheets, and historical market data, keep their importance, but sources from the digital world are becoming more and more interesting to the financial industry: these are called alternative data. By the common definition, alternative data is information about a particular company, published by sources outside that company, which can provide unique and timely insights into investment opportunities.
What types of alternative data exist?
Since alternative data is data that doesn't come from inside the company, the definition covers a wide range of possibilities.
As we can see from this slide from alternativedata.org, in addition to web data there are satellite images, credit card transactions, sentiment data, and more.
A 2010 article helps us understand how satellite images can be used for financial purposes: a company called Remote Sensing Metrics LLC monitored 100 Walmart parking lots as a representative sample and counted, month by month, the cars parked outside, estimating quarterly revenues from the flow of shoppers.
As you can easily understand, you won't get the figure down to the dollar but, with a proper model, as soon as the quarter ends you can have a revenue estimate for a publicly traded stock months before the official numbers are available to everyone.
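The idea of "a proper model" can be as simple as a linear regression of reported revenues on observed car counts. Here's a minimal sketch; all figures below are invented for illustration, and a real model would be fit on actual historical (car count, reported revenue) pairs.

```python
# Sketch: estimating quarterly revenue from parking-lot car counts.
# All numbers are made up for illustration purposes.

def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b*x, in pure Python."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    b = cov / var
    a = mean_y - b * mean_x
    return a, b

# Hypothetical history: average cars counted per lot vs. reported revenue ($B)
car_counts = [1180, 1250, 1100, 1320, 1210]
revenues = [114.0, 120.5, 108.2, 127.0, 116.8]

a, b = fit_linear(car_counts, revenues)

# As soon as the quarter ends, plug in the freshly counted cars
latest_count = 1290
estimate = a + b * latest_count
print(f"Estimated quarterly revenue: ${estimate:.1f}B")
```

The edge comes purely from timing: the car counts are observable weeks or months before the company files its official numbers.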
Why is web scraping important in the alternative data landscape?
As we have seen from the slide before, alternative data providers can be divided into two groups: those who own, or rework others', proprietary data (in the satellite image example above, the firm acquired the images from providers and tailored its data product into a Walmart revenue estimator), and those who extract data from public sources and derive insights from it (the web data category, but also sentiment).
An example of the second category is a company that, starting from online reviews, extracts customer sentiment and assesses whether the targeted company is losing its grip on its customers. Or, by scraping e-commerce data, we can see whether a brand is discounting much more heavily than its direct competitors, which can be a sign of product or sales issues.
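The discount comparison can be sketched in a few lines: given scraped product records, compute the share of each brand's assortment currently on sale. The records and field names below are invented examples, not a real schema.

```python
# Sketch: comparing how heavily brands are discounted in a scraped
# e-commerce snapshot. Records and field names are invented examples.
from collections import defaultdict

products = [
    {"brand": "BrandA", "full_price": 100.0, "current_price": 70.0},
    {"brand": "BrandA", "full_price": 80.0, "current_price": 65.0},
    {"brand": "BrandB", "full_price": 120.0, "current_price": 120.0},
    {"brand": "BrandB", "full_price": 60.0, "current_price": 45.0},
]

on_sale = defaultdict(int)
totals = defaultdict(int)
for p in products:
    totals[p["brand"]] += 1
    if p["current_price"] < p["full_price"]:
        on_sale[p["brand"]] += 1

for brand in sorted(totals):
    share = on_sale[brand] / totals[brand]
    print(f"{brand}: {share:.0%} of assortment on sale")
```

A brand with a persistently higher on-sale share than its peers is the kind of signal a fundamental analyst might dig into.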
What should you watch out for with web-scraped data in the financial industry?
Financial markets are strictly regulated to avoid fraud and so-called insider trading
For this reason, if you want to sell data to hedge funds and investors, be prepared for a lot of paperwork.
A good article by Zyte explains in detail what you need to prove to funds to demonstrate that you collected data properly. As stated in the article:
Generally speaking, the risks associated with alternative data can be broken into four categories:
Let's briefly summarize the risks related to each of the four categories.
Exclusivity & Insider Trading
As we said before, insider trading is the practice of trading stocks based on information that is not publicly available. This means that data behind a paywall is generally off limits, since it is available only to paying users of the target website, not to everyone. Likewise, if a scraper needs to log in to the target website to get some data, that raises red flags, and you must be sure you're not breaking any ToS by doing so.
Privacy violations
This applies when scraping personal data from the web, since in the last few years privacy regulations, such as the GDPR in Europe and the CCPA in California, have become much stricter about how personal data can be collected and processed.
For this reason, scraping personal data is generally a no-go in every project, unless you can anonymize it.
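One common anonymization technique is salted hashing of identifiers, so records can still be linked without storing the raw value. This is a minimal sketch, not compliance advice; the salt name and policy are assumptions, and whether hashing counts as sufficient anonymization under a given regulation needs legal review.

```python
# Sketch: pseudonymizing a personal identifier with a salted hash.
# SALT is a hypothetical project secret that must be kept out of the dataset.
import hashlib

SALT = b"project-specific-secret"

def anonymize(email: str) -> str:
    """Map an email to a stable, non-reversible 64-char hex token."""
    return hashlib.sha256(SALT + email.strip().lower().encode()).hexdigest()

# The same person always maps to the same token, but the raw email is gone
print(anonymize("jane.doe@example.com")[:12])
```

The normalization (strip and lowercase) ensures the same address scraped in different forms yields the same token.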
Copyright Infringement
Scraping and reselling copyright-protected content, such as photos and articles, is a bad idea in general, and in this context it's a definite no.
Data acquisitions
Funds will certainly want to verify that the whole data acquisition process was executed as fairly as possible and did not cause any harm to the target websites.
For this reason, the Investment Data Standards Organization released a checklist with best practices to follow for web scraping.
You can find all the points in the linked file; as you can see, the guidelines are very strict, to avoid any possible problem for both the data provider and the fund itself.
Key features for web-scraped data in the financial industry
Given all this information and these premises, what features does the financial industry require before considering a dataset interesting?
Well, first of all, we need to understand that each fund has its own strategy for studying the markets, and this shapes what it is looking for.
Oversimplifying, we can divide funds into two macro-categories, quantitative and fundamental, keeping in mind that many funds sit somewhere between these two poles, mixing the two strategies.
Basically, fundamental investors study the economics, business model, and risk factors of a single company, in a sort of bottom-up approach. Quants build complex machine learning models fueled by large amounts of data, trying to spot correlations between the ingested data and the stock market, in a top-down approach.
As you can imagine, the two approaches require different types of data. For fundamental investors, a dataset can be very specific to a single stock ticker, but it should say something valuable about the business model behind that stock. For quants, since the data feeds the training of ML models, it should have some history (typically a few years; how many depends on the market you're covering and their model's needs) and cover many stocks, otherwise adding your data to the model for only a few names could be inefficient.
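The quant requirements above can be turned into a quick sanity check before pitching a dataset: does it have enough history and enough ticker breadth? The thresholds below (3 years, 50 tickers) are illustrative assumptions, not an industry standard.

```python
# Sketch: a coverage check on a dataset before pitching it to a quant fund.
# MIN_YEARS and MIN_TICKERS are illustrative thresholds, not a standard.
from datetime import date

MIN_YEARS = 3
MIN_TICKERS = 50

def coverage_ok(first_obs: date, last_obs: date, tickers: set) -> bool:
    """True if the dataset spans enough history and enough stocks."""
    years = (last_obs - first_obs).days / 365.25
    return years >= MIN_YEARS and len(tickers) >= MIN_TICKERS

# Deep history but only a handful of tickers fails the breadth requirement
few_tickers = {f"TICK{i}" for i in range(10)}
print(coverage_ok(date(2018, 1, 1), date(2023, 4, 1), few_tickers))  # False
```

A fundamental investor would happily take that same 10-ticker dataset if it said something unique about those businesses; the breadth constraint is specific to model-driven strategies.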
Final remarks
With this article, I wanted to introduce the Alternative Data landscape, since it's one of the growing sectors in the data world where web scraping plays a significant role.
It's a complex environment with strict rules on data sourcing, and a provider must also prove its accountability, so it's very difficult for a freelancer to enter, while it can be easier for established companies.
Depending on the investors, you may face different requirements on the timeframe and specs of the data, but if you can build a great data product, it could be a rewarding field to enter.
In the next The Lab episode, we'll try to build a dataset for financial investors as a fun experiment.