Speed up Python data access by 30x & more
Let's say you send a letter from London to Tokyo. How long would it take to get a reply? At the bare minimum, it takes 12 hours for the letter to fly there, and another 12 hours for a reply to fly back, so at least a day (and that's ignoring the time it takes for your letter to be read, for a reply to be written, for it to be posted and so on). We could of course use faster means of communication like the phone or e-mail. Whilst the delay would be much lower, it would still be at least a few hundred milliseconds.
Whenever you are analysing market data in Python, or indeed any other language, a lot of time is spent loading data, even before you do any computations or statistical analysis. Just as with our letter example, the data you are trying to access is often across a network, so it takes time to fetch it before you can put it into your computer's RAM. The difficulty is that every time you change your Python code to tweak your analysis, whatever you loaded into memory is lost once the script has finished running. So the next time you run it, you have to go through the whole process of loading the data again, even though it's precisely the same dataset.

In my Python market data library findatapy, I've written a wrapper for arctic (my code here), which has been open sourced by Man-AHL. It takes pandas DataFrames, which can hold market data, compresses them heavily and sends them to MongoDB for storage. Compressing the data reduces the amount of disk space MongoDB needs to store it. Also, because the compression is done on the client side, less data has to travel over the network when it is sent back to your computer.
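To give a flavour of what this looks like in practice, below is a minimal sketch of storing and reading a DataFrame with arctic directly (rather than through the findatapy wrapper). It assumes a MongoDB instance running on localhost; the library name 'fx', the symbol 'EURUSD_1min' and the toy DataFrame are purely illustrative.

```python
# Minimal sketch: storing a pandas DataFrame in MongoDB via arctic.
# Assumes MongoDB is running on localhost; 'fx' and 'EURUSD_1min' are
# illustrative names, not ones used by findatapy itself.
import pandas as pd
from arctic import Arctic

store = Arctic('localhost')          # connect to MongoDB
store.initialize_library('fx')       # one-off: create a library
library = store['fx']

# a toy stand-in for 1 minute market data
df = pd.DataFrame({'mid': [1.0945, 1.0947, 1.0950]},
                  index=pd.date_range('2017-01-03 09:00', periods=3, freq='T'))

library.write('EURUSD_1min', df)     # compressed locally, then pushed to MongoDB
item = library.read('EURUSD_1min')   # fetched and decompressed locally
df_back = item.data
```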
As a bit of an experiment, I used my library findatapy (via arctic) to access 1 minute data from 2007 to the present day for 12 G10 FX crosses, which is stored on my MongoDB server. The output of this query amounts to around 40 million observations. The Python code also joins together all the time series and aligns them, which takes a bit of time. In total it took around 58 seconds to load all this FX data across my network and align it into a single dataset, ready to be number crunched. My MongoDB setup is far from optimal, and the database I was accessing was across a wifi network, rather than a wired gigabit network. If every time I rerun my Python script I have to go through this 58 second process to get a dataset, it's going to seriously slow down the process of market analysis, which is often an iterative one.

Luckily, there are lots of tricks you can use to make this faster. One solution is to cache the data in local RAM in such a way that it is still available even if we have to restart the Python process. We can use Redis to do this, which is a simple in-memory database (basically a key/value store). Once we've loaded up the data, we simply push it to Redis for temporary storage. Whenever we need it again, we just pull it from Redis, as in the sketch below. When we fetch this large dataset via Redis, it takes under 2 seconds, nearly 30 times quicker! Why is it so much quicker? We list some reasons below...
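Here is a rough illustration of that caching pattern, assuming a Redis server on localhost and pickle for serialisation (pyarrow or msgpack would also work and may well be faster); the key name and the load_fx_data_from_arctic helper are hypothetical, standing in for whatever slow query you want to avoid repeating.

```python
import pickle

import pandas as pd
import redis

# Assumes a Redis server running locally on the default port 6379.
r = redis.StrictRedis(host='localhost', port=6379, db=0)


def cache_dataframe(key, df, expiry_seconds=3600):
    """Serialise a DataFrame and push it into Redis with an expiry."""
    r.set(key, pickle.dumps(df), ex=expiry_seconds)


def fetch_dataframe(key):
    """Pull a DataFrame back out of Redis, or return None on a cache miss."""
    payload = r.get(key)
    return pickle.loads(payload) if payload is not None else None


# Usage pattern: try the cache first and only go back to MongoDB/arctic on a
# miss. load_fx_data_from_arctic() is a hypothetical loader standing in for
# the slow 58 second query described above.
#
# df = fetch_dataframe('fx_1min_g10')
# if df is None:
#     df = load_fx_data_from_arctic()
#     cache_dataframe('fx_1min_g10', df)
```

The key point is that the heavy lifting (pulling 40 million rows over the network and aligning them) happens once, after which the in-memory Redis copy serves every subsequent rerun of the script.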
Read the rest of the article on the Cuemacro website here