The tools that transformed the way I work
Laurent Bilke
CEO and Head of Research at Alternative Macro Signals | NLP & ML applied to Macro | Economics and Monetary Policy | Digital Innovation Enthusiast |
I spend a lot of time building models and have witnessed the explosion of non-structured data, mostly text, in finance. Data availability and models are moving really fast now.
The toolbox needs to keep up, to make that transformation possible.
I thought I would simply share a couple of the tools that made my own jump into the world of unstructured data possible.
In part, that is to help anybody going through the same transition today: maybe some of the ideas below can save them time. But it is also a token of thanks to these tools, because they are free to use and they have helped me so much.
1) MongoDB: solved the text database problem
I must confess I knew nothing about NoSQL databases before working full-time on text data. I come from time-series modelling, for which the database setup is generally straightforward: not super fun, but straightforward.
Storing text, obviously, is another beast.
There is no way I would be able to analyze millions of news articles without a powerful NoSQL document database such as MongoDB.
The revelation was: just store everything as JSON.
Everything!
Give the data the structure you need.
One item in the MongoDB database (a "document") is one news article, with separate fields for the title, content, sources, release date, the output of all the models, their parameters, the manual labels, etc. It is possible to do that in a structured database, but that would involve a huge amount of time just setting up the format. I prefer to spend that time working on the models.
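To make that concrete, here is a minimal sketch of what one such document could look like in Python. The field names (`released_at`, `model_outputs`, `manual_labels`), the model name and the article itself are made up for illustration, not my actual schema. `store_article` assumes it is handed a PyMongo collection, for example `MongoClient()["news"]["articles"]`, and only relies on its `insert_one` method.

```python
import json
from datetime import datetime, timezone


def make_article_document(title, content, source, released_at):
    """Build one news article as a plain dict, i.e. one MongoDB "document"."""
    return {
        "title": title,
        "content": content,
        "source": source,
        # Stored as an ISO string so the dict is directly JSON-serializable;
        # PyMongo can also store datetime objects natively as BSON dates.
        "released_at": released_at.isoformat(),
        "model_outputs": [],   # one entry per model run
        "manual_labels": [],   # human annotations, if any
    }


def add_model_output(doc, model_name, parameters, output):
    """Attach a model's output and its parameters to the article itself."""
    doc["model_outputs"].append(
        {"model": model_name, "parameters": parameters, "output": output}
    )
    return doc


def store_article(collection, doc):
    """Insert the document; `collection` is assumed to be a PyMongo collection."""
    return collection.insert_one(doc).inserted_id


# Illustrative, made-up article and model output.
article = make_article_document(
    title="Central bank holds rates",
    content="The central bank left its policy rate unchanged.",
    source="example-newswire",
    released_at=datetime(2021, 3, 11, 12, 45, tzinfo=timezone.utc),
)
add_model_output(article, "sentiment-v1", {"threshold": 0.5}, {"label": "neutral"})
print(json.dumps(article, indent=2))
```

Adding a new model later just means appending another entry to `model_outputs`; no table has to be altered.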
And the best part is: time series also fit in there perfectly... It is much easier to fit structured data in an unstructured database than the other way around.
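As a rough illustration of that point, a time series can be stored as a single document holding its metadata and a list of observations. The layout below (`series_id`, `frequency`, `unit`, `observations`) and the numbers are assumptions chosen for the example, not a standard.

```python
from datetime import date


def make_series_document(series_id, frequency, unit, observations):
    """Pack a time series into one JSON-friendly document.

    `observations` is an iterable of (date, value) pairs.
    """
    return {
        "series_id": series_id,
        "frequency": frequency,
        "unit": unit,
        "observations": [
            {"date": d.isoformat(), "value": v} for d, v in sorted(observations)
        ],
    }


def to_pairs(doc):
    """Recover the (date, value) pairs from a stored series document."""
    return [(date.fromisoformat(o["date"]), o["value"]) for o in doc["observations"]]


# Made-up monthly observations, deliberately given out of order.
series = make_series_document(
    series_id="example_cpi_yoy",
    frequency="monthly",
    unit="percent",
    observations=[(date(2021, 2, 1), 0.9), (date(2021, 1, 1), 0.7)],
)
print(to_pairs(series))
```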
I came to appreciate Compass, which comes with MongoDB. I think it is an essential tool to visualize text data. I can build aggregation pipelines (long and complicated queries) in Compass, check that everything works as intended, and then use them in Python. Before Compass, scanning through my database was painful.
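For readers who have not seen one, a pipeline is just an ordered list of stages, and on the Python side it is a list of dicts passed to the collection's `aggregate` method. The sketch below counts articles per source for a keyword; the field names (`title`, `source`, `released_at` stored as an ISO date string) are assumptions for illustration.

```python
def articles_per_source_pipeline(keyword, start_iso_date, limit=10):
    """Return aggregation stages counting matching articles per source."""
    return [
        # Keep articles released on or after the start date whose title
        # contains the keyword (case-insensitive).
        {"$match": {
            "released_at": {"$gte": start_iso_date},
            "title": {"$regex": keyword, "$options": "i"},
        }},
        # Count the matching articles for each source.
        {"$group": {"_id": "$source", "n_articles": {"$sum": 1}}},
        # Largest sources first, then keep only the top ones.
        {"$sort": {"n_articles": -1}},
        {"$limit": limit},
    ]


def run_pipeline(collection, pipeline):
    """Run the stages; `collection` is assumed to be a PyMongo collection."""
    return list(collection.aggregate(pipeline))


pipeline = articles_per_source_pipeline("inflation", "2021-01-01")
for stage in pipeline:
    print(stage)
```

Keeping the pipeline in a small function makes it easy to reuse the stages that were checked visually with different keywords or dates.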
Obviously, MongoDB works seamlessly in Python thanks to the PyMongo library.
A side note, in case anybody from statistical offices reads this post: please adopt the JSON-stat standard!
Statistical offices are all coming up with their own APIs, which is great, but their JSONs are all over the place. JSON-stat is an impressive attempt to harmonize things. Kudos to Eurostat and the other statistical offices (mostly in Europe) who have adopted it. And yes, we are watching you, the US BLS and BEA...
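To show why a common format helps, here is a small sketch that flattens a JSON-stat 2.0 dataset into one record per observation. It assumes the simple case: `value` is a dense array in row-major order (the last dimension in `id` varies fastest) and every dimension has a category `index`. The sample dataset and its numbers are made up.

```python
from itertools import product


def category_ids(dimension):
    """Return a dimension's category ids in positional order."""
    index = dimension["category"]["index"]
    if isinstance(index, dict):
        # Object form: {"DE": 0, "FR": 1} maps each id to its position.
        return sorted(index, key=index.get)
    # Array form: ["2020", "2021"] is already in positional order.
    return list(index)


def jsonstat_to_records(dataset):
    """Flatten a dense JSON-stat dataset into a list of dicts."""
    dim_ids = dataset["id"]
    categories = [category_ids(dataset["dimension"][d]) for d in dim_ids]
    records = []
    # product() varies the last dimension fastest, matching row-major order.
    for position, combo in enumerate(product(*categories)):
        record = dict(zip(dim_ids, combo))
        record["value"] = dataset["value"][position]
        records.append(record)
    return records


# Made-up dataset with two dimensions of two categories each.
sample = {
    "version": "2.0",
    "class": "dataset",
    "id": ["geo", "time"],
    "size": [2, 2],
    "dimension": {
        "geo": {"category": {"index": {"DE": 0, "FR": 1}}},
        "time": {"category": {"index": ["2020", "2021"]}},
    },
    "value": [1.0, 2.0, 3.0, 4.0],
}

for row in jsonstat_to_records(sample):
    print(row)
```

The same function works for any office that follows the standard, which is the whole point: one parser instead of one per API.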
2) Hugging Face Transformers: State-of-the-Art Natural Language Processing made accessible
Now, the models.
I am not sure what Natural Language Processing looked like before Hugging Face's Transformers library... But I do realize the amount of work that is going on there to consistently deliver state-of-the-art transformer models in a streamlined pipeline.
This library (and all the work behind it) is one of the best illustrations of accelerating technology transfer: it cuts the time from state-of-the-art academic model to street application to almost nothing. I don't know if there are any measures of that phenomenon (there must be somewhere), but I would guess that the transmission lag has gone from quarters or years in the early 00s to just a few weeks now.
In any case, the Transformers library allows me to compare those huge language models and to retrain them to fit the specific job I need them to do.
That a single library can fit in so many different models is incredible.
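Here is a hedged sketch of what comparing models can look like. Each classifier is any callable that takes a list of texts and returns a list of dicts with a `label` key, which is the shape a Transformers `text-classification` pipeline returns. The demo uses stub classifiers so it runs without downloading anything; `load_hub_classifiers` shows how real ones could be built and is not called here. The helper names and the example sentences are illustrative assumptions.

```python
def compare_models(classifiers, texts):
    """Run every classifier on the same texts; return labels per model."""
    results = {}
    for name, classify in classifiers.items():
        predictions = classify(texts)
        # Lowercase labels, since models may differ in label casing.
        results[name] = [p["label"].lower() for p in predictions]
    return results


def agreement_rate(results):
    """Share of texts on which all models predict the same label."""
    rows = list(zip(*results.values()))
    if not rows:
        return 0.0
    return sum(len(set(row)) == 1 for row in rows) / len(rows)


def load_hub_classifiers(model_names):
    """Build real classifiers (assumes `transformers` is installed and the
    named models can be downloaded). Not called in this sketch."""
    from transformers import pipeline  # imported lazily on purpose
    return {
        name: pipeline("text-classification", model=name) for name in model_names
    }


# Stub classifiers standing in for real models, with made-up behaviour.
def always_positive(texts):
    return [{"label": "POSITIVE", "score": 0.9} for _ in texts]


def keyword_model(texts):
    return [
        {"label": "negative" if "cut" in t else "positive", "score": 0.8}
        for t in texts
    ]


texts = ["Growth beats expectations", "Firms cut investment plans"]
results = compare_models({"stub_a": always_positive, "stub_b": keyword_model}, texts)
print(results)
print(agreement_rate(results))
```

Models trained on different tasks often use different label sets, so in practice a mapping between label names is usually needed before agreement means anything.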
Of course, one still needs to know what one is doing. The time saved on setting up the infrastructure or linking the models can be spent on properly analyzing the data and thinking about the modelling strategy.
I do believe Natural Language Processing is one of the most rapidly moving and fascinating areas in Machine Learning today, and there remain so many applications to be explored in my domain (finance and macro). Hugging Face's Transformers is definitely making a huge contribution there.
These two tools are very different, but there is a logic in associating them: they are part of the infrastructure that makes working on unstructured data possible.
Co-founder at Turnleaf Analytics / Macro forecasting with ML
4 years ago: MongoDB is really easy to use, I often use it with Arctic, although potentially just storing in JSON is probably easier, like you suggest. It's cool that Transformers is open source too!