
The tools that transformed the way I work

Laurent Bilke

CEO and Head of Research at Alternative Macro Signals | NLP & ML applied to Macro | Economics and Monetary Policy | Digital Innovation Enthusiast |

Published: November 5, 2020

I spend a lot of time building models and have witnessed the explosion of unstructured data, mostly text, in finance. Data availability and models are both moving really fast now.

The toolbox needs to keep up, to make that transformation possible.

I thought I would simply share a couple of the tools that have made the jump into the world of unstructured data possible, as far as I am concerned.

In part, that's to help anybody going through the same transition today - maybe some of the ideas below can save them time. But it is also a token of thanks to these tools, because they are free to use and they have helped me so much.

1) MongoDB: solved the text database problem

I must confess I knew nothing about NoSQL databases before working full-time on text data. I come from time-series modelling, for which the database setup is generally straightforward - not super fun, but straightforward.

Obviously, storing text is another beast.

There is no way I would be able to analyze millions of news articles without a powerful unstructured (NoSQL) database such as MongoDB.

The revelation was: just store everything as JSON.

Everything!

Give the data the structure you need.

One item in the MongoDB database (a "document") will be one news article, with separate fields for title, content, sources, release date, the output of all the models, their parameters, the manual labels, etc. It is possible to do that in a structured (relational) database, but that would involve a huge amount of time just setting up the schema. I prefer to spend the time working on the models.
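A minimal sketch of what such a document could look like in Python. The field names, the helper functions and the example values are illustrative assumptions, not the actual schema used here.

```python
from datetime import datetime, timezone


def make_article_document(title, content, source, released_at):
    """Build one news article as a JSON-like dict (a MongoDB "document").

    All field names are hypothetical; give the data the structure you need.
    """
    return {
        "title": title,
        "content": content,
        "source": source,
        "released_at": released_at,
        "model_outputs": [],  # one entry per model run, with its parameters
        "manual_labels": [],  # human annotations, added later
    }


def add_model_output(doc, model_name, params, output):
    """Attach a model's output and its parameters to the article document."""
    doc["model_outputs"].append(
        {"model": model_name, "params": params, "output": output}
    )
    return doc


def store_article(collection, doc):
    """Insert the document; `collection` is assumed to be a PyMongo collection."""
    return collection.insert_one(doc).inserted_id


# Made-up example article
doc = make_article_document(
    title="Central bank keeps rates unchanged",
    content="The policy committee decided ...",
    source="example-newswire",
    released_at=datetime(2020, 10, 29, 12, 45, tzinfo=timezone.utc),
)
add_model_output(doc, "sentiment-v1", {"threshold": 0.5}, {"label": "neutral"})
```

With a MongoDB server running locally, `collection` could be obtained through PyMongo, for instance `MongoClient()["news"]["articles"]` (the database and collection names are hypothetical), and passed to `store_article`.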

And the best part is: time series also fit perfectly in there... It is much easier to fit structured data in an unstructured database than the other way around.
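To illustrate, a time series can be turned into documents in a few lines. This is only a sketch: the series identifier, the field names and the values are made up, and one document per observation is just one possible layout.

```python
from datetime import date


def series_to_documents(series_id, observations, unit=None):
    """Turn (date, value) pairs into one JSON-like document per observation."""
    return [
        # Dates are kept as ISO strings so each document stays plain JSON
        {"series_id": series_id, "date": d.isoformat(), "value": v, "unit": unit}
        for d, v in observations
    ]


docs = series_to_documents(
    "example_cpi_yoy",  # hypothetical series identifier
    [(date(2020, 8, 1), 1.3), (date(2020, 9, 1), 1.4)],  # made-up values
    unit="percent",
)
```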

I came to appreciate Compass, the GUI that comes with MongoDB. I think it's an essential tool to visualize text data. I can build pipelines (long and complicated queries) in Compass, check that everything works as intended, and then use them in Python. Before Compass, scanning through my database was painful.
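In Python, such a pipeline is simply a list of stage dictionaries. The sketch below filters articles by keyword and date, then counts them per source; the field names `title`, `source` and `released_at` and the function name are assumptions chosen for illustration.

```python
from datetime import datetime


def articles_per_source_pipeline(keyword, start, end):
    """Build an aggregation pipeline: filter articles, then count per source."""
    return [
        # Stage 1: keep articles whose title matches the keyword, in the window
        {"$match": {
            "title": {"$regex": keyword, "$options": "i"},
            "released_at": {"$gte": start, "$lt": end},
        }},
        # Stage 2: count the matching articles for each source
        {"$group": {"_id": "$source", "n_articles": {"$sum": 1}}},
        # Stage 3: most prolific sources first
        {"$sort": {"n_articles": -1}},
    ]


pipeline = articles_per_source_pipeline(
    "inflation", datetime(2020, 1, 1), datetime(2020, 2, 1)
)
```

With PyMongo, the list would then be passed to `collection.aggregate(pipeline)`, where `collection` is assumed to be an existing collection of article documents.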

Obviously, MongoDB works seamlessly in Python thanks to the PyMongo library.

A side note, in case anybody from statistical offices reads this post: please adopt the JSON-stat standard!

Statistical offices are all coming up with their own APIs, which is great, but their JSONs are all over the place. JSON-stat is an impressive attempt to harmonize things. Kudos to Eurostat and the other statistical offices (mostly in Europe) that have adopted it. And yes, we are watching you, US BLS and BEA...
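To show why a common standard helps, here is a hedged sketch of a generic reader that flattens any JSON-stat 2.0 dataset into rows, relying on the standard's row-major ordering of values (last dimension varies fastest). The function names and the sample dataset, including its numbers, are made up; dataset features beyond `id`, `dimension` and `value` are ignored.

```python
from itertools import product


def category_ids(dimension):
    """Return category ids in positional order."""
    category = dimension["category"]
    index = category.get("index")
    if index is None:
        # Single-category dimension described by its label only
        return list(category["label"])
    if isinstance(index, dict):
        # Object form: {category id: position}
        return sorted(index, key=index.get)
    return list(index)  # array form: already in positional order


def jsonstat_to_rows(dataset):
    """Flatten a JSON-stat 2.0 dataset into one dict per observation."""
    dim_ids = dataset["id"]
    categories = [category_ids(dataset["dimension"][d]) for d in dim_ids]
    values = dataset["value"]
    rows = []
    # product() varies the last dimension fastest, matching row-major order
    for position, combo in enumerate(product(*categories)):
        if isinstance(values, dict):
            value = values.get(str(position))  # sparse form, keyed by position
        else:
            value = values[position]
        rows.append({**dict(zip(dim_ids, combo)), "value": value})
    return rows


sample = {
    "version": "2.0",
    "class": "dataset",
    "id": ["geo", "time"],
    "size": [2, 2],
    "dimension": {
        "geo": {"category": {"index": {"DE": 0, "FR": 1}}},
        "time": {"category": {"index": ["2019", "2020"]}},
    },
    "value": [1.4, 0.4, 1.3, 0.5],  # made-up numbers
}
rows = jsonstat_to_rows(sample)
```

Because every adopting office uses the same layout, the same function can be reused across providers, and the resulting rows can go straight into a document database.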

2) Hugging Face Transformers: state-of-the-art Natural Language Processing made accessible

Now, the models.

I am not sure what Natural Language Processing looked like before Hugging Face's Transformers library... But I do realize the amount of work that is going on there to consistently deliver state-of-the-art transformer models in a streamlined pipeline.

This library (and all the work behind it) is one of the best illustrations of accelerating technology transfer: it cuts the time from state-of-the-art academic model to street application to almost nothing. I don't know if there are any measures of that phenomenon (there must be somewhere), but I would guess that the transmission lag has gone from quarters or years in the early 00s to just a few weeks now.

In any case, the Transformers library allows me to compare those huge language models and to retrain them to fit the specific job I need them to do.
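As a rough sketch of the comparison step, the snippet below runs several text classifiers on the same sentences and lines up their top predictions. The helper names are hypothetical, and the two model names in the usage comment are merely examples of publicly available Hub models, not the ones used here. Retraining (fine-tuning) is not shown.

```python
def load_classifiers(model_names):
    """Load one text-classification pipeline per Hugging Face Hub model name."""
    from transformers import pipeline  # imported lazily: heavy dependency

    return {
        name: pipeline("text-classification", model=name) for name in model_names
    }


def compare_models(classifiers, texts):
    """Run every classifier on every text and collect the top label and score."""
    results = []
    for text in texts:
        row = {"text": text}
        for name, classify in classifiers.items():
            # For a single string, the callable returns a list whose first
            # element is a dict with "label" and "score"
            top = classify(text)[0]
            row[name] = (top["label"], round(top["score"], 3))
        results.append(row)
    return results


# Example usage (downloads the models on first run):
# classifiers = load_classifiers([
#     "ProsusAI/finbert",
#     "distilbert-base-uncased-finetuned-sst-2-english",
# ])
# compare_models(classifiers, ["The central bank signalled further easing."])
```

Since `compare_models` only expects callables, the same function works with any classifier that returns its output in this shape.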

That a single library can fit in so many different models is incredible.

Of course, one still needs to know what they are doing. The time saved setting up the infrastructure or linking the models can be spent on properly analyzing the data and thinking about the modelling strategy.

I do believe Natural Language Processing is one of the most rapidly moving and fascinating areas in Machine Learning today, and there remain so many applications to be explored in my domain (finance and macro). And Hugging Face's Transformers is definitely making a huge contribution there.

These two tools are very different, but there is a logic in associating them: they are part of the infrastructure that makes working on unstructured data possible.

Saeed Amen

Co-founder at Turnleaf Analytics / Macro forecasting with ML

4 years ago

MongoDB is really easy to use, I often use it with Arctic, although potentially just storing in JSON is probably easier, like you suggest. It's cool that Transformers is open source too!
