Part I: NLP in Economics – Touching Base With the Basics

By Sam van de Schootbrugge

Summary

  • A new CEPR working paper, co-authored by UCL Professor Stephen Hansen, provides an overview of the methods used for algorithmic text analysis in economics.
  • The paper covers the fundamentals, explaining how semantic meaning embedded in words can be captured by algorithms like BERT and GPT. We cover this in Part I.
  • In Part II, we will summarise how such data representations of text can be used to solve four common economic problems, and the challenges faced in doing so.

Introduction

The rapid development in Natural Language Processing (NLP) has fostered a diverse methodological frontier. While exciting, especially given the emergence of a new generation of deep neural network models known as Transformers, there remains little guidance for researchers on how best to deploy these new techniques.

This lack of structure means there is no common framework or even vocabulary for analysing text. In an attempt to bridge the gap, a new CEPR working paper provides a conceptual overview of the methods that now form the basic building blocks of algorithmic text analysis in economics.

The Building Blocks of Text Analysis

Textual analysis begins with ‘documents’. These could be easily machine-readable texts (e.g., Word files or PDFs), but could also come in more challenging formats, such as markup languages (e.g., HTML or XML) or scanned image files (e.g., PDFs of historical books). To extract these texts in Python, researchers typically use the following software packages (a minimal extraction sketch follows the list):

  • Beautiful Soup (HTML/XML parsing).
  • Layout Parser (optical character recognition).
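
For HTML or XML sources, a minimal Beautiful Soup sketch could look like the following; the URL and tag choices are purely illustrative, not taken from the paper:

  # Illustrative only: extract visible paragraph text from a (hypothetical) HTML page.
  import requests
  from bs4 import BeautifulSoup

  html = requests.get("https://example.com/press-release.html").text  # hypothetical URL
  soup = BeautifulSoup(html, "html.parser")

  for tag in soup(["script", "style"]):   # drop non-content markup
      tag.decompose()

  document = " ".join(p.get_text(separator=" ", strip=True) for p in soup.find_all("p"))
  print(document[:200])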

Once extracted and organised, raw documents are then converted into sequences of linguistic features by (i) splitting sentences on whitespace/punctuation (tokenising); (ii) dropping non-letter characters; (iii) dropping common stop words, like ‘the’, ‘to’ and ‘is’; (iv) converting letters to lowercase; and (v) stemming words to remove suffixes (the Porter stemmer is a common default).
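
As a rough sketch, these five steps might look as follows in Python, assuming the NLTK package for stemming; the sentence and stop-word list are illustrative:

  # Illustrative pre-processing pipeline: tokenise, clean, and stem a toy sentence.
  import re
  from nltk.stem import PorterStemmer

  text = "The committee is raising rates to fight persistent inflation."
  tokens = re.findall(r"[a-z]+", text.lower())        # (i), (ii), (iv): tokenise, drop non-letters, lowercase
  stop_words = {"the", "to", "is", "a", "of", "and"}  # (iii): small illustrative stop-word list
  tokens = [t for t in tokens if t not in stop_words]

  stemmer = PorterStemmer()                           # (v): reduce words to root forms
  print([stemmer.stem(t) for t in tokens])            # e.g., 'inflation' becomes 'inflat'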

In economics, this standard pre-processing approach represents documents as lists of words, typically reduced to some root form. One such representation is the bag-of-words model, where each unique vocabulary term is assigned an index value from 1 through to V. Each term can then be counted by document and stored in a document-term matrix.

This matrix stores the counts of each vocabulary term by document: each column corresponds to a vocabulary term and each row to a document (Example 1). For example, ‘growth’ may show up 100 times in document 1, and zero times in document 20. Meanwhile, ‘recession’ does not show up in document 1, but has a high frequency in document 20.

[Example 1: a document-term matrix – image not reproduced]

Generally, there could be tens of thousands of columns in this high-dimensional matrix. The matrix is also sparse: most vocabulary terms do not appear in any given document, so most entries are zero.
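
As a minimal sketch, a document-term matrix of this kind can be built with scikit-learn’s CountVectorizer; the two toy documents below are illustrative:

  # Illustrative: build a sparse document-term matrix from two toy documents.
  from sklearn.feature_extraction.text import CountVectorizer

  docs = ["growth strong growth outlook improving growth",
          "recession risk rising recession fears deepen"]

  vectorizer = CountVectorizer()
  dtm = vectorizer.fit_transform(docs)         # sparse matrix: rows = documents, columns = terms
  print(vectorizer.get_feature_names_out())    # the vocabulary (column labels)
  print(dtm.toarray())                         # term counts by document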

From PCA to LDA

The next step is to add meaning to the words. This involves reducing the dimension of our ‘document-term’ space into a more helpful ‘meaning’ space.

For economists, you can think of this as a factor analysis designed to capture structure in high-dimensional economic data. One of the most common dimensionality reduction techniques is a principal component analysis (PCA).

Say we wanted to know the main driver of G4 headline inflation. In this example, we would have four columns (US, EU, UK, and Japanese inflation data) and 120 rows (monthly observations over 10 years).

PCA would reduce this 120×4 matrix to a 4×1 vector, leaving just one column and four rows. The column is our first principal component, the common factor that explains most of the variation across the four inflation series. The rows (loadings) show how strongly this component is reflected in each area’s headline inflation.
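
A minimal sketch of this inflation example, using simulated data in place of actual G4 series, might look like this:

  # Illustrative: PCA on a simulated 120 x 4 matrix of monthly 'inflation' data.
  import numpy as np
  from sklearn.decomposition import PCA

  rng = np.random.default_rng(0)
  common = rng.normal(size=(120, 1))                               # a shared driver
  X = common @ np.ones((1, 4)) + 0.3 * rng.normal(size=(120, 4))   # 'US', 'EU', 'UK', 'JP'

  pca = PCA(n_components=1).fit(X)
  print(pca.components_)                # 1 x 4 loadings: how each series reflects the common component
  print(pca.explained_variance_ratio_)  # share of total variation the first component explains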

In this example, we find the degree to which inflation across multiple countries is influenced by a common macroeconomic driver. In textual analysis, we want to uncover the extent to which words across multiple documents are driven by common themes. For this, we use Latent Dirichlet Allocation (LDA).

LDA looks for a workable thematic summary of words in our document-term matrix. Each theme, or topic, can be found by searching for groups of words that frequently occur together in documents across our body of texts. Each term within a topic is assigned a probability, and those assigned especially high probabilities govern the topic’s ‘theme.’

For example, high probabilities assigned to words like ‘quake’ or ‘tsunami’ likely imply that the topic they belong to is ‘natural disasters.’ In turn, if ‘natural disasters’ is given a high probability versus other topics uncovered in the text corpus, the document is likely to be about, say, global warming rather than FOMC meetings (Example 2).

[Example 2: illustration of LDA topics – image not reproduced]

More formally, in LDA topic modelling, documents are probability distributions over latent topics, and the topics themselves are probability distributions over words. Just like PCA, labelling the common components or themes is up to the end user and requires some domain expertise.
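
As a minimal sketch, scikit-learn’s LatentDirichletAllocation can recover such topics from a toy document-term matrix; the documents and the choice of two topics are illustrative:

  # Illustrative: fit a two-topic LDA model on four toy documents.
  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.decomposition import LatentDirichletAllocation

  docs = ["quake tsunami damage relief",
          "rates inflation fomc hike",
          "tsunami warning quake aftermath",
          "fomc statement inflation outlook"]
  dtm = CountVectorizer().fit_transform(docs)

  lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
  print(lda.components_)     # topics as (unnormalised) distributions over words
  print(lda.transform(dtm))  # documents as distributions over the two topics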

Semantic Meaning in a Local Context

Standard LDA models elicit meaning at a global level. They impute information from word frequencies independently of where they occur in our texts. However, semantic meaning is largely contained in the local context – a word’s meaning will depend on either its immediate or its longer-range neighbours.

While the bag-of-words model can be extended locally by tabulating n-grams, an influential line of work in NLP reframes the global analysis as a local one by measuring each term’s local co-occurrence with other terms. Known as word embedding, these models compress the high-dimensional word counts into relatively low-dimensional vectors based on local co-occurrence patterns across documents.

For co-occurrence at a local level, Word2Vec is perhaps the most influential word embedding model (see GloVe for global co-occurrence). Using both individual words and a sliding window of context words surrounding them, the algorithm either predicts the current word from the surrounding context (CBOW) or vice versa (skip-gram).

Intuitively, the algorithm gives similar representations to words that appear in similar contexts across documents. If researchers have a lot of text data, the algorithm can estimate bespoke embeddings to capture word meanings specific to the application – this is self-supervised learning.
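
A minimal self-supervised sketch with gensim’s Word2Vec, on a toy corpus with illustrative hyperparameters, might look like this:

  # Illustrative: train bespoke word embeddings on a tiny toy corpus.
  from gensim.models import Word2Vec

  sentences = [["inflation", "rose", "sharply", "last", "quarter"],
               ["prices", "rose", "sharply", "this", "month"],
               ["the", "committee", "raised", "interest", "rates"]]

  model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)
  print(model.wv["inflation"][:5])           # the learned embedding (first 5 dimensions)
  print(model.wv.most_similar("inflation"))  # words that appear in similar contexts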

With smaller datasets, one can use pre-trained embeddings estimated on a large, auxiliary corpus (like Wikipedia) and port them to a new application. This strategy is an application of transfer learning, a machine learning methodology that focuses on applying knowledge gained from solving one task to a related task. This approach is not often used in economics, because generic embeddings may not produce the most useful word representations for economic tasks.
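
As a sketch of the transfer-learning route, gensim’s downloader can load embeddings pre-trained on a large generic corpus (the model name below is one of gensim’s published pre-trained sets; the query word is illustrative):

  # Illustrative: load pre-trained GloVe vectors and inspect nearest neighbours.
  import gensim.downloader as api

  glove = api.load("glove-wiki-gigaword-50")      # downloads pre-trained embeddings
  print(glove.most_similar("recession", topn=5))  # neighbours in the generic embedding space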

Transformers in NLP

ChatGPT is a transformer-based, pre-trained language model. To see how it works, imagine the following two sentences, where [MASK] refers to an omitted word:

As a leading firm in the [MASK] sector, we hire highly skilled software engineers.

As a leading firm in the [MASK] sector, we hire highly skilled petroleum engineers.

Humans intuitively know which key words to focus on to predict omitted words. In the example, both sentences are the same, except for the words ‘software’ and ‘petroleum.’ These allow us to infer that the omitted words are likely to be ‘IT’ in the first sentence and ‘energy’ in the second.

Word embedding algorithms cannot do this. They weight all words in the context window equally when constructing embeddings. A recent breakthrough in NLP has been to train algorithms to pay attention to relevant features for prediction problems in a context-specific manner.

Self-attention, as this is known, takes a sequence of initial token embeddings (from, say, Word2Vec) and outputs new token embeddings that allow the initial embeddings to interact. The new embeddings are weighted averages of the initial ones, and the weights determine which pairs of tokens interact to form a context-sensitive word embedding.

The attention weights in the self-attention function are estimated by?Transformers?– large neural networks – to successfully perform masked-word prediction, like in the example above. Unlike Recurrent Neural Networks (RNNs) before them, they process the entire textual input all at once, increasing parallelisation and reducing training times.
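
A minimal numpy sketch of single-head scaled dot-product self-attention conveys the idea; the random matrices stand in for initial embeddings and for the projection matrices a Transformer would learn:

  # Illustrative: scaled dot-product self-attention over six toy tokens.
  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(6, 8))                      # 6 tokens, 8-dimensional initial embeddings
  Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))

  Q, K, V = X @ Wq, X @ Wk, X @ Wv
  scores = Q @ K.T / np.sqrt(K.shape[1])           # how much each token attends to every other
  weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
  new_embeddings = weights @ V                     # context-sensitive token embeddings
  print(weights.round(2))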

Generative pre-trained transformers (GPT) are a family of models pre-trained to perform next-token prediction on large corpora of generic text (e.g., Wikipedia, Common Crawl, etc.). Another family of well-known models – Bidirectional Encoder Representations from Transformers (BERT) – instead perform masked-token prediction.

Masked-token prediction is effectively what we tried above. DistilBERT produced the following list of words most likely to fit the masked words for the two example sentences (Table 1). As we can see, it does a good job of identifying important information, even when it lies several tokens away from masked words.

[Table 1: DistilBERT’s most likely words for the masked positions in the two example sentences – image not reproduced]
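
This kind of masked-token prediction can be reproduced with a pre-trained DistilBERT via the Hugging Face transformers pipeline; the model name below is a standard public checkpoint, not necessarily the exact one behind Table 1:

  # Illustrative: predict the masked word in the first example sentence.
  from transformers import pipeline

  fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")
  sentence = "As a leading firm in the [MASK] sector, we hire highly skilled software engineers."
  # Swap in 'petroleum engineers' to reproduce the second example sentence.
  for pred in fill_mask(sentence.replace("[MASK]", fill_mask.tokenizer.mask_token))[:3]:
      print(pred["token_str"], round(pred["score"], 3))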

Modern NLP models have made large strides forward in understanding semantic meaning in everyday contexts. They can, however, also be fine-tuned for supervised learning tasks – that is, updated for prediction in specific contexts. And, because Transformer models have a good general understanding of diverse texts, fine-tuning achieves good performance even with relatively few labelled training samples.
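
A highly simplified fine-tuning sketch, attaching a classification head to a pre-trained DistilBERT and taking a single gradient step on two illustrative labelled examples, shows the idea (texts, labels and hyperparameters are made up):

  # Illustrative: one fine-tuning step for a two-class text classifier.
  import torch
  from transformers import AutoTokenizer, AutoModelForSequenceClassification

  tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
  model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
  model.train()

  texts = ["Growth is accelerating across sectors.", "The economy slid into recession."]
  labels = torch.tensor([1, 0])                      # e.g., 1 = expansion, 0 = contraction
  batch = tokenizer(texts, padding=True, return_tensors="pt")

  optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
  loss = model(**batch, labels=labels).loss          # cross-entropy from the classification head
  loss.backward()
  optimizer.step()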

To finish this section, the authors point out that these models have downsides. Transformers lack transparency, making it impossible to replicate the full estimation pipeline. They also require vast hardware resources, meaning most researchers must begin by downloading previously fitted models and updating them.

Moreover, Transformer models only operate on relatively short documents. This works well for sentences or paragraphs, but not for longer documents such as speeches or corporate filings. For longer documents, it is usually better to use non-Transformer-based alternatives like gradient boosting.

Bottom Line

Text algorithms are unlocking many interesting research questions for economists. The first part of this two-part summary on NLP in economics provides some structure on how to leverage information in texts. From inputting texts to reducing high-dimensional matrices, and from equally weighted word embeddings to trained attention weights, the paper helpfully merges the basics with frontier NLP research. I hope you find it helpful…


Sam van de Schootbrugge is a Macro Research Analyst at Macro Hive, currently completing his PhD in Economics. He has a master’s degree in economic research from the University of Cambridge and has worked in research roles for over 3 years in both the public and private sector. His research expertise is in international finance, macroeconomics and fiscal policy.
