Equity fundamental analysis with LLMs

There is great buzz around generative AI these days. The introduction of ChatGPT and its meteoric adoption have made LLMs (Large Language Models) the flavour of the season. Everyone is looking to ride the GenAI wave to reap its benefits. Possible business benefits lie in automating mundane tasks, generating marketing collateral, drafting automated emails and building chatbots that converse like real humans. All of this reduces manual labour, and hence translates into gains in productivity and dollars.

LLMs are language models trained on a very large corpus of text data, usually scraped from the internet. Although they are primarily trained to be generative by predicting the next probable word (like “star” after “Twinkle twinkle little”), LLMs are also great at other natural language processing tasks like text classification, sentiment analysis, question answering and summarization. LLMs don’t just memorize the most common text they are trained on; they also learn the patterns and semantics needed to understand text.

The Microsofts, Amazons, Metas and Googles of the world have already done the heavy lifting of training LLMs through big investments in required resources. Most of these LLMs are available for consumption through open source or are reasonably priced through their cloud offerings.

With tons of text data being generated every second and enterprises digitizing their documents for ease of access (and sustainability), it’s no wonder that everyone wants their share of value from easily accessible LLMs. The greatest value we can derive from LLMs is where text is either generated or consumed in huge volumes. One such area is fundamental analysis in equity investing, where research analysts comb through tons of text, reading the Red Herring Prospectus (RHP), annual reports, earnings call transcripts, online articles, credit reports, other analysts’ views and more to study the worthiness of a company before investing in it. Because this work is laborious, time consuming and vulnerable to human misses, we decided to attack it using LLMs.

We chose earnings call transcripts as the source for our experiment for two reasons. First, management provides the business updates on these conference calls, so we hear the facts straight from the horse’s mouth. Second, management answers questions raised by individual and institutional investors, which makes them address investor-specific concerns, leaving deeper and more useful insights available for mining.

We set out to find headwinds (factors hindering the business) and tailwinds (factors helping the business) using LLMs. We used Python 3.10 and Jupyter Notebook for our experiments. Let’s begin.

Install all required libraries

!pip install langchain
!pip install openai
!pip install pypdf
!pip install tiktoken
!pip install chromadb
!pip install lark        

Initialize all library references

import os
import openai
import sys

sys.path.append('../..')

# Read the OpenAI API key from an environment variable
openai.api_key = os.environ['OPENAI_API_KEY']

Load the PDF documents

We chose the Q1 FY24 earnings call transcripts of Gujarat Fluorochemicals Ltd. and Hero Motocorp Ltd. We used LangChain to get our answers.

This step may take some time to execute, depending on the size of documents.

from langchain.document_loaders import PyPDFLoader

# Each loader reads one PDF; every page becomes a separate Document
loaders = [
    PyPDFLoader("../Documents/GFL_Q1_FY24.pdf"),
    PyPDFLoader("../Documents/HeroMoto_Q1_FY24.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

print("Documents loaded successfully!")

Verify if the documents loaded correctly

len(docs)

doc = docs[0]
print(doc.page_content[0:500])        

Each individual page of a PDF becomes a doc in docs.
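
You can also inspect a doc’s metadata to see which file and page it came from (a quick optional check; PyPDFLoader stores the source path and page number).

# Metadata added by PyPDFLoader: source file path and page number
print(docs[0].metadata)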

Split loaded documents into multiple chunks

We split the text on each page of the document into chunks of 1,000 characters with an overlap of 100 characters. Smaller chunks help narrow down the text corpus where an answer is expected, while the overlap preserves context between neighbouring chunks.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are measured in characters
chunk_size = 1000
chunk_overlap = 100

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separators=["\n\n", "\n", "(?<=\. )", " ", ""]
)
print("r_splitter initialized successfully!")

splits = r_splitter.split_documents(docs)
print("docs split successfully!")

Validate the split

len(splits)

splits[23]        
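
As an optional sanity check (a small sketch, not part of the original steps), you can confirm the number of chunks and that no chunk exceeds the configured size:

# No chunk should exceed the configured chunk_size (measured in characters)
print(len(splits))
print(max(len(s.page_content) for s in splits))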

Initialize embeddings

In simple terms, embeddings are numeric representations of text (words, sentences or paragraphs) that preserve the semantics of the text. Embeddings of text with similar meanings will be closer to each other (i.e. the distance between them will be small). The distance between embeddings is measured with different techniques; one of them is called “cosine similarity”.

E.g. Embeddings for “Cat” and “Dog” will be closer than the embeddings for “Cat” and “Tractor”. Similarly, embeddings for “She is a sprinter” and “She runs fast” will have a very high value for cosine similarity.

from langchain.embeddings.openai import OpenAIEmbeddings

# Azure OpenAI settings; replace the placeholders with your own deployment names
openai_api_type = "azure"
openai_api_ver = "2022-12-01"
openai_api_key = openai.api_key
embeddings_deployment_name = "<your_embeddings_deployment_name>"
deployment_name = "<your_deployment_name>"

embedding = OpenAIEmbeddings(deployment=embeddings_deployment_name,
                            openai_api_key=openai_api_key,
                            chunk_size=1,
                            openai_api_version=openai_api_ver,
                            openai_api_type=openai_api_type)

print("embedding initialized successfully!")

Convert text chunks to embeddings and store in vector database

This method will convert all the text chunks (split above) into embeddings and save them in a vector database. Whenever the user asks a question, LangChain will convert the question text into embeddings and fetch the chunks whose embeddings are semantically closest to the embeddings of the question. This way it tries to find an answer in the most relevant part of the document, the one with the highest probability of containing the right answer.

This step may take a long time, depending on the size and count of text chunks. Once this step is complete, you will find a file created in the “VectorDB” folder.

from langchain.vectorstores import Chroma
persist_directory = "../VectorDB"

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

print("vectordb initialized successfully!")        

For all subsequent runs, you may simply initialize vectordb instead of creating the embeddings again, as those are already created and saved in the folder. Use the following command in place of the above.

vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)        
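
Under the hood, the retriever we use in the next step performs a similarity search like the one below. You can run it directly to see which chunks would be handed to the LLM (an optional sanity check):

# Fetch the 3 chunks whose embeddings are closest to the question's embedding
question = "What are the headwinds for Gujarat Fluorochemicals?"
relevant_chunks = vectordb.similarity_search(question, k=3)
for chunk in relevant_chunks:
    print(chunk.metadata, chunk.page_content[:200])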

Use LangChain’s RetrievalQA chain to get answers from the retrieved chunks using the LLM

import json
from langchain.chains import RetrievalQA
from langchain.llms import AzureOpenAI

llm = AzureOpenAI(deployment_name=deployment_name, openai_api_version=openai_api_ver, openai_api_key=openai_api_key, temperature=0)

# Helper method to extract insights from the documents
#
# instruction - the question to be asked
# insight - the insight expected from the result, e.g. "Headwinds", "Tailwinds", etc.
# bool_return_json - True or False
#     If True, the method returns the result as a JSON string, else as a dictionary. Default is True

def extract_contents_without_prompt_engg(instruction, insight, bool_return_json=True):
    dict_all_results = {}
    # Retrieve the 3 most relevant chunks and let the LLM answer from them
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=vectordb.as_retriever(search_kwargs={'k': 3}),
        return_source_documents=True
    )
    result = qa_chain({'query': instruction})
    dict_all_results[insight] = result['result'].split('\n')

    if bool_return_json:
        json_all_results = json.dumps(dict_all_results)
        return json_all_results
    else:
        return dict_all_results

instruction = "What are the headwinds for Gujarat Fluorochemicals?"
insight = "Headwinds"

json_insight = extract_contents_without_prompt_engg(instruction, insight, True)

print(json_insight)        

Result

{"Headwinds": [" The headwinds for Gujarat Fluorochemicals are the macro headwinds impacting their current growth, such as destocking in their major businesses and geographies."]}        

Below is a slightly updated prompt; observe the difference in the result. You may explore further by crafting different prompts using prompt engineering.

instruction = "What are the headwinds for Gujarat Fluorochemicals? Separate each response with '|'."
insight = "Headwinds"

json_insight = extract_contents_without_prompt_engg(instruction, insight, True)

print(json_insight)        

Result

{"Headwinds": [" Destocking|Macro headwinds|PFAS regulation in EU reach"]}        

Let's check the tailwinds for Hero Motocorp.

instruction = "What are the tailwinds for Hero Motocorp? Separate each response with '|'."
insight = "Tailwinds"

json_insight = extract_contents_without_prompt_engg(instruction, insight, True)

print(json_insight)        

Result

{"Tailwinds": [" Increased demand for two-wheelers | Growing economy | Expansion of product portfolio | Improved customer service"]}        

How we plan to enhance this further

  • Target annual reports that companies publish at the end of each financial year to gather more insights with relevant prompts
  • Use TableGPT or other deep learning approaches to extract tabular data from PDFs
  • Chunk the data logically as per the sections of the document, so a specific section can be targeted to get the response (see the sketch after this list)
  • Fine-tune the LLM with annotated data to get more accurate and deeper insights
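
For the section-wise chunking idea above, a minimal sketch could look like the following. It assumes the transcript uses recognizable section headings such as "Management Discussion" and "Question-and-Answer Session"; the actual heading names vary by company and would need to be adapted.

import re

def split_by_sections(text, headings):
    # Split just before each heading (zero-width lookahead), so every section keeps its heading
    pattern = "|".join(re.escape(h) for h in headings)
    parts = re.split(f"(?=(?:{pattern}))", text)
    return [p.strip() for p in parts if p.strip()]

full_text = "\n".join(doc.page_content for doc in docs)
sections = split_by_sections(full_text, ["Management Discussion", "Question-and-Answer Session"])
print(len(sections))
# Each section could then be passed through r_splitter so retrieval can target, say, only the Q&A section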


We just saw how LLMs, LangChain and prompt engineering can help us gain insights from unstructured text. With better pre-training, fine-tuning and prompt engineering, LLMs can work wonders in other areas that involve large text corpora, like legal documents, insurance underwriting, etc.
