Equity fundamental analysis with LLMs
Kaustubh Anwekar
Product Development | Artificial Intelligence | Machine Learning | Director - Product Engineering at LTIMindtree NxT
There is a great buzz around generative AI these days. The introduction of ChatGPT and its meteoric adoption have made LLMs (Large Language Models) the flavour of the season. Everyone is looking to ride the GenAI wave to reap benefits. Possible business benefits lie in automating mundane tasks, generating marketing collateral, drafting automated emails and building chatbots that talk to humans as real humans do. All of this reduces manual labour and hence yields gains in productivity and dollars.
LLMs are language models that are trained on a very large corpus of text data, usually scraped from the internet. Although they are primarily trained to be generative by predicting the next probable word (like “star” after “Twinkle twinkle little”), LLMs are also great at other natural language processing tasks like text classification, sentiment analysis, question answering and summarization. LLMs don’t just memorize the most common text they are trained on; they also learn patterns and semantics to understand text.
The Microsofts, Amazons, Metas and Googles of the world have already done the heavy lifting of training LLMs through big investments in the required resources. Most of these LLMs are available for consumption as open source or are reasonably priced through their cloud offerings.
With tons of text data getting generated every second and enterprises digitizing their documents for ease of access (and sustainability), it’s no wonder that everyone wants their share of value from easily accessible LLMs. The greatest value we can derive from LLMs is where text is either generated or consumed in huge volumes. One such area is fundamental analysis in equity investing, where research analysts comb through tons of text, reading the Red Herring Prospectus (RHP), annual reports, earnings call transcripts, online articles, credit reports, other analysts’ views and what not, to study the worthiness of a company before investing in it. Because this work is laborious, time consuming and vulnerable to human misses, we decided to attack this area using LLMs.
We chose earnings call transcripts as our experimental source for two reasons. First, management provides updates during these conference calls, so we hear facts straight from the horse’s mouth. Second, management answers questions raised by individual and institutional investors, which makes them address more investor-specific concerns, so deeper and more useful insights are available for mining.
We tried finding headwinds (factors hindering business) and tailwinds (factors encouraging business) using LLMs. We used Python 3.10 and Jupyter Notebook for our experiments. Let’s begin.
Install all required libraries
!pip install langchain
!pip install openai
!pip install pypdf
!pip install tiktoken
!pip install chromadb
!pip install lark
Initialize all library references
import os
import openai
import sys
sys.path.append('../..')
openai.api_key = os.environ['OPENAI_API_KEY']
Load the pdf documents
We chose the Q1 FY24 earnings call transcripts of Gujarat Fluorochemicals Ltd. and Hero Motocorp Ltd. We used LangChain to get our answers.
This step may take some time to execute, depending on the size of documents.
from langchain.document_loaders import PyPDFLoader
loaders = [
    PyPDFLoader("../Documents/GFL_Q1_FY24.pdf"),
    PyPDFLoader("../Documents/HeroMoto_Q1_FY24.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
print("Documents loaded successfully!")
Verify if the documents loaded correctly
len(docs)
doc = docs[0]
print(doc.page_content[0:500])
Each individual page in the PDF is a doc in docs.
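To see that each page is indeed a separate document, you can also inspect a document's metadata (a minimal sketch; the source path and zero-based page number are the fields PyPDFLoader typically populates):
# Inspect metadata of the first loaded document
print(doc.metadata)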
Split loaded documents into multiple chunks
We split the text on each page of the document into chunks of 1,000 characters with an overlap of 100 characters (the chunk_size and chunk_overlap of RecursiveCharacterTextSplitter are measured in characters, not words). Smaller chunks help narrow the text corpus in which an answer is expected. The overlap allows context to be maintained between neighbouring chunks.
from langchain.text_splitter import RecursiveCharacterTextSplitter
chunk_size = 1000
chunk_overlap = 100
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separators=["\n\n", "\n", "(?<=\. )", " ", ""]
)
print("r_splitter initialized successfully!")
splits = r_splitter.split_documents(docs)
print("docs split successfully!")
Validate the split
len(splits)
splits[23]
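As a quick sanity check (a minimal sketch, not part of the original flow), you can confirm that the chunks respect the 1,000-character limit configured above:
# Length of one chunk and the largest chunk across all splits
print(len(splits[23].page_content))
print(max(len(s.page_content) for s in splits))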
Initialize embeddings
In simple terms, embeddings are numeric representations of text (words, sentences or paragraphs) that preserve the semantics of the text. Embeddings of text with similar meanings will be closer to each other (i.e. the difference between them will be minimal). The distance between embeddings is measured with different techniques; one of them is called “cosine similarity”.
E.g. embeddings for “Cat” and “Dog” will be closer than the embeddings for “Cat” and “Tractor”. Similarly, embeddings for “She is a sprinter” and “She runs fast” will have a very high cosine similarity.
from langchain.embeddings.openai import OpenAIEmbeddings
openai_api_type = "azure"
openai_api_ver = "2022-12-01"
openai_api_key = openai.api_key
embeddings_deployment_name = "<your_embeddings_deployment_name>"
deployment_name = "<your_deployment_name>"
embedding = OpenAIEmbeddings(deployment=embeddings_deployment_name,
                             openai_api_key=openai_api_key,
                             chunk_size=1,
                             openai_api_version=openai_api_ver,
                             openai_api_type=openai_api_type)
print("embedding initialized successfully!")
Convert text chunks to embeddings and store in vector database
This method converts all the text chunks (split above) into embeddings and saves them in the vector database. Whenever a user asks a question, LangChain converts the question text into embeddings and fetches the embeddings that are semantically closest to those of the question. This way, it tries to find an answer in the most relevant part of the document, which has the highest probability of containing the right answer.
This step may take a long time, depending on the size and count of text chunks. Once this step is complete, you will find a file created in the “VectorDB” folder.
from langchain.vectorstores import Chroma
persist_directory = "../VectorDB"
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)
print("vectordb initialized successfully!")
For all subsequent runs, you may simply initialize the vectordb instead of creating the embeddings again, as those are already created and saved in the folder. Use the following command in place of the one above.
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)
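Before wiring up the QA chain, you can observe the retrieval step on its own. This is a minimal sketch of how the vector store surfaces the chunks semantically closest to a question; the question text is just an example.
# Fetch the 3 chunks semantically closest to the question
question = "What are the headwinds for Gujarat Fluorochemicals?"
relevant_chunks = vectordb.similarity_search(question, k=3)
for chunk in relevant_chunks:
    print(chunk.metadata, chunk.page_content[:200])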
Use LangChain’s RetrievalQA chain to get results from retrieved chunks using LLM
import json
from langchain.chains import RetrievalQA
from langchain.llms import AzureOpenAI
llm = AzureOpenAI(deployment_name=deployment_name, openai_api_version=openai_api_ver, openai_api_key=openai_api_key, temperature=0)
# Helper method to extract contents from the file
# instruction - Should contain the question to be asked
# insight - Should indicate what insight is expected from the result - e.g. "Headwind", "Tailwind", etc.
# bool_return_json - Should be True or False
# If True, the method returns the result in JSON format, else returns it as a dictionary. Default is True
def extract_contents_without_prompt_engg(instruction, insight, bool_return_json=True):
    dict_all_results = {}
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=vectordb.as_retriever(search_kwargs={'k': 3}),
        return_source_documents=True
    )
    result = qa_chain({'query': instruction})
    dict_all_results[insight] = result['result'].split('\n')
    if bool_return_json:
        json_all_results = json.dumps(dict_all_results)
        return json_all_results
    else:
        return dict_all_results
instruction = "What are the headwinds for Gujarat Fluorochemicals?"
insight = "Headwinds"
json_insight = extract_contents_without_prompt_engg(instruction, insight, True)
print(json_insight)
Result
{"Headwinds": [" The headwinds for Gujarat Fluorochemicals are the macro headwinds impacting their current growth, such as destocking in their major businesses and geographies."]}
Below is a slightly updated prompt; observe the difference in the result. You may explore different results by crafting different prompts using prompt engineering.
instruction = "What are the headwinds for Gujarat Fluorochemicals? Separate each response with '|'."
insight = "Headwinds"
json_insight = extract_contents_without_prompt_engg(instruction, insight, True)
print(json_insight)
Result
{"Headwinds": [" Destocking|Macro headwinds|PFAS regulation in EU reach"]}
Let's check the tailwinds for Hero Motocorp.
instruction = "What are the tailwinds for Hero Motocorp? Separate each response with '|'."
insight = "Tailwinds"
json_insight = extract_contents_without_prompt_engg(instruction, insight, True)
print(json_insight)
Result
{"Tailwinds": [" Increased demand for two-wheelers | Growing economy | Expansion of product portfolio | Improved customer service"]}
How we plan to enhance this further
We just saw how LLMs, LangChain and prompt engineering can help us gain insights from unstructured text. With better pre-training, fine-tuning and prompt engineering, LLMs can work wonders on other operations that involve large text corpora, like legal documents, insurance underwriting, etc.