How To Summarize Public Opinion Using RAG AI
Keith McNulty
Leader in Technology, Science and Analytics | Mathematician, Statistician and Psychometrician | Author and Teacher | Coder, Engineer, Architect
Having now spent almost two years being exposed to the new generation of generative models (starting with ChatGPT), we are starting to accept that these have limited value to us on their own, without being enhanced with other information. It’s becoming clearer that the value lies in integrating large amounts of proprietary information with the natural language generation capabilities of these models, giving us an effective and efficient way to obtain digestible summaries of that information.
Retrieval Augmented Generation (RAG) is a simple and inexpensive architecture that is increasingly being used to experiment with improving the quality of information returned by large language models. Described simply, proprietary documents are stored in a secure vector embedding database. Instead of a prompt (question) being sent directly to an LLM, the prompt is first compared with the documents in the proprietary database, and a number of closely matching documents are retrieved. A new prompt is then constructed from the original question and the retrieved documents, and is sent to the LLM to generate an answer to the question based on the proprietary information. A simple RAG architecture can be seen in this diagram, and you can see my previous article where I used a RAG architecture to build a simple Q&A engine around my statistics textbook.
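To make the flow concrete, here is a minimal sketch of the retrieve, augment and generate steps in plain Python. It is only an illustration under loud assumptions: the word-overlap score stands in for a real embedding comparison, the `generate` argument stands in for a real LLM call, and the function names and sample documents are all hypothetical.

```python
from typing import Callable, List


def overlap_score(question: str, document: str) -> float:
    """Toy relevance score: fraction of question words that appear in the document."""
    q_words = set(question.lower().split())
    d_words = set(document.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


def retrieve(question: str, documents: List[str], k: int = 2) -> List[str]:
    """Return the k documents with the highest toy relevance score."""
    ranked = sorted(documents, key=lambda d: overlap_score(question, d), reverse=True)
    return ranked[:k]


def rag_answer(question: str, documents: List[str],
               generate: Callable[[str], str], k: int = 2) -> str:
    """Retrieve matching documents, build an augmented prompt, and pass it to the model."""
    context = "\n".join(retrieve(question, documents, k))
    prompt = f"Question: {question}\nAnswer using only these documents:\n{context}"
    return generate(prompt)


docs = [
    "Regression models estimate relationships between variables.",
    "Paris is the capital of France.",
    "Logistic regression models predict binary outcomes.",
]


def echo_llm(prompt: str) -> str:
    """Stand-in for an LLM call: it simply returns the prompt it was given."""
    return prompt


answer = rag_answer("What do regression models do?", docs, echo_llm)
print(answer)
```

In a real pipeline the overlap score is replaced by a vector database lookup and `echo_llm` by a call to a hosted or local model, which is what the rest of this tutorial builds.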
One potentially valuable RAG use case is where varied opinion provided in a large number of documents needs to be summarized into a digestible form. For example, in politics, the ability to ask a question on a specific topic and obtain a summarized view of what the public or a voter group thinks about it could be very valuable. In business settings, many organizations struggle with how to make productive use of the large amount of text that comes in from customer or employee surveys. The ability to ask a question and get a summarized response based on this text would be a very valuable source of targeted intelligence.
Getting a summary of what New York Times readers think
In this tutorial I will demonstrate how to build this sort of application by taking a large data set of reader comments to articles in the New York Times during 2017 and 2018. I will embed all these comments into a vector database and then build a pipeline to allow us to ask what NY Times readers think about a specific topic, drawing on all these comments and using a variety of options for LLMs to summarize them.
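The data I am going to use is publicly available on Kaggle here, and contains over 2.1 million comments left on articles by readers in various months in 2017 and 2018. The steps I will go through are as follows:
1. Getting the data from Kaggle and preparing it for loading into a Vector DB
2. Creating a Vector DB to store all our reader comments
3. Creating a function that executes a RAG pipeline
4. Building a front-end to this application using Streamlit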
If you don’t want to read through all the details of this tutorial, you can find all my code here — go forth and build!
Step 1: Getting the data from Kaggle and preparing it for loading into a Vector DB
The first step is basic data processing. There are a lot of files, so to make this easy, we will get the data files we need via the Kaggle API and then reduce them to just the files that contain user comments. To do this using the code below, you’ll need to obtain your Kaggle API key via your Kaggle account and store it in a kaggle.json file in your project root.
import pandas as pd
import os
import glob
import opendatasets as od
# dataset URL
dataset = 'https://www.kaggle.com/datasets/aashita/nyt-comments/'
# Using opendatasets let's download the data sets (480 MB)
od.download(dataset)
# downloaded folder contains many article csv files - we are not interested in them
# remove article csvs to leave just comments csvs
for f in glob.glob("nyt-comments/Article*"):
    os.remove(f)
# load all 2017 comment csv files into one single dataframe
# Get a list of all CSV files in a directory
csv_files_2017 = glob.glob('nyt-comments/*2017.csv')
Next, we will create a single Pandas DataFrame containing only the comment text from all of the comment files, along with the year in which each comment was made (we will use the year as metadata to filter on later). Finally, we will save this large Pandas DataFrame as a pickle file so that we can easily load it in the next step:
# Create an empty dataframe to store the combined data
combined_df_2017 = pd.DataFrame()
# Loop through each CSV file and append its contents to the combined dataframe
for csv_file in csv_files_2017:
    df = pd.read_csv(csv_file)
    combined_df_2017 = pd.concat([combined_df_2017, df])
# add a column with year
combined_df_2017.loc[:, "year"] = 2017
# select only year and comment body
comments_2017 = combined_df_2017[["year", "commentBody"]]
# repeat for 2018 comments
csv_files_2018 = glob.glob('nyt-comments/*2018.csv')
combined_df_2018 = pd.DataFrame()
for csv_file in csv_files_2018:
    df = pd.read_csv(csv_file)
    combined_df_2018 = pd.concat([combined_df_2018, df])
combined_df_2018.loc[:, "year"] = 2018
comments_2018 = combined_df_2018[["year", "commentBody"]]
# combine into single df with year and comment
comments = pd.concat([comments_2017, comments_2018])
# write to pickle
comments.to_pickle("comments.pickle")
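The 2017 and 2018 blocks above repeat the same logic, so one possible refactoring is a single helper that loops over the years. This is a sketch rather than the code from my repository: the `load_comments` name is hypothetical, and it assumes the article CSVs have already been removed so that `*<year>.csv` matches only comment files.

```python
import glob
import os

import pandas as pd


def load_comments(folder: str, years=(2017, 2018)) -> pd.DataFrame:
    """Read every comment CSV for the given years and keep only year and commentBody."""
    frames = []
    for year in years:
        # file names end in the year, e.g. <something>2017.csv
        for path in sorted(glob.glob(os.path.join(folder, f"*{year}.csv"))):
            df = pd.read_csv(path, usecols=["commentBody"])
            df["year"] = year
            frames.append(df)
    # pd.concat raises a ValueError if no files were found
    return pd.concat(frames, ignore_index=True)[["year", "commentBody"]]
```

With this helper, the whole step reduces to `load_comments("nyt-comments").to_pickle("comments.pickle")`. Collecting the frames in a list and concatenating once also avoids copying the growing DataFrame on every loop iteration.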
Step 2: Creating a Vector DB to store all our reader comments
Now we will set up a Vector DB and load our comments into it. A Vector DB uses a selected embedding model to store a document as a vector embedding. It will also store the text and any associated metadata.
In this case I will use Chroma DB to set up a vector database on my local machine. This is only recommended if you have a powerful machine (I use a MacBook M3 Max); otherwise you should consider setting up a ChromaDB cloud database for this experiment, or using a fully managed cloud vector DB such as Pinecone.
First, we load the data from our prior pickle file and remove empty or very short comments from the dataset. Then we load the comments as langchain document objects, which will include the year of the comment as a metadata field.
from langchain_community.document_loaders import DataFrameLoader
import pandas as pd
import numpy as np
import chromadb
from chromadb.utils import embedding_functions
from chromadb.utils.batch_utils import create_batches
import uuid
# load comments df and filter out short or empty comments
comments = pd.read_pickle("comments.pickle")
comments = comments[comments['commentBody'].notnull()]
comments['COUNT'] = [np.char.count(comment, ' ') for comment in comments['commentBody']]
longer_comments = comments[comments['COUNT'] >= 40]
# load into langchain document format
loader = DataFrameLoader(longer_comments, page_content_column="commentBody")
docs = loader.load()
Next we set up a Chroma DB in our local project directory. We select a model to convert comments into embeddings. We then define our document collection, giving each comment a unique ID, including the year as a metadata field, and using cosine similarity as the distance function that determines which documents are most relevant to our query. On my advanced machine, it took several hours to embed and load all this into ChromaDB. You can make this simpler and quicker by further filtering the documents that you load (for example, you can restrict the comments to a specific month).
# set up the ChromaDB
CHROMA_DATA_PATH = "./chroma_data/"
EMBED_MODEL = "all-MiniLM-L6-v2"
COLLECTION_NAME = "article_comments"
client = chromadb.PersistentClient(path=CHROMA_DATA_PATH)
# uncomment in case docs have already been written
# client.delete_collection(COLLECTION_NAME)
# enable the DB using Cosine Similarity as the distance metric
embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=EMBED_MODEL
)
# create document collection in ChromaDB
collection = client.create_collection(
    name=COLLECTION_NAME,
    embedding_function=embedding_func,
    metadata={"hnsw:space": "cosine"},
)
# chromadb has a batch size limit for writing
# create batches with year as metadata, and random UUID
batches = create_batches(
    api=client,
    ids=[str(uuid.uuid4()) for _ in range(len(docs))],
    documents=[doc.page_content for doc in docs],
    metadatas=[{'year': doc.metadata['year']} for doc in docs]
)
# write batches to chromaDB - over 2M comments in batches of ~40K
# for each comment, chromaDB will store the comment, year, and embedding
# one time write - this will take a while
# if you are impatient you can cut down the number of comments (eg choose a specific month)
# each batch is a tuple ordered as (ids, embeddings, metadatas, documents)
for batch in batches:
    print(f"Adding batch of size {len(batch[0])}")
    collection.add(ids=batch[0],
                   documents=batch[3],
                   metadatas=batch[2])
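The batching in the write step above can be pictured with a few lines of plain Python. This is a stand-in to show the idea, not ChromaDB's own implementation: `make_batches` is a hypothetical helper, its tuple order (ids, documents, metadatas) is my own choice, and the batch size of 2 is only for illustration.

```python
from typing import Dict, Iterator, List, Tuple


def make_batches(ids: List[str], documents: List[str], metadatas: List[Dict],
                 max_batch_size: int) -> Iterator[Tuple[List[str], List[str], List[Dict]]]:
    """Yield aligned slices of ids, documents and metadatas, each at most max_batch_size long."""
    for start in range(0, len(ids), max_batch_size):
        end = start + max_batch_size
        yield ids[start:end], documents[start:end], metadatas[start:end]


example_batches = list(make_batches(
    ids=["a", "b", "c", "d", "e"],
    documents=["d1", "d2", "d3", "d4", "d5"],
    metadatas=[{"year": 2017}] * 5,
    max_batch_size=2,
))
print([len(batch[0]) for batch in example_batches])  # prints [2, 2, 1]
```

The important property is that the three lists are sliced at the same positions, so each ID stays attached to its own comment and year.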
This created a ChromaDB that was about 12GB on my machine. Once this is all done, we can test by sending a query to our ChromaDB. The DB will use the same embedding model to embed the query and then use cosine similarity to find the closest matching user comments in the database.
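To show what that lookup amounts to, here is a minimal sketch of a filtered nearest-neighbour search using cosine distance (one minus cosine similarity). Everything in it is illustrative: the two-dimensional vectors are made up, real embeddings have hundreds of dimensions, and a real vector DB uses an approximate index rather than sorting every record.

```python
import math
from typing import Dict, List, Optional


def cosine_distance(a: List[float], b: List[float]) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def query(records: List[Dict], query_vec: List[float], n_results: int,
          where: Optional[Dict] = None) -> List[str]:
    """Keep records whose metadata matches `where`, then return the closest texts."""
    candidates = [
        r for r in records
        if where is None or all(r["metadata"].get(k) == v for k, v in where.items())
    ]
    candidates.sort(key=lambda r: cosine_distance(r["embedding"], query_vec))
    return [r["text"] for r in candidates[:n_results]]


records = [
    {"text": "comment A", "embedding": [1.0, 0.0], "metadata": {"year": 2017}},
    {"text": "comment B", "embedding": [0.8, 0.6], "metadata": {"year": 2017}},
    {"text": "comment C", "embedding": [1.0, 0.1], "metadata": {"year": 2018}},
    {"text": "comment D", "embedding": [0.0, 1.0], "metadata": {"year": 2017}},
]
closest = query(records, [1.0, 0.0], n_results=2, where={"year": 2017})
print(closest)  # prints ['comment A', 'comment B']
```

Comment C is very close to the query vector but is excluded by the year filter, which is the same effect the `where` argument has in a ChromaDB query.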
In this test query, we ask a question about user opinions on US foreign policy towards North Korea, and we request the ten closest comments made in 2017.
# test data
results = collection.query(
    query_texts=["What are readers opinions on US foreign policy towards North Korea?"],
    n_results=10,
    include=['documents'],
    where={'year': 2017}
)
# convert results to nice dataframe
results_df = pd.DataFrame(results['documents']).transpose()
results_df.columns = ['Comment']
print(results_df)
This produces the following (truncated) comments:
Step 3: Creating a function that executes a RAG pipeline
Now that all our readers’ comments are loaded into a vector DB, we just have to write a pipeline that takes a question, collects matching comments from the database, and then constructs a prompt to send to an LLM of our choice to receive a summarized response.
To make this easy, let’s decouple the prompt construction from the LLM interaction. In this function we take a set of documents which will be user comments, and we take a question (which is the original prompt). We number the comments and then construct a prompt into which we will insert the original question and the numbered comments — this function will output the constructed prompt to be sent to the LLM.
# helper function for prompt construction
def construct_prompt(docs: dict, question: str) -> str:
    # convert the docs into a numbered list of comments
    results_df = pd.DataFrame(docs['documents']).transpose()
    results_df.columns = ['Comment']
    results_df['ComNum'] = [str(i) for i in range(1, len(results_df) + 1)]
    results_df['Numbered Comments'] = results_df['ComNum'] + '. ' + results_df['Comment']
    # Collect the results in a context
    context = "\n".join(results_df['Numbered Comments'])
    # construct prompt
    prompt = f"""
Answer the following question: {question}.
Refer only to the following numbered list of comments from NY Times readers when answering: {context}.
Check each numbered comment very carefully and ignore it if it does not contain language that is a close match to the original question.
Provide as much information as possible in the summary, subject to the conditions already given.
Begin your answer with 'Based on the responses from selected NY Times readers', and try to give a sense of majority and minority opinions on the topic, but only if there is an identifiable majority opinion.
If there is not enough information provided to give a summarized opinion, indicate that this is the case.
"""
    return prompt
Now we can write a function for the overall pipeline, which will take a question, an LLM client, a database client, as well as parameters to define the number of matching documents requested and any required filters. The function will then find the matching comments from the database, use our construct_prompt() function to combine everything and send it to the LLM client to obtain the final summarized result.
Here is an example of a function which calls GPT-4. This function will be limited by GPT-4’s context window of 8192 tokens, meaning that selecting too many comments may break the function. Note also that I have hidden my OpenAI credentials as environment variables in this code, and I suggest you do the same, especially if you intend to share your code:
# packages
from dotenv import load_dotenv
import chromadb
import openai
import os
# load env variables
load_dotenv()
openai_base_url = os.getenv('OPENAI_BASE_URL')
openai_api_key = os.getenv("LLM_TOKEN")
# chromadb location
CHROMA_DATA_PATH = "../chroma_data/"
collection_db = "article_comments"
# Initialize an OpenAI client
client = openai.OpenAI(api_key=openai_api_key, base_url=openai_base_url)
# initialize a chroma client
chroma_client = chromadb.PersistentClient(path=CHROMA_DATA_PATH)
# function to execute RAG pipeline using GPT-4
# collection is the name of the ChromaDB collection to search
def ask_question_openai(question: str, client=client,
                        collection: str = collection_db,
                        n_docs: int = 30, filters: dict = None) -> None:
    # Find close documents in chromadb
    collection = chroma_client.get_collection(collection)
    results = collection.query(
        query_texts=[question],
        n_results=n_docs,
        # pass None rather than an empty dict when no filter is given
        where=filters or None
    )
    prompt = construct_prompt(results, question)
    # send prompt to GPT-4
    chat_completion = client.chat.completions.create(
        messages=[{
            "role": "user",
            "content": prompt,
        }],
        model="gpt-4",
    )
    # display response
    print(chat_completion.choices[0].message.content)
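Because the context window limits how many comments fit into one prompt, one option is to trim the retrieved comments to a token budget before building the prompt. The sketch below is an assumption-laden illustration, not part of my pipeline: `trim_to_budget` is a hypothetical helper, and the figure of four characters per token is only a rough rule of thumb for English text. A real tokenizer would give exact counts.

```python
from typing import List


def trim_to_budget(comments: List[str], max_tokens: int,
                   chars_per_token: float = 4.0) -> List[str]:
    """Keep comments in ranked order until the estimated token budget is used up."""
    kept: List[str] = []
    used = 0.0
    for comment in comments:
        cost = len(comment) / chars_per_token  # rough token estimate, not an exact count
        if used + cost > max_tokens:
            break
        kept.append(comment)
        used += cost
    return kept


# three comments of roughly 100 tokens each against a budget of 250 tokens
sample = ["a" * 400, "b" * 400, "c" * 400]
print(len(trim_to_budget(sample, max_tokens=250)))  # prints 2
```

Since the database returns comments ordered from closest to furthest match, cutting from the end drops the least relevant comments first. The budget should also leave room for the prompt instructions and for the model's answer.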
Let’s test our function by asking it to summarize the top 30 matching user comments related to our original question on US foreign policy towards North Korea:
# test function
ask_question_openai("What do readers think about US foreign policy towards North Korea?")
Based on the responses from selected NY Times readers, there appear to be diverse views on US foreign policy towards North Korea, with no identifiable majority opinion. Some readers argue that existing U.S foreign policy towards North Korea is ineffective and shortsighted, claiming that current and past administrations have handled the situation poorly (Comments 7, 8, 10, 16, 20, 23). Some readers believe it’s time to reassess that policy and emphasize diplomacy rather than military action (Comments 2, 15, 29).
Another group of readers, however, supports a more assertive stance against North Korea, suggesting that it's wise to confront North Korea's nuclear aspirations now before they potentially become a greater threat (Comments 12, 13, 27). There's also a sentiment among some readers that the South Koreans, since they are most at risk, should take the lead in dealing with North Korea (Comments 6, 22).
Other comments highlight a lack of trust in the Trump administration's handling of the situation, saying it has either been inconsistent (Comments 18, 21) or has contributed to escalating tensions (Comments 3, 4). There's also a perception that North Korea's recent overtures are not genuine and are merely a tactical shift (Comments 24, 26).
In summary, the reader opinions are diverse and reflect a range of viewpoints on how the U.S should approach North Korea.
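The answer above cites comment numbers in parentheses, for example "(Comments 2, 15, 29)". If you want to trace a claim back to the retrieved comments, a small parser can pull those numbers out. This is a sketch that assumes the model keeps using that parenthesized format, which is not guaranteed; `cited_comment_numbers` is a hypothetical helper and the sample summary is made up.

```python
import re
from typing import Set


def cited_comment_numbers(summary: str) -> Set[int]:
    """Collect every number cited in the form '(Comment 3)' or '(Comments 2, 15, 29)'."""
    cited: Set[int] = set()
    for group in re.findall(r"\(Comments?\s+([\d,\s]+)\)", summary):
        cited.update(int(n) for n in re.findall(r"\d+", group))
    return cited


sample_summary = (
    "Some readers favor diplomacy (Comments 2, 15, 29). "
    "Others want a firmer stance (Comment 12)."
)
print(sorted(cited_comment_numbers(sample_summary)))  # prints [2, 12, 15, 29]
```

Comparing the cited numbers with the range 1 to n_docs is a quick sanity check that the model has not cited a comment that was never supplied.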
Similarly we can set up some smaller local language models on our machine to try the same RAG pipeline (see the corresponding code for this here). Note that we do not have to concern ourselves with API token limits here, although each local model still has its own context window. For example, here’s the response I get asking the same question to Llama3’s 70B parameter model using ollama on my machine, asking it to consider 100 matching comments:
# test function
ask_question_local("What do readers think about US foreign policy towards North Korea?", llm = llama3, n_docs = 100)
Based on the responses from selected NY Times readers, many are skeptical about the effectiveness of US foreign policy towards North Korea. A significant number of respondents believe that North Korea's leadership will never willingly give up its nuclear weapons or affiliated programs, citing the regime's existential need for a nuclear capability and its history of belligerence.
Some readers suggest that the onus is on the rest of the civilized world to contain North Korea through means such as developing anti-missile systems, squeezing Pyongyang financially, humiliating its leaders with sanctions, and developing cyber capabilities. A few respondents also propose more drastic measures, including assassinating NK scientists associated with the missile program and sending Navy SEALs to find secret tunnels and labs.
However, a minority opinion suggests that South Korea appears to be taking a more level-headed approach to negotiations, with some readers praising the country's leadership for being the "adults" in the situation. A few respondents also propose a slower and more steady pace of negotiation and engagement, citing the effectiveness of economic sanctions in weakening Kim Jong-un's power.
Overall, there is no clear majority opinion on the most effective approach to US foreign policy towards North Korea, but a significant number of respondents express skepticism about the regime's willingness to negotiate in good faith.
Step 4: Building a front-end to this application using Streamlit
Now that we’ve done all the hard work of building the back-end RAG pipeline, it’s pretty easy to use a bit of Streamlit code to serve this up with a simple front end. I won’t go into intricate detail here, but I have packaged all my prior clients, objects and functions into a module called ask_question, which you can find here.
Then I use Streamlit to serve up a simple application. The application has some options, including a choice of language model to use. I’ve added in Anthropic’s Claude 3 Opus as an additional LLM for good measure here. You can find the Streamlit code here.
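For readers who want a feel for the shape of such an app without opening the repository, here is a minimal sketch. It is not my actual Streamlit code: the widget labels, the `build_filters` helper and the `ask` argument are all hypothetical, and `ask` is assumed to be a pipeline function that returns the answer as a string instead of printing it.

```python
from typing import Callable, Dict, Optional


def build_filters(year_choice: str) -> Optional[Dict[str, int]]:
    """Translate the year selection into a metadata filter, or None for no filter."""
    return None if year_choice == "All years" else {"year": int(year_choice)}


def main(ask: Callable[..., str]) -> None:
    """Render the page; `ask` is whichever function runs the RAG pipeline."""
    # imported here so that build_filters can be used without Streamlit installed
    import streamlit as st

    st.title("What do NY Times readers think?")
    question = st.text_input("Your question")
    year_choice = st.selectbox("Year of comments", ["All years", "2017", "2018"])
    n_docs = st.slider("Number of comments to retrieve",
                       min_value=10, max_value=200, value=30)
    if st.button("Ask") and question:
        st.write(ask(question, n_docs=n_docs, filters=build_filters(year_choice)))
```

To try it, save the sketch as a script, end the file with a call such as `main(my_pipeline_function)`, and launch it with `streamlit run` followed by the file name. A model selector could be added in the same way as the year selector.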
Let’s ask Claude our question on North Korea, given that its very large context window allows us to use a lot of user comments:
Let’s ask Claude what readers think about the music of Taylor Swift:
What do you think of this kind of RAG application to summarize the opinions of readers, the general public, voters, customers or employees? Feel free to comment!
I/O Psychologist- People analytics consultant for people and talent data
4 months ago: I really like how the first summary includes the comment number as a source citation. How was that done / is it reproducible? One area I’m finding in RAG for survey comments is that most topical responses may be very simple and few people leave detailed responses (Example most people just say compensation and not the reasons why they said compensation). As a result summarization seems to over rotate on these fewer but more detailed responses in the summary. So it seems like it’s a broad sentiment but it’s only informed by one comment. I think this source citation technique could help make that clear when reading a summary.
Analytical Leader, Creative Catalyst
4 months ago: Bravo! How hard would this be to implement on say, YouTube comments? Existing RAG only seems to grok transcripts.