Code Snippet: Parallel LLM Calls
Han Xiang Choong
Senior Customer Architect - APJ @ Elastic | Applied AI/ML | Search Experiences | Delivering Real-World Impact
Problem
Want to use an LLM to process very large datasets (>10^7 documents), and want results as soon as possible. Minimize wall-clock time per document.
Setting
SRC: Client in Singapore (SG)
DST: GPT-4o-Mini on Azure OpenAI, US East deployment
Scenario
List of 20 prompts generated by an LLM for benchmarking purposes.
prompts = [
    "Explain the concept of quantum entanglement to a high school student.",
    "Write a short story about a time traveler who accidentally changes history.",
    "Describe the process of photosynthesis in plants.",
    "Compare and contrast the economic systems of capitalism and socialism.",
    "Provide a step-by-step guide on how to change a car tire.",
    "Analyze the themes in George Orwell's novel '1984'.",
    "Explain the basics of machine learning to a non-technical person.",
    "Describe the impact of social media on modern interpersonal relationships.",
    "Write a persuasive essay on the importance of renewable energy sources.",
    "Summarize the key events of World War II in chronological order.",
    "Explain the concept of blockchain technology and its potential applications.",
    "Describe the process of natural selection in evolution.",
    "Write a dialogue between two characters discussing the ethics of artificial intelligence.",
    "Explain the greenhouse effect and its role in climate change.",
    "Analyze the pros and cons of remote work in the modern economy.",
    "Describe the basic principles of cognitive behavioral therapy.",
    "Explain the concept of supply and demand in economics.",
    "Write a critical review of a famous painting (e.g., Van Gogh's 'Starry Night').",
    "Describe the process of how a bill becomes a law in the United States government.",
    "Explain the basics of computer programming to someone with no prior experience."
]
Solution
The parallel_execute function takes a function, an iterable of inputs, and optional keyword arguments. It submits each input to a thread pool and aggregates the outputs in a list called 'results'. Note that as_completed yields futures in completion order, so the results will not necessarily match the order of the inputs.
import os
import traceback
from concurrent.futures import ThreadPoolExecutor, as_completed

def parallel_execute(func, iterable, max_workers=10, **kwargs):
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit one task per input item, passing kwargs through to func.
        future_to_item = {executor.submit(func, item, **kwargs): item for item in iterable}
        # Collect results as the futures finish (completion order, not input order).
        for future in as_completed(future_to_item):
            try:
                results.append(future.result())
            except Exception:
                # Log the failure and keep going; failed items are simply skipped.
                traceback.print_exc()
    return results
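As noted above, as_completed returns outputs in completion order rather than input order. If you need the outputs aligned with the prompts, a minimal order-preserving sketch using executor.map could look like this (parallel_execute_ordered is a hypothetical helper, not part of the original snippet):

def parallel_execute_ordered(func, iterable, max_workers=10, **kwargs):
    # executor.map preserves input order. Note that if any item raises,
    # the exception surfaces when its result is consumed by list().
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(lambda item: func(item, **kwargs), iterable))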
LLM Class:
from openai import AzureOpenAI

class AzureOpenAIClient:
    def __init__(self):
        # Credentials and endpoint are read from environment variables.
        self.client = AzureOpenAI(
            api_key=os.environ.get("AZURE_OPENAI_KEY_1"),
            api_version="2024-06-01",
            azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT")
        )

    def generate(self, prompt, model="gpt-4o-mini", system_prompt=""):
        response = self.client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": prompt}
            ],
            max_tokens=4096
        )
        return response.choices[0].message.content
Run:
LLM = AzureOpenAIClient()
results = parallel_execute(
    LLM.generate,
    prompts,
    max_workers=10,
    model="gpt-4o-mini",
    system_prompt=""
)
Results (20 prompts, no system prompt):
Naive Sequential: 112.3 seconds
Parallel, 5 workers: 22.9 seconds
Parallel, 10 workers: 14.8 seconds
Parallel, 20 workers: 8.5 seconds
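For completeness, the wall-clock comparison can be reproduced with a simple timing harness. Below is a minimal sketch (not the original benchmarking code), assuming the LLM client and parallel_execute defined above:

import time

def time_it(label, fn):
    # Measure and report wall-clock time for a single run.
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.1f} seconds")

time_it("Naive Sequential", lambda: [LLM.generate(p) for p in prompts])
time_it("Parallel, 10 workers", lambda: parallel_execute(LLM.generate, prompts, max_workers=10))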