Scaling LLM-Powered Data Labeling: Architecture Lessons from Labeling 5K Work Emails with Llama3-70b
Akshay Dongare
M.S. in Computer Science at NC State | Generative AI Engineer at Harvard | Ex-Mercor & Tech Mahindra | TensorFlow Certified Developer | Google Cloud Certified Cloud Digital Leader
After successfully labeling 5,028 Enron emails as work/personal using the open-source large language model Llama3-70b with Groq's LPU-powered inference, I want to share battle-tested code patterns AND discuss why VCs are pouring $4.2B into AI data labeling (Scale AI $1B+, Snorkel $135M)!
While working on The Professional Filter: Machine Learning Approach to Work Email Detection with Ayush Gala and Vidhisha Kamat, I ran into the labor-intensive and costly process of manual data labeling.
After reviewing some literature on the topic, I turned to LLMs for automated data labeling, implementing 4 key innovations:
1️⃣ Idempotent Processing:
Built a crash-resistant system that tracks progress in a last_index.txt file recording the index of the last processed email; after a crash or restart, processing resumes exactly where it left off instead of starting over, enabling 24/7 unattended operation.
import csv
import os

def populate_llm_prediction_emails_file(api_key):
    # Check if the CSV file already exists to decide on writing the header
    csv_file_path = './data/llm/llm_prediction.csv'
    write_header = not os.path.exists(csv_file_path)
    print("Write header: ", write_header)
    # Load the last processed index if it exists; otherwise start from 0
    last_index = 0
    last_index_path = './tmp/last_index.txt'
    if os.path.exists(last_index_path):
        with open(last_index_path, 'r') as file:
            last_index = int(file.read().strip())
    # Open the CSV file for appending
    with open(csv_file_path, mode='a', newline='') as file:
        writer = csv.writer(file)
        if write_header:
            writer.writerow(['label'])  # Write the header if needed
        for i in range(last_index, len(data)):
            try:
                prompt = data.message[i][:6000]  # Truncate long emails to fit the context window
                response = llama_groq.guarded_groq_call(prompt=prompt, api_key=api_key)
                print(response)
                # Write the validated output directly to the CSV
                writer.writerow([response.validated_output])
                # Save the current index so a restart resumes at the next email
                with open(last_index_path, 'w') as index_file:
                    index_file.write(str(i + 1))
            except Exception as e:
                print(f"Error processing email at index {i}: {e}")
                return False
    return True
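The function assumes two module-level objects that the snippets don't show: data, a pandas DataFrame of Enron emails with a message column, and llama_groq, the wrapper described in the sections below. A minimal sketch of that setup (the file path and key-dict layout are my assumptions, inferred from the calls above):

import os
import pandas as pd
import llama_groq  # hypothetical module (or instance) exposing guarded_groq_call and api_keys

# Assumed layout, inferred from the snippets:
#   data.message[i]             -> raw email text
#   llama_groq.api_keys['groq'] -> list of Groq API keys (see section 2)
data = pd.read_csv('./data/enron_emails.csv')  # hypothetical path
os.makedirs('./data/llm', exist_ok=True)
os.makedirs('./tmp', exist_ok=True)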
2️⃣ API Key Rotation:
Developed an API key cycling system that rotates through 10+ free Groq API keys to work around per-key rate limits, cut costs, and keep the pipeline running continuously:
import time

retry_delay_seconds = 10
api_key_index = 0
while True:
    success = populate_llm_prediction_emails_file(llama_groq.api_keys['groq'][api_key_index])
    if success:
        print("All data processed successfully.")
        break
    else:
        print(f"An error occurred. Retrying with a different API key in {retry_delay_seconds} seconds...")
        time.sleep(retry_delay_seconds)  # Delay before retrying
        # Rotate to the next key; wraps around when the list is exhausted
        api_key_index = (api_key_index + 1) % len(llama_groq.api_keys['groq'])
3️⃣ Custom LLM API Wrapper for Groq:
A thin callable that Guardrails can invoke as its LLM backend; the tight max_tokens budget discourages the model from generating anything beyond the one-character label:
from groq import Groq

def my_custom_groq_api(self, api_key=None, **kwargs):
    """Custom LLM API wrapper for Groq.

    At least one message should be provided.

    Args:
        messages: The chat messages to pass to the Groq API.

    Returns:
        str: The text content of the model's response.
    """
    messages = kwargs.pop("messages", [])
    groq_client = Groq(api_key=api_key)
    response = groq_client.chat.completions.create(
        model="llama3-70b-8192",
        messages=messages,
        max_tokens=len(messages) + 1,  # a few tokens suffice for a single-character label
        **kwargs,
    )
    print(response.choices[0].message.content)
    return response.choices[0].message.content
4️⃣ Prompt Engineering and Validation Guardrails:
Implemented strict output validation that accepts only binary classifications (0 or 1): a system prompt (inspired by the paper "Work Hard, Play Hard: Email Classification on the Avocado and Enron Corpora") encodes the labeling rules, and a regex match on ^[01]$ validates every response, automatically re-asking the LLM whenever an output fails validation:
from guardrails import Guard, OnFailAction
from guardrails.hub import RegexMatch

def guarded_groq_call(self, prompt='', api_key=None):
    """Guards the Groq API call to ensure the output is a binary classification."""
    # Re-ask the LLM whenever the response is anything other than a bare 0 or 1
    guard = Guard().use(
        RegexMatch,
        regex=r"^[01]$",
        on_fail=OnFailAction.REASK,
    )
    validated_response = guard(
        self.my_custom_groq_api,
        messages=[
            {"role": "system", "content": "If a message is about a social event inside the company, such as celebrating a new baby of an employee, or a career promotion, it belongs to the first category 1 (work-related). If a message is about a social event outside the company but still related to the company, such as a picnic (usually family members are invited), it belongs to the second category 0 (non-work-related). If a message is about a social event which is not related to the company such as a charity but company employees are encouraged to participate, it belongs to the second category 0 (non-work-related). If a message is too short to determine its category (or even empty), it should have the same category as the message it is responding to, or the message it is forwarding. If a message is ambiguous, try to read other messages in the thread to clarify. If a message is spam or in the rare case that the first message of a thread is very short or empty, say 1 (work-related)"},
            {"role": "user", "content": "For the following email, please classify it as 1 (work-related) or 0 (non-work-related). Do not include any other text in your response: " + prompt},
        ],
        api_key=api_key,
    )
    return validated_response
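For reference, a minimal way to exercise the guarded call on a single email (the sample email text is invented; llama_groq and the key list come from the sections above). Guardrails returns a ValidationOutcome whose validated_output holds the label, which is exactly what the CSV writer in section 1 records:

# Hypothetical one-off call; the email text is made up for illustration
response = llama_groq.guarded_groq_call(
    prompt="Team, the quarterly budget review has moved to Friday at 10am.",
    api_key=llama_groq.api_keys['groq'][0],
)
print(response.validated_output)  # expected: '1' (work-related)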
Results:
Key Learning:
Essential Papers on LLM Data Labeling
Industry-Leading Implementations:
Key Takeaways:
Isn't training a neural network on LLM-labeled data like a snake eating its own tail?
The Ouroboros Concern:
Three Distinct Degradation Modes in Recursively Trained Models
Epistemological Challenges in Synthetic Learning
The Ouroboros paradigm raises fundamental questions about the nature of machine knowledge:
Recursive training creates hyperreal epistemologies—self-referential systems of knowledge validation that lack external anchors. This mirrors postmodern critiques of media ecosystems, where the distinction between reality and simulation becomes increasingly blurred.
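To make the degradation concrete, here is a toy sketch (my own illustration, not from the post or papers above) of the classic variance-collapse effect: repeatedly fit a Gaussian to samples drawn from the previous generation's fit, and the estimated spread decays in expectation, i.e. the chain progressively forgets the tails of the real distribution:

import numpy as np

def recursive_fit(n_samples=50, n_generations=200, seed=0):
    """Fit a Gaussian to its own samples, generation after generation."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(0.0, 1.0, n_samples)  # generation 0: "real" data
    for _ in range(n_generations):
        mu, sigma = samples.mean(), samples.std()
        samples = rng.normal(mu, sigma, n_samples)  # train on own output
    return samples.std()

# Any single chain is noisy, but averaged over runs the fitted spread
# falls well below the true sigma of 1.0: the tails vanish first.
final_sigmas = [recursive_fit(seed=s) for s in range(100)]
print(round(float(np.mean(final_sigmas)), 3))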
My Take
It's not a perfect circle; it's more like a spiral. Each iteration can add value if we:
What's your take on LLM-powered data labeling?
Have you implemented similar systems?
Let me know in the comments!
#AIEngineering #MachineLearning #DataScience #LLM #Groq