How to use AI - Part 3: Infrastructure, Operations, Security, and Agents
In the previous parts (part one and part two) of this blog series, we explored the challenges facing DevOps today, how AI can address them, and how to build powerful AI applications using frameworks like LangChain. Now, in this final part, we'll look at infrastructure options for hosting AI applications, optimizing their performance, enabling guardrails for secure interaction, and using AI agents to automate complex workflows.
Hosting Options for AI Applications
When deploying AI applications, one of the first decisions is choosing the hosting option for our language model. There are two main approaches: cloud-based models and self-hosted models.
Cloud-Based Models
Cloud-based models are provided by companies such as OpenAI and Anthropic, as well as cloud platforms like Azure AI. These services are popular for their ease of use and scalability.
Advantages:
Considerations:
Self-Hosted Models
Self-hosting involves running the model on our own infrastructure using tools like Ollama, Llama.cpp, or LM Studio.
Advantages:
Challenges:
Decision Factors:
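To make the self-hosted route concrete, here is a minimal sketch that queries a locally running Ollama instance over its default HTTP API. It assumes Ollama is installed, the llama2 model has been pulled, and the server is listening on its default port:

import requests

# Query a locally running Ollama instance (default port 11434).
# Assumes `ollama pull llama2` has already been run.
def ask_local_model(prompt: str, model: str = "llama2") -> str:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["response"]

print(ask_local_model("Explain blue-green deployments in one sentence."))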
Optimizing Performance: Inference Speed
Inference speed refers to how quickly a model processes and generates responses. Several factors influence this:
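Whichever factors dominate in a given setup, inference speed is easy to measure directly. As a rough sketch, the following times a single completion and derives an approximate tokens-per-second figure from the usage statistics the API reports (shown here with the OpenAI Python client; other providers expose similar data):

import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Time a single completion and estimate generation speed.
start = time.perf_counter()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarize what a Kubernetes operator does."}],
)
elapsed = time.perf_counter() - start

tokens = response.usage.completion_tokens
print(f"Latency: {elapsed:.2f}s, ~{tokens / elapsed:.1f} tokens/second generated")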
Managing Multiple Models: Language Model Proxying
If the application requires different models for various tasks, language model proxying can help. This technique intelligently routes requests to specific models based on predefined factors or tasks.
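At its core, this is just a mapping from task type to model, and a plain routing table is often enough to start with. The task categories and model choices below are purely illustrative:

# Illustrative routing table: task type -> model name.
# The categories and choices are examples, not recommendations.
MODEL_ROUTES = {
    "code_review": "gpt-4o",               # higher quality, higher cost
    "log_summarization": "gpt-3.5-turbo",  # cheap and fast
    "local_experiments": "ollama/llama2",  # self-hosted via Ollama
}

def select_model(task: str) -> str:
    """Pick a model for the given task, falling back to a cheap default."""
    return MODEL_ROUTES.get(task, "gpt-3.5-turbo")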
LiteLLM: Simplifying Multi-Model Management
LiteLLM is an open-source framework designed to simplify working with multiple language models. It provides a standardized API for calling over 100 different LLMs from providers such as OpenAI, Anthropic, Google Gemini, and Hugging Face.
Benefits:
Example: Using LiteLLM for Multi-Model Applications
Here’s how we can use LiteLLM to manage multiple LLMs in a single application:
import streamlit as st
from litellm import completion

st.title("Multi-Model Chat")

# LiteLLM completion function to get the model response
def get_model_response(model_name: str, prompt: str) -> str:
    response = completion(model=model_name, messages=[{"role": "user", "content": prompt}])
    return response.choices[0].message.content

# Streamlit UI selection for the model
model_option = st.selectbox("Choose a language model:", ("gpt-3.5-turbo", "ollama/llama2", "gpt-4o"))

# Chat history session state
if 'chat_history' not in st.session_state:
    st.session_state['chat_history'] = []

user_input = st.text_input("You:")

# Send user input and get the model response
if st.button("Send") and user_input:
    st.session_state['chat_history'].append({"role": "user", "content": user_input})
    with st.spinner("Thinking..."):
        response = get_model_response(model_name=model_option, prompt=user_input)
    st.session_state['chat_history'].append({"role": "model", "content": response})

# Display the conversation, newest messages first
for message in reversed(st.session_state['chat_history']):
    st.write(f"{message['role'].capitalize()}: {message['content']}")
In this example, we create a simple chat application that allows users to choose from different language models. The get_model_response function sends the user input to the selected model and returns the response. The chat_history session state retains the conversation history, and the Streamlit interface displays the chat messages in an interactive web UI.
Monitoring LLM Applications
Once our AI application is running, monitoring its performance is critical to ensure reliability and efficiency.
Key Metrics to Monitor
Performance Metrics:
User Engagement Metrics:
Observability:
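Whatever the exact metric set, the raw numbers have to come from somewhere, and the simplest starting point is to instrument the calls themselves. A minimal sketch that logs latency and token usage per request, assuming an OpenAI-style response object with a usage field (which LiteLLM normalizes to), could look like this:

import logging
import time
from litellm import completion

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm-metrics")

def monitored_completion(model: str, prompt: str) -> str:
    """Wrap an LLM call and log basic performance metrics."""
    start = time.perf_counter()
    response = completion(model=model, messages=[{"role": "user", "content": prompt}])
    latency = time.perf_counter() - start

    usage = response.usage  # OpenAI-style usage statistics
    logger.info(
        "model=%s latency=%.2fs prompt_tokens=%s completion_tokens=%s",
        model, latency, usage.prompt_tokens, usage.completion_tokens,
    )
    return response.choices[0].message.content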
Tools for Monitoring
Cost Considerations for LLM Applications
Running LLM applications can be expensive, so understanding and managing costs is essential.
Key Cost Factors
Cost per Token/Character:
Volume Discounts:
Additional Features:
Usage Limits/Quotas:
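To make token-based pricing tangible, here is a small sketch that estimates the cost of a single request from its token counts. The per-1K-token prices are placeholders for illustration, not current list prices:

# Illustrative per-1K-token prices (placeholders, not current list prices).
PRICES_PER_1K = {
    "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
    "gpt-4o": {"input": 0.005, "output": 0.015},
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the USD cost of one request from its token counts."""
    price = PRICES_PER_1K[model]
    return (prompt_tokens / 1000) * price["input"] + (completion_tokens / 1000) * price["output"]

print(f"~${estimate_cost('gpt-4o', 1200, 300):.4f} for one request")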
Best Practices
Security Challenges for Publicly Accessible LLM Applications
Public-facing LLM applications come with unique security challenges.
Potential Threats
Mitigation Strategies
Input Validation and Sanitization:
Generation Guardrails:
Output Validation:
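As a small, concrete illustration of the first mitigation, input validation and sanitization: a lightweight pre-filter can reject overly long or obviously suspicious prompts before they ever reach the model. The patterns below are illustrative and no substitute for a dedicated guardrail framework:

import re

# Illustrative deny-list of prompt-injection phrases; real deployments
# should combine this with proper guardrails and output validation.
SUSPICIOUS_PATTERNS = [
    r"ignore .{0,30}instructions",
    r"reveal .{0,30}system prompt",
    r"disregard .{0,30}(rules|guidelines)",
]

def is_suspicious(user_input: str) -> bool:
    text = user_input.lower()
    return any(re.search(pattern, text) for pattern in SUSPICIOUS_PATTERNS)

def safe_prompt(user_input: str, max_length: int = 2000) -> str:
    """Basic sanitization: enforce a length limit and a simple injection deny-list."""
    if len(user_input) > max_length:
        raise ValueError("Input too long")
    if is_suspicious(user_input):
        raise ValueError("Potential prompt injection detected")
    return user_input.strip()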
Agents: Automating Complex Workflows
AI agents go beyond generating text — they perform tasks autonomously, enabling complex, multi-step workflows. Let’s explore their key characteristics and how to build them.
Key Characteristics of AI Agents
Key Components of an AI Agent
An AI agent consists of several components that work together to perform tasks effectively:
Tools
Memory
Planning
Execution and Feedback
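To give at least one of these components a concrete shape, here is a minimal sketch of a tool definition using LangChain's tool decorator. The restart_service function is a hypothetical stub for illustration, not a real operation:

from langchain.tools import tool

@tool
def restart_service(service_name: str) -> str:
    """Restart a service on the target host (hypothetical stub)."""
    # In a real agent this would call out to the actual infrastructure,
    # e.g. via an API or configuration management tooling.
    return f"Service '{service_name}' was restarted."

The docstring matters here: it is what the language model sees when deciding whether and how to use the tool.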
LangGraph: A Visual Tool for Agent Workflows
LangGraph is another component of the LangChain ecosystem. It lets us design and manage agent workflows as graphs of nodes and edges, which can also be rendered visually, and it simplifies connecting different components and orchestrating complex tasks.
Capabilities of LangGraph:
Example: Research and Summarization Agent
This agent performs web research and summarizes the results.
from langchain.chat_models import ChatOpenAI
from langchain.tools import DuckDuckGoSearchRun
from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated, List
import operator

# Initialize models and tools
research_model = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)
summary_model = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.3)
search_tool = DuckDuckGoSearchRun()

# Define the graph state structure
class ResearchState(TypedDict):
    query: str
    research_results: Annotated[List[str], operator.add]
    summary: str

graph = StateGraph(ResearchState)

# Define agent functions
def research_agent(state):
    query = state["query"]
    search_result = search_tool.run(query)
    return {"research_results": [search_result]}

def summarization_agent(state):
    research_results = state["research_results"]
    summary_prompt = f"Summarize the following research results:\n{research_results}"
    summary = summary_model.predict(summary_prompt)
    return {"summary": summary}

# Build the graph
graph.add_node("research", research_agent)
graph.add_node("summarize", summarization_agent)
graph.set_entry_point("research")
graph.add_edge("research", "summarize")
graph.add_edge("summarize", END)

# Compile and run
app = graph.compile()

def run_research(query):
    result = app.invoke({"query": query, "research_results": [], "summary": ""})
    return result["summary"]

# Example usage
research_topic = "Latest advancements in AI"
summary = run_research(research_topic)
print(f"Research Summary on '{research_topic}':\n{summary}")
In this example, we define a research agent that performs a web search on a given topic and summarizes the results. The agent uses LangGraph to create a state machine that guides the workflow through research and summarization steps. The run_research function initiates the agent with a research query and returns the final summary.
Visualized, this simple agent workflow is a straight line through the graph: research → summarize → END.
Conclusion
In this final part of our blog series, we explored the critical aspects of deploying and managing AI applications for DevOps engineers. From hosting options and performance optimization to monitoring, cost management, and security, we covered the essential considerations for building reliable and efficient AI-powered systems. Additionally, we talked about AI agents, showcasing their ability to automate complex workflows and adapt to dynamic environments.
Key Takeaways from the Blog Series
This blog series explored how AI is transforming DevOps, from addressing challenges to building and deploying advanced AI applications. Let's quickly recap the key takeaways:
Part 1: The Building Blocks of DevOps AI
Part 2: Building AI Applications with LangChain
Part 3: Infrastructure, Operations, Security, and Agents
Thank you for joining us on this journey! Stay curious and feel free to reach out if you have any questions or need further guidance.
Again, if you're hungry for more details, make sure to check out the video recordings of our latest AI Basics for DevOps Engineers workshop on YouTube: