Octopus and Copilot as your own personal deployment firefighter
Imagine that production is down and the cost of lost business is mounting by the minute. This is a five-alarm fire, and you are tasked with solving the issue.
The obvious course of action is to work out what changes may have led to the outage. But the cognitive load of switching focus to an outage is not trivial. Fielding questions from peers and customers about work they have been focused on for weeks, asked with the assumption that I know exactly what they are talking about, feels much like being woken from a deep sleep by an alarm. It is a jarring experience that forces you to completely reorient yourself, and that is without the added pressure of an outage.
Unfortunately, the inability to context switch on demand is often seen as a personal failing. Although we all intrinsically know how draining context switching is, too often it is treated as just the cost of working in information-dense environments.
So it is no wonder that our customers are asking how they can gain better insights into their deployments. When seconds matter, it is crucial to understand what changed in a deployment and how that change may have affected customer-facing applications.
To support this scenario, the Octopus Extension for Copilot has been updated to combine the available information from a deployment, place it in the underlying LLM's context (which is just a fancy way of saying that we make deployment information available to the LLM that powers our extension), and allow customers to query that information with plain-text prompts.
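Under the hood, "placing information in the LLM's context" just means including it in the prompt alongside the customer's question. Here is a minimal sketch of that general idea, not the extension's actual implementation: it stuffs a (hypothetical) deployment log file into a chat request using the OpenAI API:

#!/bin/bash
# Illustration only: "context stuffing" means prepending the deployment data
# to the prompt. The extension does this for you; here we do it manually.
# deployment-logs.txt is a hypothetical file captured from a deployment.
DEPLOYMENT_LOGS=$(cat deployment-logs.txt)
QUESTION="Why did this deployment fail?"
curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg logs "$DEPLOYMENT_LOGS" --arg q "$QUESTION" \
    '{model: "gpt-4o", messages: [
      {role: "system", content: ("You answer questions about this deployment:\n" + $logs)},
      {role: "user", content: $q}]}')"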
To demonstrate how this works, let's take a look at an actual failure I experienced while deploying the code for the Copilot extension. The raw logs show that a step failed to download a Docker image due to rate limiting.
Let’s now ask the Octopus Copilot extension to provide some advice with the prompt:
@octopus-ai-app How do I fix release "0.1.1982+84df60a.2078.1" of the "Octopus Copilot Function" project to the "Security" environment?
Here is the response:
To fix the release "0.1.1982+84df60a.2078.1" of the "Octopus Copilot Function" project to the "Security" environment, you need to address the specific error that caused the deployment to fail. Here are the steps and suggestions based on the deployment logs and related contexts:
Error Analysis
The deployment failed during the "Scan for Security Vulnerabilities" step with the following error:
2024-09-25T14:37:16Z FATAL Fatal error init error: DB error: failed to download vulnerability DB: database download error: oci download error: failed to fetch the layer: GET https://ghcr.io/v2/aquasecurity/trivy-db/blobs/sha256:80ce6156df7c360a62d884c69c6b315dc4aeb45115ca9afc7fae8688921115c7: TOOMANYREQUESTS: retry-after: 43.869μs, allowed: 44000/minute
This indicates that the Trivy Docker image pull failed due to rate limiting by the GitHub Container Registry.
Suggested Solutions
1. Enable Step Retry:
- Since the error is related to an intermittent failure (rate limiting), enabling the step retry feature can help mitigate this issue. This feature is not currently enabled for the step.
- To enable step retries, go to the Octopus web portal, navigate to the "Scan for Security Vulnerabilities" step, and check the "Allow retries" option in the "Retries" section under the "Conditions" group.
2. Modify the Script to Handle Rate Limiting:
- Update the script to include retry logic for pulling the Trivy Docker image. Here is an example script with retry logic:
#!/bin/bash
echo "Pulling Trivy Docker Image"
echo "##octopus[stdout-verbose]"
RETRY_COUNT=5
RETRY_DELAY=60
for ((i=1; i<=RETRY_COUNT; i++)); do
  docker pull ghcr.io/aquasecurity/trivy && break
  if [[ $i -lt RETRY_COUNT ]]; then
    echo "Retrying in $RETRY_DELAY seconds..."
    sleep $RETRY_DELAY
  else
    echo "Failed to pull Trivy Docker image after $RETRY_COUNT attempts."
    exit 1
  fi
done
echo "##octopus[stdout-default]"
TIMESTAMP=$(date +%s%3N)
SUCCESS=0
for x in $(find . -name bom.json -type f -print); do
  echo "Scanning ${x}"
  # Delete any existing report file
  if [[ -f "$PWD/depscan-bom.json" ]]; then
    rm "$PWD/depscan-bom.json"
  fi
  # Generate the report, capturing the output and the exit code
  OUTPUT=$(bash -c "docker run --rm -v \"$PWD/${x}:/app/${x}\" ghcr.io/aquasecurity/trivy sbom -q \"/app/${x}\"; exit \$?" 2>&1)
  # Capture the exit code immediately, before later commands overwrite $?
  EXIT_CODE=$?
  # Run again to generate the JSON output (https://github.com/aquasecurity/trivy/issues/268)
  docker run --rm -v "$PWD:/app" ghcr.io/aquasecurity/trivy sbom -q -f json -o /app/depscan-bom.json "/app/${x}"
  new_octopusartifact $PWD/depscan-bom.json
  CRITICAL=$(jq '[.Results[] | .Vulnerabilities[]? | select(.Severity == "CRITICAL")] | length' depscan-bom.json)
  HIGH=$(jq '[.Results[] | .Vulnerabilities[]? | select(.Severity == "HIGH")] | length' depscan-bom.json)
  if [[ "#{Octopus.Environment.Name}" == "Security" ]]; then
    write_highlight "?? ${CRITICAL} critical vulnerabilities"
    write_highlight "?? ${HIGH} high vulnerabilities"
  fi
  # SUCCESS is set to 1 if the scan exited with a non-zero code
  if [[ $EXIT_CODE -ne 0 ]]; then
    SUCCESS=1
  fi
  # Print the output stripped of ANSI colour codes
  echo -e "${OUTPUT}" | sed 's/\x1b\[[0-9;]*m//g'
done
# Cleanup
for i in {1..10}; do
  chmod -R +rw bundle &> /dev/null
  rm -rf bundle &> /dev/null
  if [[ $? == 0 ]]; then break; fi
  echo "Attempting to clean up files"
  sleep 1
done
set_octopusvariable "VerificationResult" $SUCCESS
exit 0
Relevant Solutions from General Support Tickets and Issues
- Rate Limiting: As seen in the support ticket, rate limiting issues can be mitigated by implementing retry logic in the script.
- Dynamic Workers: Ensure that the dynamic worker has the necessary permissions and network access to pull the Docker image.
By implementing these changes, you should be able to resolve the deployment issue and successfully deploy the "Octopus Copilot Function" project to the "Security" environment.
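As a quick aside on that last suggestion about dynamic workers: you can verify from the worker that the GitHub Container Registry is reachable at all. A minimal sketch follows; note that the unauthenticated /v2/ endpoint normally returns a 401, which is enough to prove connectivity:

# Hypothetical connectivity check, run from the worker
curl -sI https://ghcr.io/v2/ | head -n 1
docker pull ghcr.io/aquasecurity/trivy:latest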
The LLM (GPT-4o in our case) successfully identified that the error was due to rate limiting. It correctly identified that the retry feature in Octopus may fix the error, and it provided an updated version of the step's script with retry logic added. All of this took about a minute.
To put this into context, GPT-4o has a context window of 128,000 tokens. Generally speaking, a token is equal to about 4 characters, or roughly three-quarters of an English word. This means we can provide an LLM with around 100,000 words of supporting context, or roughly the equivalent of a 280-page book. And, ironically, this is still not enough, meaning we have to be careful not to overfill the context window for prompts like this.
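As a rough back-of-the-envelope check (the ratios are assumptions that vary by tokenizer and text, roughly 4 characters per token and 5 characters per English word):

#!/bin/bash
# Back-of-the-envelope sizing of a 128,000 token context window
TOKENS=128000
CHARS=$((TOKENS * 4))    # ~512,000 characters at ~4 characters per token
WORDS=$((CHARS / 5))     # ~102,000 words at ~5 characters per word
PAGES=$((WORDS / 350))   # ~290 pages at ~350 words per page
echo "~${CHARS} characters, ~${WORDS} words, ~${PAGES} pages"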
But this tells you just how much information you would have to consume to fully understand even an average application deployment. If every line of code that was changed, every line in the deployment logs, every issue that was resolved as part of the deployment, and every step of the deployment process was printed out, it would be the equivalent of a short novel. Is it any wonder this kind of context switching is so arduous?
But, with the Octopus Copilot Extension, we can process all this information in minutes, provide a summary, and suggest a path forward.
Copilot Extensions are now in public beta, meaning anyone can sign up. Check out our documentation for more details.