Octopus and Copilot as your own personal deployment firefighter

Imagine that production is down and the cost of lost business is adding up by the minute. This is a five-alarm fire, and you are tasked with solving the issue.

The obvious course of action is to work out what changes may have led to the outage. But the cognitive load of switching focus to an outage is not trivial. When peers and customers ask me about work they have been focused on for weeks, assuming I know exactly what they are talking about, it feels much like being woken from a deep sleep by an alarm. It is a jarring experience that forces you to completely reorient yourself, and that is without the added pressure of an outage.

Unfortunately, the inability to context switch on demand is often seen as a personal failing. Although we all intrinsically know how draining context switching is, too often it is treated as just the cost of working in information-dense environments.

So it is no wonder that our customers are asking us how they can gain better insight into their deployments. When seconds matter, it is crucial to understand what changed in a deployment and how that change may have affected customer-facing applications.

To support this scenario, the Octopus Extension for Copilot has been updated to collect the available information about a deployment, place it in the underlying LLM's context (which is just a fancy way of saying that we make deployment information available to the LLM that powers our extension), and then let customers query that information with plain-text prompts.
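Conceptually, the flow is straightforward: gather the deployment's logs and metadata from the Octopus REST API, then pass that text to the model alongside the user's question. The sketch below is a minimal illustration of that idea rather than the extension's actual implementation. It assumes an Octopus API key in OCTOPUS_API_KEY, an OpenAI key in OPENAI_API_KEY, and a task ID you already know; the raw task log path is an assumption, so check the Octopus REST API documentation for the exact endpoint.

#!/bin/bash
# Minimal sketch: pull a deployment task log from Octopus and ask an LLM about it.
# Assumes OCTOPUS_URL, OCTOPUS_API_KEY, and OPENAI_API_KEY are set.
# The task ID and the /raw log endpoint are illustrative assumptions.

TASK_ID="ServerTasks-12345"

# Fetch the raw deployment log from the Octopus server.
LOG=$(curl -s -H "X-Octopus-ApiKey: ${OCTOPUS_API_KEY}" \
  "${OCTOPUS_URL}/api/tasks/${TASK_ID}/raw")

# Build a chat completion request that places the log in the model's context,
# then print the model's answer.
jq -n --arg log "$LOG" \
  --arg question "Why did this deployment fail and how do I fix it?" \
  '{model: "gpt-4o",
    messages: [
      {role: "system", content: ("Deployment log:\n" + $log)},
      {role: "user", content: $question}
    ]}' |
curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer ${OPENAI_API_KEY}" \
  -H "Content-Type: application/json" \
  -d @- | jq -r '.choices[0].message.content'

The important design point is that the deployment information is injected into the model's context up front, which is what lets a plain-text question return an answer grounded in the actual deployment rather than a generic troubleshooting guide.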


To demonstrate how this works, let's take a look at an actual failure I experienced while deploying the code for the Copilot extension. We can see from the raw logs that a step failed to download a Docker image due to rate limiting:

Let’s now ask the Octopus Copilot extension to provide some advice with the prompt:

@octopus-ai-app How do I fix release "0.1.1982+84df60a.2078.1" of the "Octopus Copilot Function" project to the "Security" environment?

Here is the response:


To fix the release "0.1.1982+84df60a.2078.1" of the "Octopus Copilot Function" project to the "Security" environment, you need to address the specific error that caused the deployment to fail. Here are the steps and suggestions based on the deployment logs and related contexts:

Error Analysis

The deployment failed during the "Scan for Security Vulnerabilities" step with the following error:

2024-09-25T14:37:16Z	FATAL	Fatal error	init error: DB error: failed to download vulnerability DB: database download error: oci download error: failed to fetch the layer: GET https://ghcr.io/v2/aquasecurity/trivy-db/blobs/sha256:80ce6156df7c360a62d884c69c6b315dc4aeb45115ca9afc7fae8688921115c7: TOOMANYREQUESTS: retry-after: 43.869μs, allowed: 44000/minute        

This indicates that the Trivy Docker image pull failed due to rate limiting by the GitHub Container Registry.

Suggested Solutions

1. Enable Step Retry:

- Since the error is related to an intermittent failure (rate limiting), enabling the step retry feature can help mitigate this issue. This feature is not currently enabled for the step.

- To enable step retries, go to the Octopus web portal, navigate to the "Scan for Security Vulnerabilities" step, and check the "Allow retries" option in the "Retries" section under the "Conditions" group.

2. Modify the Script to Handle Rate Limiting:

- Update the script to include retry logic for pulling the Trivy Docker image. Here is an example script with retry logic:

#!/bin/bash

echo "Pulling Trivy Docker Image"
echo "##octopus[stdout-verbose]"

RETRY_COUNT=5
RETRY_DELAY=60

for ((i=1; i<=RETRY_COUNT; i++)); do
    docker pull ghcr.io/aquasecurity/trivy && break
    if [[ $i -lt RETRY_COUNT ]]; then
        echo "Retrying in $RETRY_DELAY seconds..."
        sleep $RETRY_DELAY
    else
        echo "Failed to pull Trivy Docker image after $RETRY_COUNT attempts."
        exit 1
    fi
done

echo "##octopus[stdout-default]"

TIMESTAMP=$(date +%s%3N)
SUCCESS=0
for x in $(find . -name bom.json -type f -print); do
    echo "Scanning ${x}"

    # Delete any existing report file
    if [[ -f "$PWD/depscan-bom.json" ]]; then
      rm "$PWD/depscan-bom.json"
    fi

    # Generate the report, capturing the output, and ensuring $? is set to the exit code
    OUTPUT=$(bash -c "docker run --rm -v \"$PWD/${x}:/app/${x}\" ghcr.io/aquasecurity/trivy sbom -q \"/app/${x}\"; exit \$?" 2>&1)

    # Run again to generate the JSON output (https://github.com/aquasecurity/trivy/issues/268)
    docker run --rm -v "$PWD:/app" ghcr.io/aquasecurity/trivy sbom -q -f json -o /app/depscan-bom.json "/app/${x}"

    new_octopusartifact $PWD/depscan-bom.json

    CRITICAL=$(jq '[.Results[] | .Vulnerabilities[]? | select(.Severity == "CRITICAL")] | length' depscan-bom.json)
    HIGH=$(jq '[.Results[] | .Vulnerabilities[]? | select(.Severity == "HIGH")] | length' depscan-bom.json)

    if [[ "#{Octopus.Environment.Name}" == "Security" ]]
    then
      write_highlight "?? ${CRITICAL} critical vulnerabilities"
      write_highlight "?? ${HIGH} high vulnerabilities"
    fi

    # Success is set to 1 if the exit code is not zero
    if [[ $? -ne 0 ]]; then
        SUCCESS=1
    fi

    # Print the output stripped of ANSI colour codes
    echo -e "${OUTPUT}" | sed 's/\\x1b\\[[0-9;]*m//g'
done

# Cleanup
for i in {1..10}
do
    chmod -R +rw bundle &> /dev/null
    rm -rf bundle &> /dev/null
    if [[ $? == 0 ]]; then break; fi
    echo "Attempting to clean up files"
    sleep 1
done

set_octopusvariable "VerificationResult" $SUCCESS

exit 0        

Relevant Solutions from General Support Tickets and Issues

- Rate Limiting: As seen in the support ticket, rate limiting issues can be mitigated by implementing retry logic in the script.

- Dynamic Workers: Ensure that the dynamic worker has the necessary permissions and network access to pull the Docker image.

By implementing these changes, you should be able to resolve the deployment issue and successfully deploy the "Octopus Copilot Function" project to the "Security" environment.


The LLM (GPT-4o in our case) successfully identified that the error was due to rate limiting. It correctly recognized that enabling Octopus's step retry feature may fix the error, and it also provided an updated script with its own retry logic. And all of this took about a minute.

To put this into context, GPT-4o has a context window of 128,000 tokens. Generally speaking, a token is around 4 characters, or roughly three-quarters of a word, so we can provide the LLM with around 96,000 words of supporting context, or roughly the equivalent of a 280-page book. And, ironically, this is not always enough, meaning we have to be careful not to overfill the context window for prompts like this.
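As a rough back-of-the-envelope check, you can estimate how much of that window a deployment log would consume by counting its characters and dividing by four. This is only an approximation (real tokenizers vary by model and content), and the file name below is just a placeholder:

#!/bin/bash
# Rough token estimate for a log file: characters divided by 4.
# "deployment.log" is a placeholder; actual tokenizers will differ somewhat.
CHARS=$(wc -c < deployment.log)
TOKENS=$((CHARS / 4))
echo "Approximately ${TOKENS} of the 128,000 available tokens"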

But this tells you just how much information you would have to consume to fully understand even an average application deployment. If every line of code that was changed, every line in the deployment logs, every issue that was resolved as part of the deployment, and every step of the deployment process were printed out, it would be the equivalent of a short novel. Is it any wonder this kind of context switching is so arduous?

But, with the Octopus Copilot Extension, we can process all this information in minutes, provide a summary, and suggest a path forward.

Copilot Extensions are now in public beta, meaning anyone can sign up. Check out our documentation for more details.
