Octopus and Copilot as your own personal deployment firefighter
Imagine that production is down and the cost of lost business is mounting by the minute. This is a five-alarm fire, and you are tasked with solving the issue.
The obvious course of action is to work out what changes may have led to the outage. But the cognitive load of switching focus to an outage is not trivial. Fielding questions from peers and customers about work they have been focused on for weeks, asked with the assumption that I know exactly what they are talking about, feels much like being woken from a deep sleep by an alarm. It is a jarring experience that forces you to completely reorient yourself, and that is without the added pressure of an outage.
Unfortunately, the inability to context switch on demand is often seen as a personal failing. Although we all intrinsically know how draining context switching is, too often it is treated as just the cost of working in information-dense environments.
So it is no wonder that our customers are asking how they can gain better insights into their deployments. When seconds matter, it is crucial to understand what changed in a deployment and how that change may have affected customer-facing applications.
To support this scenario, the Octopus Extension for Copilot has been updated to combine the available information from a deployment, place it in the underlying LLM's context (which is just a fancy way of saying that we make deployment information available to the LLM that powers our extension), and allow customers to query that information with plain-text prompts.
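Under the hood, "placing information in the LLM's context" just means including it in the prompt alongside the customer's question. Here is a minimal sketch of that general idea, not the extension's actual implementation: it stuffs a (hypothetical) deployment log file into a chat request using the OpenAI API:

#!/bin/bash
# Illustration only: "context stuffing" means prepending the deployment data
# to the prompt. The extension does this for you; here we do it manually.
# deployment-logs.txt is a hypothetical file captured from a deployment.
DEPLOYMENT_LOGS=$(cat deployment-logs.txt)
QUESTION="Why did this deployment fail?"
curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg logs "$DEPLOYMENT_LOGS" --arg q "$QUESTION" \
    '{model: "gpt-4o", messages: [
      {role: "system", content: ("You answer questions about this deployment:\n" + $logs)},
      {role: "user", content: $q}]}')"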
To demonstrate how this works, let's take a look at an actual failure I experienced while deploying the code for the Copilot extension. The raw logs show that a step failed to download a Docker image due to rate limiting.
Let’s now ask the Octopus Copilot extension to provide some advice with the prompt:
@octopus-ai-app How do I fix release "0.1.1982+84df60a.2078.1" of the "Octopus Copilot Function" project to the "Security" environment?
Here is the response:
To fix the release "0.1.1982+84df60a.2078.1" of the "Octopus Copilot Function" project to the "Security" environment, you need to address the specific error that caused the deployment to fail. Here are the steps and suggestions based on the deployment logs and related contexts:
Error Analysis
The deployment failed during the "Scan for Security Vulnerabilities" step with the following error:
2024-09-25T14:37:16Z FATAL Fatal error init error: DB error: failed to download vulnerability DB: database download error: oci download error: failed to fetch the layer: GET https://ghcr.io/v2/aquasecurity/trivy-db/blobs/sha256:80ce6156df7c360a62d884c69c6b315dc4aeb45115ca9afc7fae8688921115c7: TOOMANYREQUESTS: retry-after: 43.869μs, allowed: 44000/minute
This indicates that the Trivy Docker image pull failed due to rate limiting by the GitHub Container Registry.
Suggested Solutions
1. Enable Step Retry:
- Since the error is related to an intermittent failure (rate limiting), enabling the step retry feature can help mitigate this issue. This feature is not currently enabled for the step.
- To enable step retries, go to the Octopus web portal, navigate to the "Scan for Security Vulnerabilities" step, and check the "Allow retries" option in the "Retries" section under the "Conditions" group.
2. Modify the Script to Handle Rate Limiting:
- Update the script to include retry logic for pulling the Trivy Docker image. Here is an example script with retry logic:
#!/bin/bash
echo "Pulling Trivy Docker Image"
echo "##octopus[stdout-verbose]"
RETRY_COUNT=5
RETRY_DELAY=60
for ((i=1; i<=RETRY_COUNT; i++)); do
  docker pull ghcr.io/aquasecurity/trivy && break
  if [[ $i -lt RETRY_COUNT ]]; then
    echo "Retrying in $RETRY_DELAY seconds..."
    sleep $RETRY_DELAY
  else
    echo "Failed to pull Trivy Docker image after $RETRY_COUNT attempts."
    exit 1
  fi
done
echo "##octopus[stdout-default]"
TIMESTAMP=$(date +%s%3N)
SUCCESS=0
for x in $(find . -name bom.json -type f -print); do
  echo "Scanning ${x}"
  # Delete any existing report file
  if [[ -f "$PWD/depscan-bom.json" ]]; then
    rm "$PWD/depscan-bom.json"
  fi
  # Generate the report, capturing the output and the exit code
  OUTPUT=$(bash -c "docker run --rm -v \"$PWD/${x}:/app/${x}\" ghcr.io/aquasecurity/trivy sbom -q \"/app/${x}\"; exit \$?" 2>&1)
  # Capture the exit code immediately, before later commands overwrite $?
  EXIT_CODE=$?
  # Run again to generate the JSON output (https://github.com/aquasecurity/trivy/issues/268)
  docker run --rm -v "$PWD:/app" ghcr.io/aquasecurity/trivy sbom -q -f json -o /app/depscan-bom.json "/app/${x}"
  new_octopusartifact $PWD/depscan-bom.json
  CRITICAL=$(jq '[.Results[] | .Vulnerabilities[]? | select(.Severity == "CRITICAL")] | length' depscan-bom.json)
  HIGH=$(jq '[.Results[] | .Vulnerabilities[]? | select(.Severity == "HIGH")] | length' depscan-bom.json)
  if [[ "#{Octopus.Environment.Name}" == "Security" ]]; then
    write_highlight "?? ${CRITICAL} critical vulnerabilities"
    write_highlight "?? ${HIGH} high vulnerabilities"
  fi
  # SUCCESS is set to 1 if the scan exited with a non-zero code
  if [[ $EXIT_CODE -ne 0 ]]; then
    SUCCESS=1
  fi
  # Print the output stripped of ANSI colour codes
  echo -e "${OUTPUT}" | sed 's/\x1b\[[0-9;]*m//g'
done
# Cleanup
for i in {1..10}; do
  chmod -R +rw bundle &> /dev/null
  rm -rf bundle &> /dev/null
  if [[ $? == 0 ]]; then break; fi
  echo "Attempting to clean up files"
  sleep 1
done
set_octopusvariable "VerificationResult" $SUCCESS
exit 0
Relevant Solutions from General Support Tickets and Issues
- Rate Limiting: As seen in the support ticket, rate limiting issues can be mitigated by implementing retry logic in the script.
- Dynamic Workers: Ensure that the dynamic worker has the necessary permissions and network access to pull the Docker image.
By implementing these changes, you should be able to resolve the deployment issue and successfully deploy the "Octopus Copilot Function" project to the "Security" environment.
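As a quick aside on that last suggestion about dynamic workers: you can verify from the worker that the GitHub Container Registry is reachable at all. A minimal sketch follows; note that the unauthenticated /v2/ endpoint normally returns a 401, which is enough to prove connectivity:

# Hypothetical connectivity check, run from the worker
curl -sI https://ghcr.io/v2/ | head -n 1
docker pull ghcr.io/aquasecurity/trivy:latest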
The LLM (GPT-4o in our case) successfully identified that the error was due to rate limiting. It correctly identified that the retry feature in Octopus may fix the error, and it provided an updated version of the step's script with retry logic added. All of this took about a minute.
To put this into context, GPT-4o has a context window of 128,000 tokens. Generally speaking, a token is equal to about 4 characters, or roughly three-quarters of an English word. This means we can provide an LLM with around 100,000 words of supporting context, or roughly the equivalent of a 280-page book. And, ironically, this is still not enough, meaning we have to be careful not to overfill the context window for prompts like this.
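As a rough back-of-the-envelope check (the ratios are assumptions that vary by tokenizer and text, roughly 4 characters per token and 5 characters per English word):

#!/bin/bash
# Back-of-the-envelope sizing of a 128,000 token context window
TOKENS=128000
CHARS=$((TOKENS * 4))    # ~512,000 characters at ~4 characters per token
WORDS=$((CHARS / 5))     # ~102,000 words at ~5 characters per word
PAGES=$((WORDS / 350))   # ~290 pages at ~350 words per page
echo "~${CHARS} characters, ~${WORDS} words, ~${PAGES} pages"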
But this tells you just how much information you would have to consume to fully understand even an average application deployment. If every line of code that was changed, every line in the deployment logs, every issue that was resolved as part of the deployment, and every step of the deployment process was printed out, it would be the equivalent of a short novel. Is it any wonder this kind of context switching is so arduous?
But, with the Octopus Copilot Extension, we can process all this information in minutes, provide a summary, and suggest a path forward.
Copilot Extensions are now in public beta, meaning anyone can sign up. Check out our documentation for more details.