DevSecOps Balance
DevSecOps is a process for developing (dev) secure (sec) software that can be deployed into operational (ops) environments in a safe, efficient, and scalable fashion. As a process, it evolves over time in healthy and adaptive organizations due to changing technologies and client and organizational needs. We’ll review engineering practices that support DevSecOps, discuss how various stakeholders perceive risk in them, and conclude with key agreements that will help prioritize improvements. Security-centric examples are used throughout.
Engineering
Engineers implementing DevSecOps should follow these best practices. Some candidate technologies are mentioned where applicable.
Pipeline automation. “I don’t know why it doesn’t work in production - it worked on my machine!” This refrain is never heard in a properly pipelined shop. A pipeline is a series of promotion stages, each corresponding to an environment with its own tooling and processes for handling packages. Packages are self-contained and independently deployable. Automated builds (Jenkins) create controlled versions of the packages, assuring that all and only the necessary dependencies are included. Packages move through stages that reflect organizational and customer needs; these often include development, integration, quality assurance (QA), and production. The tooling and processes for each pipeline stage evolve over time.
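As a rough illustration (not any particular build system’s API), a minimal Python sketch of such a build step might produce a versioned package manifest recording exactly which dependencies were bundled; the names, versions, and file layout below are hypothetical.

    # Sketch: an automated build step that emits a versioned, self-contained
    # package manifest listing all and only the bundled dependencies.
    import hashlib, json, pathlib

    def build_package(name: str, version: str, dependencies: dict) -> dict:
        payload = json.dumps({"name": name, "version": version,
                              "dependencies": dependencies}, sort_keys=True)
        manifest = {
            "package": f"{name}-{version}",
            "dependencies": dependencies,  # exactly what the build pulled in
            "sha256": hashlib.sha256(payload.encode()).hexdigest(),
        }
        pathlib.Path(f"{name}-{version}.manifest.json").write_text(
            json.dumps(manifest, indent=2))
        return manifest

    print(build_package("billing-service", "1.4.2", {"requests": "2.31.0"}))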
Shift-Left. It’s better to fix problems earlier in the software development life cycle. Thus, the pipeline should send developers automated, actionable feedback from every stage. Static code analysis tools (SonarQube) can automatically check code as it’s being typed, highlighting possible security concerns just like a spell checker, with an explanation of why each is a problem and examples of how to fix it. Load testing in QA can notify developers if the package doesn’t scale well under high load.
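One common way to shift feedback left is a pre-commit hook that runs locally the same static security scan the pipeline will run later. The sketch below assumes the open-source Bandit scanner is installed (any equivalent tool works) and that the scanner signals findings with a nonzero exit code; the project layout is hypothetical.

    # Sketch: a pre-commit hook that blocks the commit when the local static
    # security scan reports findings. The scan command is a placeholder for
    # whatever tool the organization standardizes on.
    import subprocess, sys

    SCAN_COMMAND = ["bandit", "-r", "src/"]  # hypothetical project layout

    def main() -> int:
        result = subprocess.run(SCAN_COMMAND, capture_output=True, text=True)
        if result.returncode != 0:  # most scanners exit nonzero on findings
            print("Security findings detected; commit blocked:")
            print(result.stdout)
            return 1
        print("No security findings; proceeding with commit.")
        return 0

    if __name__ == "__main__":
        sys.exit(main())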
Trust-but-verify. Each stage of the pipeline performs gate checks that may block promotion of packages unless they pass defined criteria. For example, promotion from development to integration can require completing functional checks driven by behavior-driven development (Cucumber) and security checks (such as repeating the static code analysis done on developers’ computers to confirm that they fixed any security vulnerabilities).
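A gate can be as simple as a script that refuses promotion unless every configured check succeeds. In the sketch below, the BDD and scanner command lines are placeholders for whatever tools the organization actually runs at that stage.

    # Sketch: a promotion gate from development to integration. Each command
    # is a stand-in for the real functional (BDD) and security checks.
    import subprocess

    GATE_CHECKS = {
        "behavior_tests": ["cucumber", "features/"],                 # hypothetical BDD suite
        "static_analysis": ["static-scan", "billing-service"],       # hypothetical scanner CLI
    }

    def gate_passes() -> bool:
        for name, command in GATE_CHECKS.items():
            if subprocess.run(command).returncode != 0:
                print(f"Gate check failed: {name}; promotion blocked.")
                return False
        return True

    if gate_passes():
        print("All gate checks passed; promoting to integration.")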
Leverage open-source security information. The security community continuously discovers new vulnerabilities, including in third-party dependencies, tools, and commercial products. Many security scanning tools use this information by updating themselves with details on the latest vulnerabilities, including names and versions of vulnerable products, hashes for indicators of compromise (IOCs), corresponding Common Vulnerabilities and Exposures (CVE) identifiers, and mitigation details. Whenever possible, engineers should use tools that leverage this information through frequent, automatic updates (Snyk, Trivy). When a critical vulnerability hits mainstream media (Log4Shell), such tools should already catch the problem (along with many others that don’t make the news), with no additional effort.
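To make this concrete, the sketch below compares a declared dependency list against a locally mirrored vulnerability feed of the sort such tools refresh automatically. The feed format is simplified, though the Log4Shell CVE identifier shown is real.

    # Sketch: matching declared dependencies against an auto-updated
    # vulnerability feed (represented here as an in-memory list).
    vulnerability_feed = [
        {"package": "log4j-core", "versions": ["2.0", "2.14.1"], "cve": "CVE-2021-44228"},
    ]

    dependencies = {"log4j-core": "2.14.1", "requests": "2.31.0"}

    def find_known_vulnerabilities(deps, feed):
        findings = []
        for entry in feed:
            version = deps.get(entry["package"])
            if version and version in entry["versions"]:
                findings.append((entry["package"], version, entry["cve"]))
        return findings

    for package, version, cve in find_known_vulnerabilities(dependencies, vulnerability_feed):
        print(f"{package} {version} is affected by {cve}; upgrade required.")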
Dependency and supply-chain control. All third-party dependencies and tools are vetted by regular software composition analysis (SCA) scans (Snyk, Trivy). The pipeline should use specific versions (never just “pulling the latest”) and manage them through a process-controlled dependency cache (Artifactory). Developers and the pipeline should use only that cache when creating deployable packages.
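A lightweight version of this control is a pipeline check that rejects any dependency that isn’t pinned to an exact version; the manifest format and internal cache URL below are illustrative assumptions.

    # Sketch: fail the build if any dependency is unpinned or resolved outside
    # the process-controlled cache.
    APPROVED_INDEX = "https://artifacts.example.internal/pypi/simple"  # hypothetical cache URL

    requirements = [
        "requests==2.31.0",
        "flask>=2.0",        # not pinned; should fail the check
    ]

    def violations(reqs):
        return [r for r in reqs if "==" not in r]

    bad = violations(requirements)
    if bad:
        print(f"Unpinned dependencies found: {bad}")
        print(f"Pin exact versions and resolve them only from {APPROVED_INDEX}.")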
Microservice architecture. Faster customer delivery is best achieved through a loosely coupled architecture. This is easier with microservices running in containers (Docker). A microservice performs one small, focused task, with a clear, contractual, self-documented interface (OpenAPI). Implementing them as containers enables them to be orchestrated (Kubernetes). Orchestration changes the number and distribution of containers dynamically based on load and other factors. This makes each microservice more reliable, scalable, and fault tolerant, which increases the resiliency of the entire system.
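For illustration, a microservice with one focused task and a health endpoint for the orchestrator might look like the sketch below (assuming Flask 2.x is installed; the conversion rates, routes, and service purpose are hypothetical).

    # Sketch: one small, focused microservice with a clear interface and a
    # health endpoint that an orchestrator can probe.
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    RATES = {"USD_EUR": 0.92}  # placeholder data

    @app.get("/convert")
    def convert():
        amount = float(request.args.get("amount", "0"))
        return jsonify({"amount": amount,
                        "converted": round(amount * RATES["USD_EUR"], 2)})

    @app.get("/healthz")
    def healthz():
        # Orchestrators use this to decide whether to restart or rebalance
        # the container.
        return jsonify({"status": "ok"})

    if __name__ == "__main__":
        app.run(port=8080)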
Deployable package (container) construction: common, slim, simple. Each microservice is constructed from a slim base image. For example, there could be one base image for each language the organization uses (Python, Java) which contains the minimum number of operating system dependencies (Alpine). On top of that base, developers typically add one application (the compiled code they authored). If any microservice is more than a few megabytes, this is a sign of too much logic in one service and suggests a need to refactor the code. For example, business logic shouldn’t be included in a microservice that supports other functions.
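A simple guard for this is a post-build size budget. The sketch below assumes the Docker CLI is available and the image exists locally; the image name and byte budget are illustrative, not a recommendation.

    # Sketch: flag containers that grow past a size budget - a rough signal
    # that too much logic has crept into one service.
    import subprocess

    IMAGE = "registry.example.internal/billing-service:1.4.2"  # hypothetical
    BUDGET_BYTES = 50 * 1024 * 1024                            # illustrative budget

    result = subprocess.run(
        ["docker", "image", "inspect", IMAGE, "--format", "{{.Size}}"],
        capture_output=True, text=True, check=True)
    size = int(result.stdout.strip())
    if size > BUDGET_BYTES:
        print(f"{IMAGE} is {size} bytes; consider refactoring the service.")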
Infrastructure as code (IAC). Microservice orchestration, multi-stage pipelines, and repeatable, automated, controlled builds achieve their full potential only if every aspect of the delivery is managed - including the operational environments themselves. When machine instances, networking, security controls, certificates, and the like are managed through IAC (Terraform, Ansible), the pipeline can manage environments as well as their contents.
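When environments are declared as code, drift detection becomes just another pipeline step. The sketch below assumes an initialized Terraform working directory and relies on Terraform’s documented -detailed-exitcode convention (0 = no changes, 2 = pending changes).

    # Sketch: a pipeline step that asks Terraform whether the environment has
    # drifted from its declared state.
    import subprocess

    result = subprocess.run(["terraform", "plan", "-detailed-exitcode"])
    if result.returncode == 0:
        print("Environment matches the declared infrastructure code.")
    elif result.returncode == 2:
        print("Drift or pending changes detected; review and apply through the pipeline.")
    else:
        print("terraform plan failed; investigate before promoting.")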
Frequent patching and upgrading. The pipeline tooling and runtime environments (for the pipeline’s target stages and for the pipeline itself) need regular maintenance to protect against security vulnerabilities. Maintenance and upgrades are simpler when these are implemented as IAC and containers. For example, upgrading a tool often means simply launching the new version of the tool in a new container, testing it, and promoting it.
Security through policy as code (PaC). Security checks are enforced at each stage of the pipeline through policies expressed and enforced as code. For example, PaC can be implemented as a configuration file that instructs the pipeline to check for safer container construction - such as assuring that containers don’t run as root and that they use corporate-approved, secure base images. As policies are added or modified, PaC is adjusted, pipelined, and tested the same as any other code.
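PaC need not be elaborate; the sketch below evaluates a small, versioned policy document against a container’s build metadata. The policy fields, registry names, and metadata shape are simplified assumptions.

    # Sketch: evaluate a container's build metadata against a policy document
    # that is itself versioned and pipelined like any other code.
    POLICY = {
        "forbid_root_user": True,
        "approved_base_images": ["registry.example.internal/base/python:3.12-slim"],  # hypothetical
    }

    container = {"user": "app",
                 "base_image": "registry.example.internal/base/python:3.12-slim"}

    def policy_violations(container, policy):
        problems = []
        if policy["forbid_root_user"] and container["user"] == "root":
            problems.append("container runs as root")
        if container["base_image"] not in policy["approved_base_images"]:
            problems.append("unapproved base image")
        return problems

    issues = policy_violations(container, POLICY)
    print("Policy check:", "pass" if not issues else f"fail ({', '.join(issues)})")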
Out-of-band security scans. Certain security scans must occur independently of any pipeline promotion. Three examples are illustrative. Dynamic scans (ACAS) may detect vulnerabilities only in running environments, such as those that appear with the passage of time (e.g., expired certificate chains). Daily SCA on container base images sometimes finds new security vulnerabilities, which should trigger creation of a new, patched base image. And regular, automated reviews of the National Institute of Standards and Technology (NIST) National Vulnerability Database (NVD) will reveal newly disclosed CVEs in pipeline tooling, which should motivate the organization to update those tools.
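The first example - a check that only makes sense against a running system - might look like the sketch below, which reports how many days remain on a service’s TLS certificate; the host name is a placeholder.

    # Sketch: an out-of-band check of a live endpoint's certificate expiry.
    import socket, ssl, time

    HOST, PORT = "service.example.internal", 443  # hypothetical endpoint

    context = ssl.create_default_context()
    with socket.create_connection((HOST, PORT), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=HOST) as tls:
            cert = tls.getpeercert()

    days_left = int((ssl.cert_time_to_seconds(cert["notAfter"]) - time.time()) // 86400)
    print(f"Certificate for {HOST} expires in {days_left} days.")
    if days_left < 30:
        print("Renew and redeploy before expiry.")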
Re-pipelining as needed. When problems are found with a package after it has been accepted into a given pipeline stage (such as when out-of-band scans discover a new base image vulnerability), the related packages need to be updated and sent back through the full pipeline. This assures the updated package still meets mandatory functional and security requirements, since every gate is retested in each stage (e.g., SCA scans in integration and regression testing in QA).
Auto-remediation. Once patching needs are discovered, some patches may be retrieved automatically and the associated microservices re-pipelined to confirm the patched packages work the same as before. For example, if an out-of-band scan of a base image discovers a new vulnerability, auto-remediation would pull the next tagged release of the problem layer, build a new base image, scan it to confirm the vulnerability is remediated, update PaC to enforce use of the new base image in the pipeline, branch and update all microservices that use the old base, and re-pipeline them. All packages that work fine with the new base image return to their original, pre-remediation pipeline stage, while those that “get stuck” at an earlier stage need closer scrutiny.
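Sketched in Python, that flow might read as follows; every function body is a placeholder for real registry, scanner, policy, and SCM operations.

    # Sketch: the auto-remediation flow, with each step stubbed out.
    def build_patched_base_image(old_base):   # pull the next tagged release and rebuild
        return old_base + "-patched"

    def scan_is_clean(image):                 # rerun the SCA scan on the new base
        return True

    def update_policy_to_require(image):      # adjust PaC to enforce the new base
        print(f"PaC now requires {image}")

    def microservices_using(old_base):        # query SCM and build metadata
        return ["billing-service", "auth-service"]

    def repipeline(service, new_base):        # branch, rebuild, and resubmit to the pipeline
        print(f"Re-pipelining {service} on {new_base}")

    def auto_remediate(old_base="base/python:3.12.1"):
        new_base = build_patched_base_image(old_base)
        if not scan_is_clean(new_base):
            raise RuntimeError("patched base image is still vulnerable")
        update_policy_to_require(new_base)
        for service in microservices_using(old_base):
            repipeline(service, new_base)

    auto_remediate()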
Source control management (SCM). Saving everything in SCM (GitHub, GitLab) - code, configurations, PaC, IAC, requirements, QA test results, deliverables – facilitates full traceability, supports resource management and client reporting for releases, and enables auditing, additional metrics, and inputs to process engineering improvements. There’s never a “snowflake” machine, deployable package, or important artifact, nor is there ever any question about the state of play for any artifact at any time.
Stakeholder Concerns
Each of the players involved in software development, testing, delivery, and use will wish to inject their own checks and balances into DevSecOps. These can add friction to the pipeline, with debatable benefits and downsides. Such debates can encourage discussions of risk mitigations and process alternatives that improve the pipeline and build greater cooperation among teams.
Customers with extreme risk aversion, such as those in air traffic safety and utilities, may have processes in place to prevent production code from changing frequently. These may include air-gapped systems, independent review and scanning prior to deployment, and official accreditation procedures to certify an authorization to operate (ATO). Those waterfall controls may disallow a DevSecOps pipeline that pushes code straight into production, due in part to perceived supply-chain risk. One way to address this is a continuous ATO process. For example, the pipeline system and its traceable, auditable processes may themselves be partially “ATO”-ed, such that any deliverable passing through the pipeline is “fast-track authorized” (subject to a smaller set of post-pipeline reviews). Another way to address this concern is greater traceability and transparency from requirements through to the deliverables. Pipeline documentation, explanation, and demonstration may also help reduce customers’ discomfort with faster delivery into production. It may also help to highlight the benefits of quickly deploying enhancement requests and remediating newly discovered, critical security vulnerabilities.
QA involves multiple types of testing, including unit testing (run by developers locally, and perhaps also in the integration environment), smoke testing (checking base functionality and “happy paths”), full regression testing (exercising “all” paths, including edge and negative cases), load testing (possibly long-running), fuzz testing (randomly or pseudo-randomly varying input parameters to try to “break” the code), and more. Test suites tend to grow over time, often starting “too small” and later growing “too large” for quick deployments. Operationally defining “good enough” can reduce debate. Blue/green deployments of newer microservices to production can reduce the risk of a big-bang, full cut-over from old to new code: an initially small percentage of “live” traffic is directed to the new services while operators closely watch system metrics for signs of trouble, and most traffic continues to flow through the older services - until the new services prove themselves worthy.
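For illustration, the weighted routing at the heart of such a rollout can be sketched in a few lines; the service names and the five-percent starting share are arbitrary choices, and real deployments would shift traffic at the load balancer or service mesh rather than in application code.

    # Sketch: send a small, adjustable share of live traffic to the new
    # service version while most traffic stays on the old one.
    import random
    from collections import Counter

    NEW_SERVICE_SHARE = 0.05  # start small; raise it as the new version proves itself

    def route(request_id: str) -> str:
        # request_id could seed sticky routing; ignored here for brevity.
        return "checkout-v2" if random.random() < NEW_SERVICE_SHARE else "checkout-v1"

    print(Counter(route(str(i)) for i in range(10_000)))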
The security community continuously finds new supply-chain risks. In 2020, NIST posted 18,915 new CVEs to the NVD for existing software and libraries. These discoveries require frequent re-pipelining of code to patch the vulnerable dependencies. To the extent code is tightly coupled to fast-changing dependencies, these security updates reduce production capacity, since they can consume development cycles to patch and QA cycles to retest.
Developers like to use the latest technologies and third-party software. Processes to vet potential tooling before it’s added to the corporate third-party repository cache may slow development velocity and “stifle innovation.” Providing guardrail heuristics can help assure developers experiment only with reputable third-party libraries. Sample heuristics include avoiding code from risky sources, using code only from well-known open-source communities, starting with release-candidate or long-term-support versions rather than beta releases, and checking code hashes against databases of known malicious software.
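The last heuristic is easy to automate; the sketch below hashes a downloaded archive and rejects it if the hash appears on a denylist. The denylist entry is a placeholder; real entries would come from a threat-intelligence feed.

    # Sketch: reject third-party archives whose hashes match known-malicious
    # software before they reach the corporate cache.
    import hashlib, pathlib

    KNOWN_MALICIOUS_SHA256 = {
        "0" * 64,  # placeholder entry; real values come from a threat-intelligence feed
    }

    def sha256_of(path: str) -> str:
        return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

    def is_safe_to_cache(path: str) -> bool:
        return sha256_of(path) not in KNOWN_MALICIOUS_SHA256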
Developer-visible pipeline processes, such as those allowing developers to accept (or “up-vote”) code that static scans identify as risky, facilitate development speed but risk promoting vulnerable code. For example, a developer may misunderstand what checks are done upstream of their service to validate and sanitize its input. Careful and rigorous process development - such as code reviews and clear communication of the architecture’s system boundaries and edge services - can mitigate this risk.
Legal review of licenses and third-party attribution can also put a damper on fast-moving development teams. But releasing a product with licensing problems opens the company to liability and substantial financial loss, and using software sourced from disreputable locations or poorly supported vendors adds supply-chain risk. Communicating legal guardrails internally can help provide a defense against poor third-party selections.
Release managers and product owners must balance customer needs with secure and functional deliverables. Developing quality software is constrained by the iron triangle - pick two: cost, time, or scope. To the extent product owners manage deliveries and timing with customers, it’s incumbent on them to maintain full transparency. Standards for “good enough” security, testing, and functionality should be agreed upfront to avoid relationship-straining debates and contractual disputes later, and controls must be in place to ensure those standards are met.
Improving Balance
We’ve discussed how DevSecOps engineering best practices can enable safe, repeatable, scalable, and efficient deliveries, and how different stakeholders may wish to influence the pipeline and related processes. Strategy and mutual agreements between provider and customer on the topics below will provide a foundation for these discussions.
Review the threat model. Several questions apply, including: what kinds of threats is the system defending against? For example, suppose the customer and provider are defending life-critical data delivery against expected attacks from well-funded and highly motivated threat actors, in environments that require high confidentiality, integrity, and availability, while operating air-gapped or geographically distributed. In this scenario, the threshold for “acceptable security risk” may be very low, and the need for rigorous controls correspondingly high. Once the threat model is understood, case studies of breached systems facing similar threat models will illuminate some of the edges that need defending - for example, targeted supply-chain attacks (SolarWinds) and chained zero-day attacks against air-gapped systems (Stuxnet). Sample lessons learned from these cases include the importance of testing patches in detonation sandboxes (made easier by our use of containers and IAC) and controlling information security to avoid leaking intelligence to potential attackers.
Thresholds for functional and security testing must be operationally defined and enforced according to the rigor demanded by the threat model. For example, a “critical” security problem may be defined as a vulnerability that would allow remote threat actors to launch an unauthenticated exploit against a system using a publicly posted proof of concept (POC) to achieve full remote access to the affected host (Log4Shell). Or a lower bar for “critical” may be defined as any vulnerability that allows remote, unauthenticated exploits. The definition, detection methods, and response expectations for each level of criticality should be agreed. Perhaps critical problems must be addressed automatically (i.e., by re-pipelining the affected code) within one day. Where automated response isn’t possible, process controls and perhaps business process modeling systems may be employed. At the very least, the organization should use simple, meaningful, and actionable metrics to help monitor response and motivate improvements. The goal is to handle problems according to their severity and service level agreements (SLAs). The provider’s challenge then shifts from playing whack-a-mole with specific problems to one of process and pipeline tuning.
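One way to keep such definitions operational is to encode them alongside their SLAs, so detection tooling and response processes share a single vocabulary. The criteria and timelines below are illustrative, not contractual.

    # Sketch: operational severity definitions mapped to response SLAs.
    from dataclasses import dataclass

    @dataclass
    class SeverityLevel:
        name: str
        definition: str
        response_sla_hours: int

    SEVERITIES = [
        SeverityLevel("critical",
                      "remote, unauthenticated exploit with public POC",
                      24),        # e.g., re-pipeline affected code within one day
        SeverityLevel("high",
                      "remote, authenticated exploit or local privilege escalation",
                      72),
        SeverityLevel("moderate",
                      "requires unusual configuration or user interaction",
                      24 * 14),
    ]

    for level in SEVERITIES:
        print(f"{level.name}: respond within {level.response_sla_hours} hours")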
Traceability and transparency are important to show which deliveries fulfill which requirements, with independent confirmation by automated tooling and recorded test reports pinned to specific versions of packages, including all related code and dependencies. The ability to generate “completeness check” reports on demand, based on the current state of deliverables in any environment, helps to prioritize current provider efforts and enhance communications between provider and customer. Mutual agreement on security controls can start with industry-recognized standards (NIST 800-37, 800-171, 800-53, OWASP Top 10, et al.), selecting and applying controls appropriate to the threat model.
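A completeness check can be as simple as joining requirements to the package versions and recorded test evidence that satisfy them; in the sketch below the data is inlined, whereas in practice it would be pulled from SCM and pipeline records. The requirement IDs and package names are hypothetical.

    # Sketch: an on-demand "completeness check" report linking requirements to
    # delivered packages and their recorded test results.
    requirements = {"REQ-101": "export monthly billing report",
                    "REQ-102": "encrypt data at rest"}

    evidence = {
        "REQ-101": {"package": "billing-service:1.4.2", "tests": "qa-run-2047: pass"},
        # REQ-102 has no recorded evidence yet
    }

    for req_id, text in requirements.items():
        record = evidence.get(req_id)
        status = (f"covered by {record['package']} ({record['tests']})"
                  if record else "NO EVIDENCE RECORDED")
        print(f"{req_id} - {text}: {status}")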
These agreements serve as the foundation upon which DevSecOps engineers can prioritize pipeline improvements and systems engineers and subject matter experts can offer process improvements that measurably improve delivery capacity, efficiency, and safety in the service of meeting customer needs.