End-to-End Patterns for AWS Enterprise Cloud CI/CD

In the AWS ecosystem, the choice of tools for storing source code, running tests, and producing software packages hinges on the cost of ownership and the ability to easily and securely integrate with AWS as well as other specific organizational requirements.

As an executive or an architect, the goal is to achieve end-to-end integration and produce a secure, scalable, highly available, and functional workload. As an engineer, understanding what options are available to solve a particular problem is of value as well.

Another consideration is team size and whether it is a multi-team collaboration or a single team that handles the entire deployment, including various types of testing.

While smaller teams may opt for one set of tools, larger and more modular teams will likely require a different selection. Given that DataArt typically interacts with modular teams, this article will focus on the toolsets most appropriate for these types of teams.

When organizations initially migrate to the cloud, they often carry over tools that have proven effective in their on-premises data centers. However, these tools may not translate well to a cloud environment. Therefore, tooling should be reevaluated for its suitability in the new cloud-based context.

Storing, managing, building, testing, and deploying code should be an iterative, multi-discipline process. The DevOps team must be in alignment with the rest of the organization and should work with the Application Development team(s) in harmonizing tool selection.

Note: This article includes references to specific third-party, usually open-source, solutions. These are examples, not endorsements.

Tool Choices and Basic Pipeline

AWS's Code* suite is optimized for fully cloud-native solutions within AWS. However, for enterprises operating in multi-cloud or hybrid-cloud environments, the decision-making becomes more complex.

Setting up a self-hosted Git repository is a two-step process: first, install Git (e.g., apt install git), and then set it up as a service. Graphical User Interfaces (GUIs) are especially useful for the day-to-day management of Git repositories. When choosing your GUI, prioritize features like Pull Requests and Merging. Options such as GitHub, Bitbucket, Gitea, and GitLab should be evaluated based on organizational needs and cost-effectiveness.

In modern enterprise environments, there are inevitably dependencies on open-source libraries. How will you store and standardize those dependencies and validate their security? In other words, how will you manage your software supply chain? Artifact repositories like JFrog, Nexus, and AWS CodeArtifact should be considered.

For pipeline tools, a similar decision-making process applies. Managed solutions such as AWS CodeBuild, GitHub Actions, Harness, or CloudBees (the commercial service provider behind Jenkins) are hosted by a provider, whereas Jenkins itself is typically self-hosted. Medium-sized organizations should anticipate dedicating at least 50% of one staff member's time to managing self-hosted pipelines such as Jenkins. Many opt for managed solutions to minimize administrative overhead, allowing the development team to focus on deliverables rather than system maintenance.

The choice of a pipeline tool will depend on internal expertise, willingness to support the infrastructure associated with the pipeline, and the typical types of deployed code.

It is paramount to answer the following questions:

  • What type of environment will the pipeline support (Cloud-native? On-premises? Multi-cloud? Hybrid-cloud? All of the above?)
  • How much personnel time can you allocate to managing the pipeline ecosystem?
  • What security risks are you willing to assume with the pipeline itself? How do you evaluate open-source plugins and the risks they pose?
  • Can the pipeline support multiple teams working on it concurrently, each with specific responsibilities?
  • Does the tool enable software supply chain management effectively?
  • Does it adequately support the organization's regulatory and compliance requirements?
  • Can it deploy workloads in a timely manner?

When selecting the pipeline, consider that multiple teams, including security and various QA teams, will be working on the same pipeline. Can the work on the pipeline itself be accomplished with little or no impact on the development team?

Following security and performance scans, a finding of any sort is impactful, but a finding, like any other feature or bug, needs to be prioritized in alignment with business requirements.

If a security flaw is detected, this information should be channeled into a Cloud Security Posture Management (CSPM) tool, such as AWS Security Hub, so that a holistic review can be done. Subsequently, the issue should be pushed into the developers' task management tool.
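As a rough sketch of that hand-off, and only as an illustration, a pipeline step could forward a scan result to Security Hub using the AWS Security Finding Format; the account ID, finding ID, and resource name below are placeholders, not values from any real pipeline.

    import boto3

    # Minimal sketch: forward a single pipeline scan finding into AWS Security Hub.
    # Account ID, region, and identifiers are placeholders.
    securityhub = boto3.client("securityhub", region_name="us-east-1")

    finding = {
        "SchemaVersion": "2018-10-08",
        "Id": "pipeline-scan/example-finding-001",
        "ProductArn": "arn:aws:securityhub:us-east-1:111122223333:product/111122223333/default",
        "GeneratorId": "ci-static-scanner",
        "AwsAccountId": "111122223333",
        "Types": ["Software and Configuration Checks/Vulnerabilities"],
        "CreatedAt": "2024-01-01T00:00:00Z",
        "UpdatedAt": "2024-01-01T00:00:00Z",
        "Severity": {"Label": "HIGH"},
        "Title": "Vulnerable dependency detected during CI build",
        "Description": "Static scan flagged a dependency with a known CVE.",
        "Resources": [{"Type": "Other", "Id": "git-repo/example-service"}],
    }

    response = securityhub.batch_import_findings(Findings=[finding])
    print(response["SuccessCount"], "finding(s) imported")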

As we go through various phases of storage, build and deploy, there will be several cases where we scan for roughly the same kinds of flaws at different parts of the pipeline and even after deployment. This is intentional. Each type of storage repository and its code objects require different types of scanners to ensure no new issues have emerged during the build process. Nothing is static. After an instance or a container goes live, threats can emerge through no fault of the development team. Prompt remediation is essential to prevent exploitation by malicious actors.

A basic pipeline that starts with a Git repository touches on the fundamental elements: Git stores and manages the code; the Continuous Integration (CI) process pulls in dependencies from an artifact repository and builds the code, in this case a container image; a Continuous Deployment (CD) pipeline then deploys the workload, followed by a final set of pipelines that test the deployment.
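As a rough illustration only, the stages above could be sketched as follows; in a real pipeline each stage runs as a separate job in your CI/CD tool, and the repository URL, image name, and registry here are hypothetical.

    import subprocess

    # Minimal sketch of the pipeline stages described above, condensed into one
    # script for clarity. Repository URL, image name, and registry are placeholders.
    REPO = "https://git.example.com/acme/example-service.git"
    IMAGE = "registry.example.com/acme/example-service:latest"

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    run(["git", "clone", REPO, "workdir"])            # 1. Pull source from Git
    run(["docker", "build", "-t", IMAGE, "workdir"])  # 2. CI build; dependencies resolve from the
                                                      #    internal artifact repository via the
                                                      #    project's package manager configuration
    run(["docker", "push", IMAGE])                    # 3. Publish the container image
    # 4. A separate CD pipeline deploys IMAGE, and further pipelines test the deployment.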

Unit Testing

Once the basic pipeline is enabled, unit tests are ideally performed on the code.
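A minimal sketch of such a unit-test stage, assuming a Python codebase and pytest as the test runner (substitute whatever runner your project already uses):

    import subprocess
    import sys

    # Run the project's unit tests and fail the build on the first failure.
    result = subprocess.run(["pytest", "--maxfail=1", "-q"], cwd="workdir")
    if result.returncode != 0:
        print("Unit tests failed; stopping the pipeline before the build stage.")
        sys.exit(result.returncode)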

Securing Pipelines

User Authentication

Controlling access to the pipeline is essential. User authentication to any component is ultimately tied to an LDAP/Active Directory implementation, although it may present itself via tools like AWS IAM Identity Center, Ping, or Okta. How pipeline users are authenticated is not a decision typically in the pipeline manager's purview, yet interoperating with those company standards is essential.

Service Authentication

Can your organization's service credentials be rotated regularly, in an automatic fashion? Within AWS this task is straightforward, involving Secrets Manager and service IAM roles.

Within AWS, IAM users are not required; automatic, dynamic credentials can be supplied by IAM service roles. The best practice is to minimize the use of IAM users.
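As a minimal sketch, assuming the workload runs under an IAM service role (EC2 instance profile, ECS task role, or Lambda execution role) and stores its third-party secrets in Secrets Manager; the secret name and JSON field are placeholders:

    import json
    import boto3

    # boto3 picks up temporary AWS credentials from the service's IAM role
    # automatically; no access keys are stored anywhere. Third-party credentials
    # live in Secrets Manager, where rotation can be automated.
    secrets = boto3.client("secretsmanager")

    secret_value = secrets.get_secret_value(SecretId="ci/example-service/database")
    db_password = json.loads(secret_value["SecretString"])["password"]
    # Use db_password to build the connection; never bake it into the image or the repo.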

Outside of AWS, services requiring AWS permissions are going to be a challenge. For hybrid solutions requiring on-premises services to access AWS services, IAM Roles Anywhere is a possible solution.

For external SaaS services, the ability to rotate IAM user credentials will depend on several factors, including whether your service provider supports credential rotation via API. If not, ask for it. You'll need to comply with your corporate rotation policy, but if you can automate, do your best to exceed it; corporate policies were not, as a rule, designed for modern, non-static cloud systems.
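A rough sketch of what automated rotation might look like for an IAM user consumed by an external SaaS tool; the user name is a placeholder, and update_saas_credentials() is a hypothetical stand-in for whatever API your provider exposes:

    import boto3

    iam = boto3.client("iam")
    USER = "svc-external-scanner"  # placeholder IAM user consumed by the SaaS tool

    def update_saas_credentials(access_key_id, secret_access_key):
        # Hypothetical placeholder: call your SaaS provider's credential-update API here.
        raise NotImplementedError("wire this to your provider's API")

    old_keys = iam.list_access_keys(UserName=USER)["AccessKeyMetadata"]
    new_key = iam.create_access_key(UserName=USER)["AccessKey"]

    update_saas_credentials(new_key["AccessKeyId"], new_key["SecretAccessKey"])

    # Retire the previous key(s) only after the SaaS side has been updated.
    for key in old_keys:
        iam.delete_access_key(UserName=USER, AccessKeyId=key["AccessKeyId"])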

Infrastructure Security

When using the AWS Code* services for your pipelines, you inherit AWS IAM roles that help you secure those pipelines. However, it is the administrators' responsibility to configure the associated IAM policies to align with the principle of least privilege.

For non-AWS pipelines hosted in AWS, security becomes more complex. We generally recommend setting up a Deployment account. Access to this account will be driven by Service Control Policies. Within this account, you'll need to configure instances (or containers) for least-privileged access to only the resources they need, for example, specific accounts and specific services within those accounts. If you use Terraform, be sure to restrict access to the S3 buckets used to store state files; they contain sensitive data about the applied infrastructure.
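As one hedged example of that least-privilege scoping, an inline policy could restrict the deployment role to a single Terraform state bucket; the bucket and role names are placeholders:

    import json
    import boto3

    iam = boto3.client("iam")

    # Allow the deployment role to read and write only the Terraform state bucket.
    state_bucket_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
                "Resource": [
                    "arn:aws:s3:::example-terraform-state",
                    "arn:aws:s3:::example-terraform-state/*",
                ],
            }
        ],
    }

    iam.put_role_policy(
        RoleName="deployment-pipeline-role",
        PolicyName="terraform-state-access",
        PolicyDocument=json.dumps(state_bucket_policy),
    )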

Pipeline Security

Ensure that your pipeline is up to date with the latest version.

If the pipeline supports plugins, ensure that you've selected those that receive regular updates. Do you have infrastructure in place to actively scan self-hosted pipeline hosts, and the pipelines and their plugins for vulnerabilities? Are these vulnerabilities sent to your CSPM?

Users and services should operate under the principle of least privilege at every stage of the CI/CD process.

Authorization Policies

The growing trend is the use of data authorization policies. Solutions include Open Policy Agent (OPA), an open-source tool; its commercial implementation from Styra; and Cedar, an open-source policy language from AWS. The combination of OPA or Cedar with AWS IAM is a natural fit for cloud-native solutions.
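As a minimal sketch, assuming an OPA server running locally beside the pipeline agent and a policy package named pipeline.deploy (both assumptions), a pipeline step could ask OPA for a deployment decision via its data API:

    import json
    import urllib.request

    # Ask OPA whether this deployment is allowed; the input document's shape is
    # defined by your own policies, so the fields here are illustrative.
    decision_input = {"input": {"user": "ci-bot", "action": "deploy", "environment": "prod"}}

    request = urllib.request.Request(
        "http://localhost:8181/v1/data/pipeline/deploy/allow",
        data=json.dumps(decision_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        decision = json.load(response)

    if not decision.get("result", False):
        raise SystemExit("OPA denied this deployment")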

Time Management and Static Scans

Speed is key when designing a pipeline. Trying to do everything in a single build at every execution of a pipeline is an anti-pattern. Instead, by strategically building only what is necessary, you can improve efficiency and increase productivity. Various strategies can be employed to improve build times:

  • Building pipelines for Golden Amazon Machine Images (AMIs) for supported Operating Systems allows developers to focus on just the changes to the Operating System needed to support their workloads, generally for legacy workloads. An AMI, once generated, can be analyzed by Amazon Inspector. This process is generally executed on a monthly basis across the organization; thus a process that might take one to three hours is front-loaded into one monthly job and run during off-hours.
  • Scanning source code asynchronously after changes have been pushed to repositories like Git allows for the identification of technical debt without delaying deployment. Static scans may be run only on repositories that experience changes and during off-hours.
  • Scanning dependent objects downloaded from the Internet and stored in an artifact repository, before they are integrated into a developer's project, saves time and enforces the use of only approved versions across the organization. Identified critical issues should be prioritized.
  • Running linting tools to validate code formatting standards can be scheduled for after-hours after code changes have been added to a repository.

Running these tests each time a workload solution is built would take hours, leading to prolonged wait times for developers, and is anti-agile.

These "asynchronous" static pipelines are essential to a well-functioning, well-governed AWS implementation, providing feedback and flagging potential issues, without delaying the developer's workflow.

Artifact Storage

Artifact Storage is used for managing your software supply chain. Key functions include:

  • Maintaining standard library versions.
  • Centralizing artifacts for security scans.
  • Orchestrating controlled artifact upgrades.
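As one hedged example of centralizing dependencies, a build step could authenticate against AWS CodeArtifact and point the package manager at the approved repository; the domain, owner account, and repository names are placeholders, and equivalent flows exist for JFrog or Nexus:

    import boto3

    codeartifact = boto3.client("codeartifact")

    token = codeartifact.get_authorization_token(
        domain="example-domain", domainOwner="111122223333"
    )["authorizationToken"]

    endpoint = codeartifact.get_repository_endpoint(
        domain="example-domain",
        domainOwner="111122223333",
        repository="approved-python-packages",
        format="pypi",
    )["repositoryEndpoint"]

    # The build step would then point pip (or npm, Maven, etc.) at this endpoint,
    # e.g. via PIP_INDEX_URL, so every dependency is a scanned, approved version.
    pip_index_url = f"https://aws:{token}@{endpoint.removeprefix('https://')}simple/"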

For organizations that have compliance requirements, does the artifact repository have automated reporting to meet those needs?

Most artifact repositories double as container registries. The decision to consolidate or separate these repositories often involves input from the Information Security (InfoSec) team.

Container Registry

The use of public registries is fraught with risks, including potential malicious code and dubious configurations. As a safety measure, organizations typically:

  • Rebuild containers from source scripts.
  • Or, scan publicly sourced containers in a container registry to ensure compliance with organizational security requirements.

When building a container, the components should ideally come from Git and the artifact repository, with the combined result stored in the container registry. The container is later rescanned with specialized tools to ensure its security.
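A minimal sketch of that rescan step, using Amazon ECR's built-in scanning as one example; the repository and tag are placeholders, and a dedicated container scanner could be wired in the same way:

    import boto3

    ecr = boto3.client("ecr")

    # Kick off a scan of the freshly pushed image.
    ecr.start_image_scan(
        repositoryName="example-service",
        imageId={"imageTag": "latest"},
    )

    # In practice, poll imageScanStatus until the scan completes before reading results.
    findings = ecr.describe_image_scan_findings(
        repositoryName="example-service",
        imageId={"imageTag": "latest"},
    )
    counts = findings["imageScanFindings"].get("findingSeverityCounts", {})
    if counts.get("CRITICAL", 0) > 0:
        raise SystemExit("Critical vulnerabilities found; do not promote this image.")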

Deployment

Previously, while the IaC code was in Git, we ran it through linters and various scanners to find security defects. The deployment of IaC is where the rubber meets the road.

Key questions include:

  • Does the IaC work as intended?
  • Is the IaC designed to work seamlessly across multiple environments?
  • Does it support high availability and data security?

Testing Basics

The testing regimen will usually cover:

  • Functional Testing
  • Performance Testing
  • Security Testing
  • Chaos Engineering Testing

Functional Testing

While it may not always fall under the DevOps umbrella, functional testing and the code that executes it must be integrated into a pipeline that serves the workload. This type of testing may cover the graphical user interface (GUI) for web and mobile as well as programmatic API testing.

Performance Testing

Performance Testing usually falls under DevOps and addresses questions such as scalability, computational efficiency, performance of data repositories, and cost-effectiveness.

Under-performance in any of these metrics necessitates adjustments.
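As one hedged example, a minimal load-test definition using Locust (one of many possible tools); the endpoint paths are placeholders:

    from locust import HttpUser, task, between

    # Run with: locust -f loadtest.py --host https://your-test-endpoint
    class ApiUser(HttpUser):
        wait_time = between(1, 3)  # simulated think time between requests

        @task
        def list_items(self):
            self.client.get("/api/items")

        @task
        def health_check(self):
            self.client.get("/healthz")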

Security Testing

Security testing in mature organizations is typically overseen by the DevSecOps team. In less mature organizations it will likely be the responsibility of a DevOps team or a combined DevOps/Security team. As mentioned earlier, some of this testing can be done asynchronously, prior to the deployment, and after the deployment.

Infrastructure penetration testing, which actively scans for security flaws, usually at network endpoints, is the exception to this. In theory, static tests will catch all vulnerabilities; penetration testing therefore validates all prior checks, ensures that "Click Ops" (using the console to drive infrastructure changes) did not cause a regression, and confirms that application code is built and configured correctly.

Penetration testing is not limited to infrastructure components. Application flaws can also be discovered, for example, Cross-Site Scripting vulnerabilities, insecure JVM configurations, API endpoint vulnerabilities, and so on.

Chaos Engineering Testing

For mature firms, this step involves testing whether their high-availability setup actually holds up under stress.

Chaos Testing purposely disables key features to test workload resiliency. Advanced Chaos Testing is done on production environments.
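As a sketch, assuming the experiment (for example, terminating instances in one Availability Zone) has already been defined as an AWS Fault Injection Service template, a pipeline step could start it; the template ID is a placeholder:

    import uuid
    import boto3

    fis = boto3.client("fis")

    experiment = fis.start_experiment(
        clientToken=str(uuid.uuid4()),               # idempotency token
        experimentTemplateId="EXT1234567890abcdef",  # placeholder template ID
        tags={"pipeline": "resilience-tests"},
    )
    print("Started chaos experiment:", experiment["experiment"]["id"])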

Pipelines and Disaster Recovery

For Disaster Recovery (DR), one of the biggest issues is whether the IaC code is sufficiently parameterized.

We recommend a multi-region pattern, with the development environment deployed in the planned DR Region, to help answer that question.

When it's time to test your disaster recovery (DR) environment, there is a fair chance of it working smoothly, because images and IaC deployments are already running, in some state, in that Region as part of the development environment.

This approach also ensures the proper networking infrastructure is in place.
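As a minimal sketch of exercising that parameterization, the same template could be pushed to both the primary and the DR Region; the stack name, template location, regions, and parameters are placeholders:

    import boto3

    TEMPLATE_URL = "https://example-bucket.s3.amazonaws.com/workload.yaml"  # placeholder

    for region in ["us-east-1", "us-west-2"]:  # primary and DR Regions
        cfn = boto3.client("cloudformation", region_name=region)
        cfn.create_stack(
            StackName="example-workload",
            TemplateURL=TEMPLATE_URL,
            Parameters=[{"ParameterKey": "Environment", "ParameterValue": "dr-test"}],
            Capabilities=["CAPABILITY_NAMED_IAM"],
        )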

One last point on disaster recovery: when using third-party services, ensure that they are not using the same AWS Regions as you are. If they are, make sure they have a high-availability multi-region deployment. Here's why it matters: if your workloads are impacted by a Regional outage, your third-party services may be impacted too, impeding or eliminating your ability to deploy to the DR region.

Conclusions & Summary

Building end-to-end pipelines in AWS is a non-trivial, multi-discipline exercise. It's an investment aimed at achieving a future where fewer human resources are needed for the ongoing management of a workload. The effective implementation of pipelines and IaC ensures a closer alignment with the AWS Well-Architected Framework, setting a foundation for compliance.

As you migrate to AWS, your entire toolchain should be scrutinized. Are the tools you are currently using appropriate for AWS? Deciding to abandon tools in which there is substantial institutional investment can be challenging, as is deciding to invest in new tools.

The buildout involves Cloud Architects, DevOps, DevSecOps, Application Architects, Application Developers, Quality Assurance Engineers, FinOps, or at least the outputs of the FinOps practice, and most importantly the endorsement and support of leadership.

A pipeline allows the capture, in code, of a substantial portion of a team's functional knowledge about a workload as it is being actively developed. It proves advantageous when adaptations are needed to meet changing business objectives, regulatory and compliance needs, or to address arising security challenges, reducing the manpower needed for such adjustments.

Still, the pursuit of the end-to-end vision must be approached incrementally to yield effective results.

Originally published here.
