Crossplane Custom Resources: Measure and Report on the Health of Your Cloud Resources
Shea Stewart
Technologist working in Platform & Customer Engineering Capacities @ RunWhen
We are big fans of the Crossplane solution (open sourced by Upbound). It can define and continuously reconcile cloud infrastructure as data. The solution provides Kubernetes controllers and custom resource definitions that become the control plane for an organization's cloud infrastructure.
This use case (also documented at length here) demonstrates how to report on the health of Crossplane resources and triage resource issues in the following ways:
- Obtaining a Service Level Indicator metric in one of the following ways:
  - Using the k8s-kubectl-get codebundle to count the number of unhealthy resources managed by Crossplane as an SLI metric
  - Implementing Custom Resource State Metrics for kube-state-metrics and using the gcp-opssuite-promql codebundle to produce an SLI metric. Note: Other Prometheus codebundles can be used as well, such as prometheus-queryinstant-transform or sysdig-monitor-promqlmetric
- Configuring a Service Level Objective to generate alerts for unhealthy objects
- Obtaining a report of unhealthy resources using the k8s-kubectl-run codebundle
The above capabilities are demonstrated in conjunction with the following RunWhen features:
Let's get started!
Inspecting Crossplane Custom Resources
There are a number of providers available for the Crossplane solution. In our environment we tend to rely on three primary providers:
- GCP
- Kubernetes
- Helm
Each of these providers comes from the open source community in the crossplane-contrib repo. Let's review an object from each provider, as there may be differences in how the resource states are reported.
Example GCP Crossplane Resource
This provider allows Crossplane to create and manage GCP resources. Let's look at the custom resources available with this provider:
$ kubectl get crd | grep gcp.crossplane.io
addresses.compute.gcp.crossplane.io                         2022-05-24T15:34:45Z
bucketpolicies.storage.gcp.crossplane.io                    2022-05-24T15:34:49Z
bucketpolicymembers.storage.gcp.crossplane.io               2022-05-24T15:34:50Z
buckets.storage.gcp.crossplane.io                           2022-05-24T15:34:46Z
cloudmemorystoreinstances.cache.gcp.crossplane.io           2022-05-24T15:34:49Z
cloudsqlinstances.database.gcp.crossplane.io                2022-05-24T15:34:45Z
clusters.container.gcp.crossplane.io                        2022-05-24T15:34:49Z
connections.servicenetworking.gcp.crossplane.io             2022-05-24T15:34:48Z
containerregistries.registry.gcp.crossplane.io              2022-05-24T15:34:46Z
cryptokeypolicies.kms.gcp.crossplane.io                     2022-05-24T15:34:46Z
cryptokeys.kms.gcp.crossplane.io                            2022-05-24T15:34:49Z
firewalls.compute.gcp.crossplane.io                         2022-05-24T15:34:47Z
globaladdresses.compute.gcp.crossplane.io                   2022-05-24T15:34:50Z
keyrings.kms.gcp.crossplane.io                              2022-05-24T15:34:49Z
networks.compute.gcp.crossplane.io                          2022-05-24T15:34:45Z
nodepools.container.gcp.crossplane.io                       2022-05-24T15:34:47Z
providerconfigs.gcp.crossplane.io                           2022-05-24T15:34:48Z
providerconfigusages.gcp.crossplane.io                      2022-05-24T15:34:46Z
providers.gcp.crossplane.io                                 2022-05-24T15:34:46Z
resourcerecordsets.dns.gcp.crossplane.io                    2022-05-24T15:34:48Z
routers.compute.gcp.crossplane.io                           2022-05-24T15:34:50Z
serviceaccountkeys.iam.gcp.crossplane.io                    2022-05-24T15:34:48Z
serviceaccountpolicies.iam.gcp.crossplane.io                2022-05-24T15:34:47Z
serviceaccounts.iam.gcp.crossplane.io                       2022-05-24T15:34:48Z
subnetworks.compute.gcp.crossplane.io                       2022-05-24T15:34:47Z
subscriptions.pubsub.gcp.crossplane.io                      2022-05-24T15:34:46Z
topics.pubsub.gcp.crossplane.io                             2022-05-24T15:34:47Z
Digging deeper, we will look at the status section of our GKE clusters:
$ kubectl describe cluster [cluster-name]
...
...
Status:
  At Provider:
    ... [GKE Cluster Details Omitted]
  Conditions:
    Last Transition Time:  2022-06-21T14:02:29Z
    Reason:                Available
    Status:                True
    Type:                  Ready
    Last Transition Time:  2022-07-14T10:09:12Z
    Reason:                ReconcileSuccess
    Status:                True
    Type:                  Synced
Based on the output above, we are primarily interested in the following status fields when reporting on our GCP custom resources:
Conditions:
- Type: Synced
- Status == True and,
- Type: Ready
- Status == True
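The check described above can be sketched in Python. This is a minimal illustration rather than part of any codebundle: the function name `is_healthy` and the sample dictionaries are hypothetical, and the resource is assumed to be the parsed JSON form of a Crossplane object (for example, from `kubectl get ... -o json`).

```python
def is_healthy(resource: dict) -> bool:
    """Return True only when both the Ready and Synced conditions report "True"."""
    conditions = resource.get("status", {}).get("conditions", [])
    # Map each condition type to its status string, e.g. {"Ready": "True"}.
    status_by_type = {c.get("type"): c.get("status") for c in conditions}
    # A missing condition is treated as unhealthy (an assumption of this sketch).
    return all(status_by_type.get(t) == "True" for t in ("Ready", "Synced"))


# Hypothetical sample resources, trimmed to the fields the check reads.
healthy = {"status": {"conditions": [
    {"type": "Ready", "status": "True", "reason": "Available"},
    {"type": "Synced", "status": "True", "reason": "ReconcileSuccess"},
]}}
degraded = {"status": {"conditions": [
    {"type": "Ready", "status": "False", "reason": "Unavailable"},
    {"type": "Synced", "status": "True", "reason": "ReconcileSuccess"},
]}}

print(is_healthy(healthy), is_healthy(degraded))
```

Treating a missing condition as unhealthy is a deliberate choice here, so that a resource that has never been reconciled is surfaced rather than ignored.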
Example Kubernetes Crossplane Resource
This provider allows Crossplane to configure and manage Kubernetes objects in remote clusters. Let's look at the custom resources available with this provider:
$ kubectl get crd | grep kubernetes.crossplane.io
objects.kubernetes.crossplane.io                            2022-05-24T15:41:47Z
providerconfigs.kubernetes.crossplane.io                    2022-05-24T15:41:47Z
providerconfigusages.kubernetes.crossplane.io               2022-05-24T15:41:47Z
And looking at the status section of a Kubernetes object:
$ kubectl describe object [kubernetes object name]
...
...
Status:
  At Provider:
    ... [Object configuration details omitted]
  Conditions:
    Last Transition Time:  2022-11-28T18:25:52Z
    Reason:                ReconcileSuccess
    Status:                True
    Type:                  Synced
    Last Transition Time:  2022-06-20T18:40:15Z
    Reason:                Available
    Status:                True
    Type:                  Ready
Events:                    <none>
Here we can see that we are interested in the same status fields as the GCP provider.
Example Helm Crossplane Resource
Let's look at the custom resources available with this provider:
$ kubectl get crd | grep helm.crossplane.io
providerconfigs.helm.crossplane.io                          2022-05-24T15:34:44Z
providerconfigusages.helm.crossplane.io                     2022-05-24T15:34:44Z
releases.helm.crossplane.io                                 2022-05-24T15:34:44Z
And looking deeper at the status section of a Helm release:
$ kubectl describe release [helm release]
...
...
Status:
  At Provider:
    Release Description:  Install complete
    Revision:             1
    State:                deployed
  Conditions:
    Last Transition Time:  2023-01-15T09:31:02Z
    Reason:                ReconcileSuccess
    Status:                True
    Type:                  Synced
    Last Transition Time:  2022-06-21T17:25:27Z
    Reason:                Available
    Status:                True
    Type:                  Ready
  Synced:                  true
Again we see that the resource status shares the same fields as our previous providers, meaning that reporting on these fields will help us measure overall health of our Crossplane resources (as reported by Crossplane).
Service Level Indicators
SLI Option #1: Using `kubectl get` to Fetch Crossplane Resource State
After reviewing the output above, we can create a Service Level Indicator that queries our cluster for the status of all Crossplane managed resources based on the Synced and Ready status fields, and counts the total number of objects that are NOT in a healthy state.
From the command line, with kubectl and jq, this might look like:
$ kubectl get clusters,objects,releases -o json | jq '[.items[] | select(.status.conditions[].status!="True")] | length '
The above command doesn't provide a whole lot of context. It simply looks for any key named status with a value that is not equal to True in the conditions of each object, and counts each occurrence. Additional filtering could be performed, but it gets a bit more complicated and ultimately doesn't change the outcome we want, which is a simple metric that represents any Crossplane object that needs investigation.
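Because the jq `select` emits an object once for every condition that is not True, an object whose Ready and Synced conditions are both failing contributes two to the count. If counting each object once is preferable, the logic could be sketched as follows in Python. This is only an illustration under stated assumptions: `count_unhealthy` is a hypothetical helper, and the input is assumed to be the JSON text produced by `kubectl get clusters,objects,releases -o json`.

```python
import json


def count_unhealthy(kubectl_json: str) -> int:
    """Count objects that have at least one condition whose status is not "True"."""
    items = json.loads(kubectl_json).get("items", [])
    return sum(
        1
        for item in items
        if any(
            cond.get("status") != "True"
            for cond in item.get("status", {}).get("conditions", [])
        )
    )


# Hypothetical sample standing in for the kubectl JSON output.
sample = json.dumps({"items": [
    {"metadata": {"name": "a"}, "status": {"conditions": [
        {"type": "Ready", "status": "True"},
        {"type": "Synced", "status": "True"}]}},
    {"metadata": {"name": "b"}, "status": {"conditions": [
        {"type": "Ready", "status": "False"},
        {"type": "Synced", "status": "False"}]}},
]})

print(count_unhealthy(sample))  # 1: object "b" is counted once, not twice
```

Either counting style works for an SLI whose healthy value is zero; the difference only matters when the number itself is read as "how many objects are broken".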
Configuring the `k8s-kubectl-get` codebundle
To configure this SLI in RunWhen, we will use the k8s-kubectl-get codebundle. This codebundle provides a little more flexibility with regard to producing a metric for unhealthy Crossplane resources.
- Search the catalogue for the k8s-kubectl-get codebundle
- Configure the codebundle with the following parameters:
DISTRIBUTION: Kubernetes
KUBECTL_COMMAND: kubectl get clusters,objects,releases
CALCULATION: Count
CALCULATION_FIELD: ''
SEARCH_FILTER: >-
  status.conditions[?(type==`Ready` && status!=`True`) || (type==`Synced` &&
  status!=`True`)]
- Finalize the secrets and runtime configuration and view the Service Level Indicator on the Map
As we can see from the above image, there are 2 Crossplane objects that are not considered healthy. Jump to the section below to configure a TaskSet that can send us a report with helpful details on these failures when they occur.
SLI Option #2: Using Prometheus to Fetch Crossplane Resource State
If you're heavily invested in Prometheus and using kube-state-metrics to monitor your platforms, then this option might be an attractive approach. In this use case, we configure kube-state-metrics to produce metrics from the Crossplane custom resources used in the previous example.
Note: This process was documented using kube-state-metrics v2.7. In that version, custom metrics show up (by default) as kube_crd_[metricname], which changes to kube_customresource_[metricname] in the 2.8 release. See these docs for more details.
Configuring the kube-state-metrics deployment
The kube-state-metrics deployment needs to be modified to load the custom resources configuration file (or the configuration can be pasted inline with the deployment). In our deployment, we add the 'extraArgs' value to the Helm chart that we use to deploy kube-state-metrics. This can also be added to the 'args' field of the deployment manifest.
extraArgs:
- --custom-resource-state-config-file
- /custom-resources/config.yaml
In addition to the custom resource configuration file specification, the collectors configuration needs to include the custom resources that we are collecting metrics from:
    collectors:
      ...
      ...
      - clusters
      - releases
      - objects
Configuring the kube-state-metrics configmap
A configuration stanza needs to be created for every custom resource that needs to be served up by kube-state-metrics. The following configuration outlines how this can be configured for the Crossplane resources we wish to collect health metrics from.
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-resource-config
  namespace: kube-system
data:
  config.yaml: |
    kind: CustomResourceStateMetrics
    spec:
      resources:
        - groupVersionKind:
            group: "container.gcp.crossplane.io"
            kind: "Cluster"
            version: "v1beta2"
          metrics:
            - name: "crossplane_cluster_ready"
              help: "Crossplane Cluster Ready"
              each:
                type: StateSet
                stateSet:
                  labelName: status
                  path: [status, conditions, "[type=Ready]", status]
                  list: [True, False]
              labelsFromPath:
                name: [metadata, name]
        - groupVersionKind:
            group: "kubernetes.crossplane.io"
            kind: "Object"
            version: "v1alpha1"
          metrics:
            - name: "crossplane_kubernetes_object_ready"
              help: "Crossplane Kubernetes Object Ready"
              each:
                type: StateSet
                stateSet:
                  labelName: status
                  path: [status, conditions, "[type=Ready]", status]
                  list: [True, False]
              labelsFromPath:
                name: [metadata, name]
        - groupVersionKind:
            group: "helm.crossplane.io"
            kind: "Release"
            version: "v1beta1"
          metrics:
            - name: "crossplane_helm_release_ready"
              help: "Crossplane Helm Release Ready"
              each:
                type: StateSet
                stateSet:
                  labelName: status
                  path: [status, conditions, "[type=Ready]", status]
                  list: [True, False]
              labelsFromPath:
                name: [metadata, name]
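To make the StateSet configuration more concrete, the following Python sketch mimics the shape of the series such a metric yields: one series per value in `list`, set to 1 when the value found at `path` matches and 0 otherwise. It illustrates the observable behavior only and is not how kube-state-metrics is implemented; `state_set_series` and the sample resource are hypothetical, and the printed lines omit the group, kind, and version labels as well as the metric name prefix.

```python
def state_set_series(resource: dict, condition_type: str, states: list) -> dict:
    """Return {state: 0 or 1} for the status of the named condition."""
    conditions = resource.get("status", {}).get("conditions", [])
    # Equivalent in spirit to the path [status, conditions, "[type=Ready]", status].
    current = next(
        (c.get("status") for c in conditions if c.get("type") == condition_type),
        None,
    )
    return {state: int(state == current) for state in states}


# Hypothetical resource, trimmed to the fields the sketch reads.
cluster = {
    "metadata": {"name": "cluster111-us-central1-01"},
    "status": {"conditions": [
        {"type": "Ready", "status": "True"},
        {"type": "Synced", "status": "True"},
    ]},
}

for state, value in state_set_series(cluster, "Ready", ["True", "False"]).items():
    name = cluster["metadata"]["name"]
    print(f'crossplane_cluster_ready{{name="{name}",status="{state}"}} {value}')
```

Seen this way, a query for the `status="False"` series with a value of 1 identifies exactly the resources that are not ready.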
Configuring the kube-state-metrics RBAC
kube-state-metrics will need permissions to list and watch the resources we are monitoring. Within the Kubernetes RBAC configuration, we are adding the following additional capabilities:
        - apiGroups: ["helm.crossplane.io"]
          resources: ["*"]
          verbs: ["list","watch"]
        - apiGroups: ["container.gcp.crossplane.io"]
          resources: ["*"]
          verbs: ["list","watch"]
        - apiGroups: ["kubernetes.crossplane.io"]
          resources: ["*"]
          verbs: ["list","watch"]
Validating the custom metrics are flowing
Forwarding the kube-state-metrics service to your machine allows you to see all of the available metrics that are published:
$ kubectl port-forward svc/kube-state-metrics 8080:8080 -n kube-system
$ curl http://localhost:8080/metrics | grep kube_crd
kube_crd_crossplane_cluster_ready{group="container.gcp.crossplane.io",kind="Cluster",name="cluster111-us-central1-01",status="False",version="v1beta2"} 0
kube_crd_crossplane_cluster_ready{group="container.gcp.crossplane.io",kind="Cluster",name="cluster111-us-central1-01",status="True",version="v1beta2"} 1
kube_crd_crossplane_cluster_ready{group="container.gcp.crossplane.io",kind="Cluster",name="cluster211-west2-01",status="False",version="v1beta2"} 0
kube_crd_crossplane_cluster_ready{group="container.gcp.crossplane.io",kind="Cluster",name="cluster211-west2-01",status="True",version="v1beta2"} 1
kube_crd_crossplane_cluster_ready{group="container.gcp.crossplane.io",kind="Cluster",name="cluster311-cluster",status="False",version="v1beta2"} 0
kube_crd_crossplane_cluster_ready{group="container.gcp.crossplane.io",kind="Cluster",name="cluster311-cluster",status="True",version="v1beta2"} 1
Note: As of the time of this writing, any change to the custom resources configuration file appears to require a restart of the kube-state-metrics pod.
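When validating by hand, it can help to reduce the metrics output to just the resources that are not ready. The sketch below parses lines in the format shown above and returns the names whose `status="False"` series has a value of 1. It is a simplified parser for illustration only (it does not handle escaped quotes in label values), and `not_ready` and the sample cluster names are hypothetical.

```python
import re

# Matches lines such as: metric_name{label="value",...} 1
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def not_ready(metrics_text: str) -> list:
    """Return the names of resources whose status="False" series has the value 1."""
    names = []
    for line in metrics_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip comments, blank lines, and label-less metrics
        labels = dict(LABEL_RE.findall(match.group("labels")))
        if labels.get("status") == "False" and float(match.group("value")) == 1:
            names.append(labels.get("name"))
    return names


# Hypothetical sample in the same shape as the curl output.
sample = '''
kube_crd_crossplane_cluster_ready{group="container.gcp.crossplane.io",kind="Cluster",name="cluster-a",status="False",version="v1beta2"} 1
kube_crd_crossplane_cluster_ready{group="container.gcp.crossplane.io",kind="Cluster",name="cluster-a",status="True",version="v1beta2"} 0
kube_crd_crossplane_cluster_ready{group="container.gcp.crossplane.io",kind="Cluster",name="cluster-b",status="False",version="v1beta2"} 0
kube_crd_crossplane_cluster_ready{group="container.gcp.crossplane.io",kind="Cluster",name="cluster-b",status="True",version="v1beta2"} 1
'''

print(not_ready(sample))  # ['cluster-a']
```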
Configuring Prometheus to scrape custom metrics
With kube-state-metrics configured to surface our desired metrics, ensure that our Prometheus instances are configured to scrape our new metric patterns. Since we use Google Managed Prometheus, this means that we are modifying our ClusterPodMonitoring resource to include the pattern 'crd'.
For example:
apiVersion: monitoring.googleapis.com/v1
kind: ClusterPodMonitoring
metadata:
  name: kube-state-metrics
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  endpoints:
  - port: http-metrics
    interval: 60s
    metricRelabeling:
    - action: keep
      regex: kube_(daemonset|deployment|pod|namespace|node|statefulset|crd)_.+
      sourceLabels: [__name__]
  targetLabels:
    metadata: [] # explicitly empty so the metric labels are respected
Note: 'crd' will change to 'customresource' in kube-state-metrics v2.8
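A quick way to reason about the keep rule above is to test metric names against the same pattern. Prometheus relabeling regexes are fully anchored, so Python's `re.fullmatch` is a reasonable stand-in for this simple pattern (Prometheus uses RE2 syntax, which differs from Python's in other cases). The helper `is_kept` and the list of metric names are illustrative assumptions.

```python
import re

# The keep pattern from the ClusterPodMonitoring example above.
KEEP = r"kube_(daemonset|deployment|pod|namespace|node|statefulset|crd)_.+"


def is_kept(metric_name: str, pattern: str = KEEP) -> bool:
    """Relabel regexes are anchored at both ends, so use fullmatch."""
    return re.fullmatch(pattern, metric_name) is not None


for name in (
    "kube_crd_crossplane_cluster_ready",
    "kube_customresource_crossplane_cluster_ready",
    "kube_pod_status_phase",
    "kube_configmap_info",
):
    print(f"{name}: {'kept' if is_kept(name) else 'dropped'}")
```

The second name shows why the pattern needs updating after an upgrade: metrics with the newer prefix would be dropped until 'customresource' is added to the alternation.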
Verifying Metrics within Google Managed Prometheus & Grafana
With Prometheus configured to send our Crossplane CRD metrics to GMP, we can verify that the metrics exist by using the metrics explorer:
Configuring the `gcp-opssuite-promql` codebundle
With our custom resource metrics flowing into our Prometheus instance, we will configure the gcp-opssuite-promql codebundle.
Note: While we are using the gcp-opssuite-promql codebundle, the queries are the same for any of the PromQL-based codebundles.
- Search and select the desired codebundle:
- Configure the codebundle values and use the following PromQL query to sum up any of our Crossplane custom resource objects that are not ready:
PROJECT_ID: gcp-project-i
PROMQL_STATEMENT: >-
  sum(kube_crd_crossplane_cluster_ready{status="False"}) +
  sum(kube_crd_crossplane_kubernetes_object_ready{status="False"}) +
  sum(kube_crd_crossplane_helm_release_ready{status="False"})
TRANSFORM: Raw
- With the SLI configuration changes committed and merged, we can view this SLI on our Map:
Comparing kubectl-get vs kube-state-metrics custom resource support
While we've demonstrated two ways to measure the health of our Crossplane custom resources across our clusters, you may wonder which method is preferable. There are a few considerations that come to mind when thinking about each approach.
Considerations when using kubectl-get
- Faster to test, faster to configure
- Familiar commands the team already knows
- Applies greater load on the Kube API server
- Might require multiple points on the map (SLXs) to support multiple clusters (kubectl-get codebundle currently targets a single cluster)
- Avoids Prometheus storage costs (if using a managed Prometheus service)
- RunWhen platform is the only persistent metric storage mechanism for metric results
- Best suited to test and development environments when iterating on custom resource definition structure / etc.
Considerations when using kube-state-metrics
- More infrastructure to configure, more change management required
- Less load on the Kube API server
- Easier to query results for multiple Kubernetes clusters
- Better suited to stable custom resource APIs
- Better aligned with many Kubernetes operational teams' monitoring practices
- Best suited to production monitoring environments
Service Level Objectives
SLO: Configure an SLO to Alert on Unhealthy Crossplane Resources
The Service Level Objective for this particular use case is straightforward. Since we are adding up the total number of resources that are not ready, and we want them to always be ready, the SLO is set to a desired value of zero for 95% of the time (as we expect the resources to be in a "reconciling" state some of the time). We can revisit this SLO at a later date once we have some historical data to review.
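As a rough illustration of what this objective means numerically, the sketch below checks a window of SLI samples against the 95% target and reports how much of the error budget has been used. It is a simplified model and not how the RunWhen platform evaluates SLOs; `slo_compliance` and the sample values are hypothetical.

```python
def slo_compliance(sli_samples: list, target: float = 0.95) -> dict:
    """Compare the share of samples with zero unhealthy resources to the target."""
    if not sli_samples:
        raise ValueError("at least one SLI sample is required")
    good = sum(1 for value in sli_samples if value == 0)
    compliance = good / len(sli_samples)
    # Error budget: the fraction of samples allowed to be non-zero.
    budget = 1 - target
    consumed = (1 - compliance) / budget
    return {
        "compliance": compliance,
        "budget_consumed": consumed,
        "alerting": compliance < target,
    }


# Hypothetical window of 20 SLI samples with two non-zero readings.
samples = [0] * 18 + [2, 1]
print(slo_compliance(samples))
```

With two bad samples out of twenty, compliance is 90%, which is below the 95% target, so this window would be considered in violation.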
Automated Tasks
TaskSet: Generate Reports on Failing Crossplane Objects
Regardless of which method we use to report on the health of the Crossplane objects in our clusters, we need a way to provide a report with useful details when a resource is considered unhealthy.
Normally we would jump into the cluster and run 'kubectl get' to provide the following details:
- Resource Type
- Resource Name
- Reason for Failure
From the command line, this might look something like this:
kubectl get objects.kubernetes.crossplane.io,releases.helm.crossplane.io,clusters.container.gcp.crossplane.io --all-namespaces -o json | jq -r '.items[] | select(.status.conditions[].status!="True") | {apiVersion: .apiVersion, kind: .kind, name: .metadata.name, helm_release_status: .status.atProvider.releaseDescription, cluster_self_link: .status.atProvider.selfLink? }'
The above command could be customized to provide any details that are helpful to team members responsible for triaging the Crossplane resources. In this case, we are providing the reason that the Helm release is failing, as well as the cluster selfLink so that users can quickly jump to the GCP console to debug the issue further.
Adding this command to a RunWhen TaskSet is straightforward, since we can take that same command and leverage the k8s-kubectl-run codebundle to perform the task for us (or for other users of the map).
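For teams that prefer to post-process the JSON outside of jq, the same report could be sketched in Python. This is an illustrative variant, not the codebundle's implementation: `unhealthy_report` is a hypothetical helper, the `reasons` field is an addition taken from each failing condition, and the sample object is made up.

```python
import json


def unhealthy_report(kubectl_json: str) -> list:
    """Build one report entry per object with a condition that is not "True"."""
    report = []
    for item in json.loads(kubectl_json).get("items", []):
        status = item.get("status", {})
        failing = [c for c in status.get("conditions", []) if c.get("status") != "True"]
        if not failing:
            continue
        at_provider = status.get("atProvider", {})
        report.append({
            "apiVersion": item.get("apiVersion"),
            "kind": item.get("kind"),
            "name": item.get("metadata", {}).get("name"),
            # Condition type and reason come from the failing conditions themselves.
            "reasons": [f"{c.get('type')}: {c.get('reason')}" for c in failing],
            "helm_release_status": at_provider.get("releaseDescription"),
            "cluster_self_link": at_provider.get("selfLink"),
        })
    return report


# Hypothetical sample standing in for the kubectl JSON output.
sample = json.dumps({"items": [{
    "apiVersion": "helm.crossplane.io/v1beta1",
    "kind": "Release",
    "metadata": {"name": "example-release"},
    "status": {
        "atProvider": {"releaseDescription": "example failure description"},
        "conditions": [
            {"type": "Ready", "status": "False", "reason": "Unavailable"},
            {"type": "Synced", "status": "True", "reason": "ReconcileSuccess"},
        ],
    },
}]})

print(json.dumps(unhealthy_report(sample), indent=2))
```

Fields that do not apply to a given kind (for example, a selfLink on a Helm release) simply come back as null, which mirrors the behavior of the jq expression.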
- Search for the k8s-kubectl-run codebundle and select it:
- Configure the codebundle with the command outlined above:
- With the configuration committed and merged, run the TaskSet from the Map and verify the output:
Workflow: Automatically Running TaskSets from SLO Alerts
We now have the following components in place to report on the service health of our Crossplane managed resources within RunWhen:
- A Service Level Indicator to report on unhealthy Crossplane managed resources
- A Service Level Objective to alert on failing resources and to display our error budgets throughout the month
- A TaskSet to provide details about the failing objects
The final step is to create a workflow that will automatically execute our TaskSet when the SLO is alerting. To do this, we navigate to workspace settings and create a new workflow:
Conclusion
This post was intended to demonstrate two methods for measuring the health of Crossplane managed resources, which are custom resources within our Kubernetes environments. While this article demonstrates how to use the features of the RunWhen platform to do this, kubectl get and kube-state-metrics are available to any Kubernetes user who has command-line access or access to a Prometheus instance that is collecting metrics from a Kubernetes cluster. Using these methods, combined with the RunWhen platform, any Kubernetes platform operator can easily use Service Level Indicators and Service Level Objectives to automatically triage failing custom resources within a cluster.
A note about codebundles
All codebundles that are authored and maintained by RunWhen are found at https://github.com/runwhen-contrib/rw-public-codecollection. All codebundle code is open-source software and open to refinement, enhancements, etc. All codebundles are written in Robot Framework and are extremely flexible for creating Service Level Indicators and TaskSets that other tools may not be able to support. These codebundles are the foundation of the RunWhen Social Reliability Engineering platform, which allows organizations to use and contribute to community-authored service health monitoring, triage, and troubleshooting tools and approaches.