Are we losing the ability to oversee and understand complex systems' states?

Short answer: yes, I do believe so.

Before jumping into the elaboration of that answer, I need to clarify something: in our context, oversee does not mean observe.

That is important since one might argue that observability has never been as advanced as it is now. However, in IT in general and DevOps in particular, observe refers to the ability to define, collect, and monitor metrics (mostly), define certain rules, and then act based on a combination of those rules and current metric values. This ground was largely covered long ago by good old Control Theory.

Having clarified that, let's define oversee in our context as the ability to query and get information from a set of systems as a whole, with the possibility of defining intrinsic relations, each with a given meaning, between those systems. We are not describing anything new here, since this is mostly the very definition of an Ontology.

Let’s explore it with a concrete example.

Imagine that you work for a company that has 10 internal applications that keep the business running. Each application has its own code repository, packages, releases, scheduled resources (compute, storage, and networking), log entries, metrics/indicators, and backlogs.

There are likely disparate systems deployed in that organization to support each of those application requirements: for example, GitHub hosts the code repositories, Kubernetes schedules the containerized workloads, and the logs collected from applications are stored in ElasticSearch.

The question is: how do you browse application-related information in each system? How do you make the connections to determine which log lines relate to which container running in Kubernetes, which in turn was released by building a specific codebase that includes a collection of backlog tasks?

Let me answer that for you: it is mostly done "by hand", that is, by accessing the systems' APIs or their web consoles one by one. Let's face it.

It is shocking to realize that during the last decade we have substantially improved the way we provision infrastructure using code, as well as the way we build and release applications. In turn, however, we have degraded the overall development and maintenance experience by making the tech landscape almost impossible to understand, to the point where a simple "Hello World" application involves an insane number of steps from the IDE to PROD.

The forgotten “Design Phase” is still good

In my view, the origin of the problem is the lack of design. Having spent 25 years in the software industry, I had the opportunity to witness how first Cloud departments and later DevOps and SRE teams were created. The conclusion is that less than 10% of all the teams I worked with during the last decade had a proper design covering the application lifecycle, from inception to decommissioning.

That is not, in my opinion, due to a lack of talent in the DevOps space. The technology landscape grew in size and complexity due to requirements coming from developers, and the (sometimes irrational) rush to adopt new systems, frameworks, and techniques put DevOps teams under a lot of pressure.

The general perception is that "design" means bureaucracy and limitations. Once a design dictates how an application interacts with the rest of the technological landscape that supports its lifecycle, most developers see it as a kind of trap that will prevent them from innovating and moving with the desired degrees of freedom.

However, that does not have to be the case. In fact, it is quite the opposite. If we ask developers about their complaints, I am quite sure that "difficulty interacting with the infrastructure and tooling" will rank pretty high, even in organizations that claim to have an open Cloud environment with a self-service, maintain-it-yourself approach.

Going back to the point, we should think about designing centralized IT/Cloud/DevOps data structures and models that capture the essence of how each organization handles the application lifecycle, while being flexible enough to quickly adapt to underlying changes.

The immediate benefit of having such a model is that developers and other users who need to interact with those systems won't have to learn and get access to each of them; instead, they will interact with a minimal conceptual model, without having to care whether the code is hosted on GitHub or in a different Git repository, or whether the containerized application runs in a Kubernetes environment or on a plain VM with Docker.

Backstage and the emergence of the IDP (Internal Developer Platform) as a possible solution

As you can imagine, I am not the only one thinking about this problem and how to solve it. In fact, some time ago, Spotify made public the platform it uses to organize its applications and services and to aggregate all information relevant to developers, called Backstage.

Since its appearance, it has gained a lot of popularity and adoption, even turning the IDP into an industry of its own, with companies providing services around the concept and even competing with Backstage.

Backstage has three main parts: Service Definitions, a Catalog, and Plugins that allow users to interact with other systems' interfaces from within the Backstage portal.

Even though I believe it is a great platform, I found that gluing all the pieces together and interacting with them can be challenging, and I still long for an easier and more flexible mechanism for designing Models (partially covered in Backstage by catalogs).

Schema-based models

One thing I started picturing in my mind is a sort of relational database, composed of schemas, where each entity (table) represents an entity of a target system, either directly or in a transformed way.

I started working on that while looking for the best alternative to abstract and simplify Kubernetes resources, with the ability to query them and get aggregated results, for example, Deployments across several clusters that match certain SQL-like filter criteria.
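To make that concrete, here is the kind of query I had in mind. This is only a sketch; the schema, table, and column names (k8s.deployments and friends) are illustrative assumptions, not an existing interface:

    -- Find Deployments that are not fully ready, across two clusters
    SELECT cluster, namespace, name, readyreplicas, desiredreplicas
    FROM k8s.deployments
    WHERE cluster IN ('my_demo_cluster_1', 'my_demo_cluster_2')
      AND readyreplicas < desiredreplicas;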

Then I thought, "what if developers or other systems could interact with services and applications the same way we interact with regular structured data?".

In other words, I wanted to open the possibility of CRUD operations on a complete infrastructure using plain old SQL-style queries, supporting relations (JOINs) between entities that represent disparate systems.
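Since all entities would live side by side in one schema, nothing prevents a JOIN from spanning systems, or a write from propagating upstream. Again a sketch with made-up names:

    -- Read: join a Kubernetes-backed entity with a GitHub-backed one
    SELECT d.cluster, d.name, r.repo_location, r.pushed_at
    FROM k8s.deployments d
    JOIN github.repositories r ON r.repo_name = d.name
    WHERE r.pushed_at > now() - interval '7 days';

    -- Write: a CRUD operation the engine would translate into a scale call
    UPDATE k8s.deployments
    SET desiredreplicas = 3
    WHERE cluster = 'my_demo_cluster_1' AND name = 'app-backend';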

After investigating and testing some PostgreSQL extensions and MySQL engines, I came to the conclusion that none of them satisfied my expectations and vision of how a DB engine with those capabilities should behave.

Connecting disparate systems

The hypothetical DB engine must support what I divided into two main entity (table) types:

  • Descriptive: entities that describe elements of a Model which do not necessarily exist in upstream systems already. For example, some aspects of an application, like owners, developers, components, and metrics, are not stored anywhere, so we should be able to create them as in a regular database model.
  • Virtual: entities that are accessed like regular tables, but whose actual information is fetched from upstream systems. For example, an abstract Code Repository entity can be mapped to a GitHub repository.

This opens up the possibility of designing something like the model sketched below.
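A minimal sketch in SQL-like DDL, borrowing SQLite's CREATE VIRTUAL TABLE notation purely for illustration (the github_module and its options are hypothetical, not a feature of any existing engine):

    -- Descriptive entity: exists only in the Model
    CREATE TABLE application (
        app_id   UUID PRIMARY KEY,
        app_name TEXT NOT NULL,
        owner    TEXT
    );

    -- Virtual entity: rows are fetched from GitHub at query time
    CREATE VIRTUAL TABLE code_repo
        USING github_module (org = 'demo');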

If that were possible, then we could easily get information about the real, current state of our top-level system, in a structured and easy-to-consume way, without having to digest complex and unhelpful details of the underlying systems.

Implementing the Engine

As mentioned earlier, I was not able to find a solution that satisfies all the requirements, which is why I could not test the concept in a real-case scenario and decide whether or not it is feasible.

Therefore, I finally opted to define and implement the engine myself, around some minimal, top-level requirements:

  • Support for Descriptive and Virtual entities as defined above.
  • Operations must be consistent, like in a regular relational database.
  • Avoid baking system-specific API logic into the engine; instead, that logic must be injectable at runtime.
  • Define configuration and interfaces declaratively, in a format simple enough to be generated by a Generative AI (see the sketch after this list).
  • All the injected logic, grouped in Modules, must follow CI/CD principles, that is, be released as versioned artifacts.
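To picture the last three requirements, imagine a declaration in the spirit of PostgreSQL's foreign-data-wrapper syntax, where the wrapper is a versioned module injected at runtime. The kubernetes_fdw module, its options, and the paths here are hypothetical; the actual engine defines its own declarative format:

    -- A data source served by a runtime-injected, versioned module
    CREATE SERVER my_demo_cluster_1
        FOREIGN DATA WRAPPER kubernetes_fdw  -- hypothetical module name
        OPTIONS (kubeconfig '/etc/vdb/demo1.yaml', module_version '1.2.0');

    -- A Virtual entity bound to that data source
    CREATE FOREIGN TABLE deployment (
        cluster TEXT, namespace TEXT, name TEXT,
        readyreplicas INT, desiredreplicas INT
    ) SERVER my_demo_cluster_1;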

The AppModel

In a hypothetical organization we have the following problem: the Cloud team is a single engineer, while application/platform developers demand resources and changes every single day. As a result, the cloud engineer can't handle tickets in a timely manner and, for the sake of getting things done as quickly as possible, a lot of ad-hoc resources get provisioned, which are difficult to manage, monitor, replicate and, more importantly, to calculate the cost of. In short, as this article's title suggests, difficult to oversee and understand.

So the engineer starts designing a solution that allows querying and updating the infrastructure, while also exposing endpoints for interacting with it without direct access, so that app teams can include those endpoints in pipelines, portals, etc.

What the engineer knows for sure is that application code is hosted in GitHub and executables run in Kubernetes; besides, most applications are made up of components.

Components are the key element, since the application lifecycle revolves around them. Having a universal component ID would help trace a component across the infrastructure.

At this point, the engineer realizes that the definitions of Application, Component, and the map tables (for JOINs) do not exist anywhere yet and must be created in a database.

However, CODE_REPO and DEPLOYMENT information lives in GitHub and Kubernetes respectively, meaning that we can use their APIs.

The new database will then have three data source types: a relational database, GitHub, and Kubernetes.

In the case of Kubernetes, there could be more than one cluster, with more coming in the future.
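Putting it together, the engineer's model could look roughly like this. The descriptive tables live in the relational data source, while CODE_REPO and DEPLOYMENT are Virtual; all names and SERVER bindings are illustrative, reusing the hypothetical notation from the earlier sketches:

    -- Descriptive: created and stored in the regular database
    CREATE TABLE application (
        app_id   UUID PRIMARY KEY,
        app_name TEXT NOT NULL
    );

    CREATE TABLE component (
        component_id   UUID PRIMARY KEY,  -- the universal component ID
        component_name TEXT NOT NULL
    );

    -- Map table wiring applications to components (used for JOINs)
    CREATE TABLE app_component (
        app_id       UUID REFERENCES application (app_id),
        component_id UUID REFERENCES component (component_id),
        PRIMARY KEY (app_id, component_id)
    );

    -- Virtual: bound to upstream systems via runtime modules
    CREATE FOREIGN TABLE code_repo (
        component_id UUID, repo_location TEXT
    ) SERVER github_source;

    CREATE FOREIGN TABLE deployment (
        component_id UUID, cluster TEXT, environment TEXT,
        deployment_identifier TEXT,
        readyreplicas INT, availablereplicas INT, desiredreplicas INT
    ) SERVER my_demo_cluster_1;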

Helping the Engineer to define the VDB

We solved the problem for the engineer and pushed the solution here.

All the heavy lifting is done by the engine itself and, as you will notice, we only define Data Sources, Translators (bridges), Schemas, and Endpoints (actions and queries) in a declarative way.

If you like the idea and find it helpful in some way, please start a discussion with your questions so I can focus the documentation on the relevant aspects.

Coming back to the point of the article: how can this solution help us oversee and understand?

This endpoint is a really good example of both; let's explore what it returns:

    {
        "app_name": "my-new-app",
        "cluster": "my_demo_cluster_1",
        "component_id": "c041faac-00dc-431c-95e5-8d5e45abb100",
        "environment": "env",
        "readyreplicas": 1,
        "component_name": "app-backend",
        "repo_location": "https://api.github.com/repos/demo/app-backend",
        "deployment_identifier": "6zLI2VaEpzGnzwNEhOhAIB3PrbDcN8UpG6Y41cMkvJ...",
        "availablereplicas": 1,
        "desiredreplicas": 1,
        "app_id": "3c0de61c-0839-4500-aa1a-64c11c0cad0b"
    }

For a given app, we get all of its components, the location of each repository, and the status of each deployment in a single simple document, without having to write complex bespoke integration middleware.
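Under the hood, a query along the following lines could assemble that document; the joins below assume the illustrative model sketched earlier, not the engine's actual internals:

    SELECT a.app_id, a.app_name,
           c.component_id, c.component_name,
           r.repo_location,
           d.cluster, d.environment, d.deployment_identifier,
           d.readyreplicas, d.availablereplicas, d.desiredreplicas
    FROM application a
    JOIN app_component ac ON ac.app_id = a.app_id
    JOIN component c      ON c.component_id = ac.component_id
    JOIN code_repo r      ON r.component_id = c.component_id  -- Virtual: GitHub
    JOIN deployment d     ON d.component_id = c.component_id  -- Virtual: Kubernetes
    WHERE a.app_name = 'my-new-app';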

Another good example is this endpoint, which adds a new component to an application, doing all the necessary wiring with just a couple of fields as input.
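In model terms, the wiring that endpoint performs boils down to something like this (illustrative only, reusing the IDs from the document above; the engine takes care of propagating the change to the upstream systems):

    -- Register the component, then wire it to its application
    INSERT INTO component (component_id, component_name)
    VALUES ('c041faac-00dc-431c-95e5-8d5e45abb100', 'app-backend');

    INSERT INTO app_component (app_id, component_id)
    VALUES ('3c0de61c-0839-4500-aa1a-64c11c0cad0b',
            'c041faac-00dc-431c-95e5-8d5e45abb100');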

Conclusion

Regaining governance and operational capacity over our infrastructure, and keeping costs under control, while making all of it standard across the whole organization, is a debt our industry has yet to pay.

It does not necessarily mean "restrictions"; it does, however, mean "good design", which helps us understand what we are overseeing by assigning meaning to the entities involved and their relationships.
