In LinkedIn’s fast-evolving data infrastructure, efficient resource management and monitoring are crucial. Resource provisioning used to be difficult, with application developers coordinating directly with infrastructure teams, often leading to prolonged back-and-forth communication, delays, and inefficiencies. Developers would spend days—sometimes weeks—handling infrastructure requests, while infrastructure teams faced daily repetitive manual work. To address this, we built Nuage, which provides an interface between the infrastructure and developer teams to simplify resource management and establish best practices and processes. 

Earlier iterations of Nuage focused on providing self-service capabilities across more than 30 infrastructure platforms, including storage (Espresso, Venice, Pinot, MySQL, Ambry), streaming (Kafka), managed search (Hosted Search), and managed stream processing (Samza). But as the scale grew and requirements became more complex, Nuage evolved from offering self-serve capabilities to a full-fledged control plane solution that manages the entire resource lifecycle with key features like resource discoverability, access control, and policy enforcement.

In this blog post, we’ll cover the evolution of Nuage as a control plane framework. Readers will learn more about various control plane requirements across LinkedIn and how Nuage provides a centralized framework for building control planes.

Why we need a control plane

A control plane for data infrastructure could be defined as "a scalable platform designed for global provisioning, management, and governance of resources." However, this definition lacks the depth needed to convey its full purpose and importance for data infrastructure. A clearer understanding emerges when considering the specific responsibilities it fulfills.

Figure 1. Control Plane Requirements

For the teams building infrastructure, flexibility and control are crucial. Before Nuage, application developers navigated a cumbersome process of engaging directly with infrastructure teams for resource provisioning and management, causing delays and inefficiencies. Infrastructure teams needed an interface that encapsulates the essential business logic for infrastructure management and exposes only the components relevant to users. Beyond provisioning, these teams often require streamlined processes for resource management: for instance, authorization mechanisms that ensure only authorized personnel can modify resources, and approval workflows that add manual oversight for critical requests such as a resource's quota increase. To keep systems running smoothly, infrastructure teams also rely on performance metrics like latency, error rates, and usage statistics, which help them optimize performance and resolve issues quickly.

Application developers benefit from easy resource discovery and management. A well-designed control plane allows developers to check resource usage, such as quota levels, which helps identify cost-saving opportunities. Beyond resource creation, developers need to define and manage access control, providing appropriate access to their resources. Insights into cost and usage patterns also allow developers to make informed decisions, ensuring efficient resource allocation and budget management.

Resources are designed to meet specific requirements that help ensure accountability and traceability. Each resource is assigned a designated owner so that responsibility is clear. In addition, resources are tagged with compliance information, such as whether they contain personally identifiable information (PII), along with defined purge policies. A control plane enforces these regulations from the moment a dataset is provisioned. Robust auditing capabilities are also crucial to track all changes and provide transparency.

Nuage offers these capabilities with a comprehensive control plane framework, allowing infrastructure teams to build control planes for their infrastructure resources while adhering to the company's policies.

Brief history of Nuage and motivations for Nuage 3.0 

Nuage 1.0

Nuage was initially developed to reduce the manual work involved in provisioning and managing data infrastructure resources through a self-service platform. The vision was to centralize essential capabilities—like authorization, discoverability, search, and auditing—in a single control plane, so infrastructure teams wouldn't have to build these functionalities independently. The first version of Nuage was a monolithic service, offering control plane endpoints for various infrastructure resources like Espresso, Venice and Kafka. Each type of resource had its specific business logic bundled in separate modules, while common functionality was managed centrally.

Nuage 2.0

As LinkedIn’s data infrastructure expanded, so did the number and complexity of supported resources. The number of platforms increased significantly, from a few initial services to more than ten distinct infrastructure platforms. With this growth, the business logic required for each resource also became more complicated. For example, database creation in the storage system now included cluster selection algorithms to optimize for resource utilization. This added complexity created a bottleneck, as the monolithic design couldn’t scale efficiently across multiple teams. 

To address these limitations, we introduced Nuage 2.0, shifting control plane development to a decentralized model. Core functionalities, such as authorization and search, were packaged into a library, allowing infrastructure teams to build their own resource providers independently. However, dependencies on a single shared persistence layer and tight coupling between the library and resource provider components continued to present challenges. For more background, see previous tech blogs about solving the control plane problems at scale. Although Nuage 2.0 introduced a decentralized approach through the nuage-sdk, this architecture presented new challenges:

  • Security: Nuage metadata lived in storage shared across all Resource Providers, with no tenant isolation for reads or writes. The shared storage also coupled performance: a suboptimal query from one resource provider could slow down the entire system.
  • Onboarding experience: All Resource Providers used a shared persistence layer (via the SDK) to persist metadata. Onboarding onto Nuage required understanding Nuage configs and deployment topology, which steepened the learning curve for infrastructure partners and often required help from Nuage engineers.
  • User experience: The Nuage client layer relied on auto-generated UI. However, default layouts used screen space sub-optimally and left little scope for customization. UX controls were not always intuitive, which hurt customer satisfaction.
  • Ownership: Tight coupling of control plane logic (Nuage SDK) and business logic in Resource Providers makes it very hard for individual teams to own Resource Providers. A more scalable model where infrastructure partners own their respective resource providers is needed.
  • Error triaging and analysis: Lack of clear contracts between Control Plane and Resource Providers often delays error triaging, leading to poor accountability and slower time to resolution.

Figure 2. Evolution of Nuage

Nuage 3.0

To address these issues, Nuage 3.0 was developed with the following goals:

  • Centralized management: Streamlining resource management through a centralized service that exposes uniform APIs and enforces consistent interfaces across all platforms.
  • Decoupling of logic: Separating horizontal control plane capabilities from infrastructure-specific logic to reduce operational overhead and improve scalability.
  • Enhanced security: Implementing stricter access controls via a new RBAC model and removing the shared storage to help prevent unauthorized modifications and help ensure secure communications.
  • Improved performance: Optimizing query performance and minimizing latency by establishing better resource API and data modeling practices.
  • Simplified onboarding: Making it easier for new resource providers to integrate by eliminating shared persistence issues and providing a clear separation of concerns.

Nuage's comprehensive control plane capabilities for data infrastructure

Figure 3. Control Plane capabilities

Nuage offers a comprehensive control plane framework that enhances discoverability, security, resource management, and monitoring for LinkedIn’s data infrastructure. Let’s explore the key features that make Nuage essential.

Client interaction

Nuage provides an intuitive user interface through a web portal, public APIs, and CLI. These tools make it easy for users to interact with the control plane, while a dedicated admin interface simplifies resource configuration and management.

Front door capabilities

Nuage maintains the integrity, security, and efficiency of the system by managing how external interactions are handled right from the start with:

  • Request validation and sanitization: Each request goes through thorough validation and sanitization before it is executed, for security and compliance purposes.
  • Intelligent request routing: Nuage uses smart routing to direct requests to the appropriate environment, such as staging or production, optimizing resource handling.
  • Auditing: Incoming requests are logged. These audit logs are accessible via a user-friendly audit log application, where users can filter through various criteria to track activity.

Discoverability

Nuage offers advanced search with multi-criteria filtering, so users can refine their searches by combining filters like name, environment, ownership, tags, timestamps, and status. Additionally, platform teams can extend search capabilities to include custom platform-specific properties, making it easier for onboarding teams to find exactly what they need.

Resource provisioning and management

Nuage offers consistent CRUD (Create, Read, Update, Delete) interfaces across platforms. This contract-first approach ensures that APIs are designed and standardized before implementation. Payloads are validated for mandatory fields, and asynchronous operations are handled smoothly to ensure reliable resource provisioning.

Access and policy control

To prevent unintended changes, Nuage enforces strict access controls, including authorization checks (only owners or admins can perform certain actions), ACL checks, and multi-level approval workflows for critical tasks. It also supports phased rollouts and live traffic checks to ensure safe deployment across environments.

Monitoring

Nuage equips infrastructure teams with real-time insights into system performance through key error and latency metrics. It also generates managed dashboards with pre-configured critical alerts during resource creation, providing greater visibility and enabling proactive management.

Nuage 3.0 architecture

Figure 4. Nuage 3.0 architecture

Resource provider

A resource provider exposes APIs to provision and manage infrastructure resources; for example, a Kafka resource provider manages Kafka topics. Each provider integrates with Nuage's resource manager and adheres to operational contracts to ensure uniform resource management. These API and data model contracts guarantee a consistent experience across all resources in Nuage.

While resource providers typically expose CRUD (Create, Read, Update, Delete) operations, they may also offer custom actions, such as increasing resource quotas, registering schemas, fetching resource insights and cost data, configuring alerting mechanisms, and monitoring system health and traffic patterns and surfacing them via dashboards.

Resource provider contracts and guidelines

The process starts by defining resource data models using data contracts and creating resource classes to expose RESTful endpoints. In Nuage 3.0, resource providers must implement the API and data model contracts defined by Nuage. This ensures consistent interface design by:

  • Organizing APIs around resources to create intuitive and easily navigable endpoints.
  • Exposing standard CRUD operations, providing consistent Create, Read, Update, and Delete endpoints to interact with resources.
  • Modeling resource relationships for better performance, such as defining relationships between resources (e.g., parent-child) to allow efficient data retrieval and minimize redundant requests.
  • Following consistent URI structures that adhere to a predictable, hierarchical structure. For example, “/nuageKafkaTopics” represents a collection resource endpoint, whereas “/nuageKafkaTopics/{topic_id}” represents a single entity endpoint. Similarly, a subresource URI represents the full hierarchy, such as “databases/{database_id}/table/{table_id}”.
  • Applying uniform HTTP patterns for requests and responses, with consistent methods (e.g., GET for reads, POST for creation), standard status codes (e.g., 200 for success, 404 for not found), and structured JSON responses for predictability.
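
The URI convention above can be illustrated with a minimal sketch. The `classify_uri` helper is hypothetical and not part of Nuage; it only assumes that a path alternates between collection names and entity ids, as in the examples given.

```python
from typing import List, Tuple


def classify_uri(path: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Split a resource path into (kind, [(collection, entity_id), ...]).

    Hypothetical helper: assumes segments alternate collection / id,
    e.g. "databases/db1/table/t1". A missing id is returned as "".
    """
    segments = [s for s in path.split("/") if s]
    if not segments:
        raise ValueError("empty resource path")
    # Pair each collection name with the id that follows it, if any.
    pairs = [
        (segments[i], segments[i + 1] if i + 1 < len(segments) else "")
        for i in range(0, len(segments), 2)
    ]
    # An odd number of segments means the path ends on a collection name.
    kind = "collection" if len(segments) % 2 == 1 else "entity"
    return kind, pairs
```

With this shape, the last pair identifies the addressed resource and the earlier pairs give its parents, which is what lets a gateway reason about hierarchy from the path alone.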

Nuage resource manager

Nuage Resource Manager (NRM) is a centralized managed service that acts as a gateway between Nuage clients and resource providers, exposing resource management operations. It enables management features for resource entities, such as request routing, authN/authZ, search, validation, audit logging, and async workflow management.

Routing

NRM acts as the central hub for all incoming requests, directing them to the appropriate resource provider within the correct environment. It uses the URI to identify the resource path and determine the resource hierarchy. If a resource provider is registered for that path in the NRM resource manifest, the request is directed to that provider. NRM also applies routing logic to determine the destination environment where the request should be handled. For example, for a resource id “urn:li:nuageResource:(PROD,KAFKA_TOPIC, id)”, the request needs to go to the production service.

Each resource provider manages resources within its designated environment. Therefore, if a client in the production (PROD) environment intends to execute an operation on a resource in the staging (EI) environment, NRM in PROD is responsible for routing this request to the appropriate Resource Provider in the EI environment. This ensures that requests are directed to the correct environment for processing.
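
A minimal sketch of this routing decision follows. It assumes the URN layout shown in the example above; the `MANIFEST` structure and the host names are hypothetical placeholders, not the real NRM resource manifest.

```python
import re
from typing import Dict, NamedTuple


class ResourceUrn(NamedTuple):
    environment: str
    resource_type: str
    resource_id: str


# Matches urn:li:nuageResource:(ENV,TYPE,id), tolerating spaces after commas.
_URN_PATTERN = re.compile(r"^urn:li:nuageResource:\(([^,]+),([^,]+),(.+)\)$")


def parse_urn(urn: str) -> ResourceUrn:
    match = _URN_PATTERN.match(urn.strip())
    if match is None:
        raise ValueError(f"not a Nuage resource URN: {urn!r}")
    env, rtype, rid = (part.strip() for part in match.groups())
    return ResourceUrn(env, rtype, rid)


# Hypothetical manifest: resource type -> provider host per environment.
MANIFEST: Dict[str, Dict[str, str]] = {
    "KAFKA_TOPIC": {
        "PROD": "kafka-rp.prod.example.com",
        "EI": "kafka-rp.ei.example.com",
    },
}


def route(urn: str, manifest: Dict[str, Dict[str, str]] = MANIFEST) -> str:
    """Return the provider host that should handle a request for this URN."""
    parsed = parse_urn(urn)
    providers = manifest.get(parsed.resource_type)
    if providers is None:
        raise LookupError(f"no provider registered for {parsed.resource_type}")
    if parsed.environment not in providers:
        raise LookupError(f"no provider in environment {parsed.environment}")
    return providers[parsed.environment]
```

Because the environment is read from the resource id rather than from the caller's location, a client in PROD that addresses an EI resource is routed to the EI provider.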

Authorization

Every resource that exists on Nuage must have an owner associated with it. Resource owners are responsible for operating the resource (allowing them to update/delete it) and ensuring that it's compliant by providing schema annotations and other compliance information required. Owners are recognized as crew, an entity which represents the encoded structure of our teams and organization.

Nuage Resource Manager leverages role-based access control (RBAC) to provide fine-grained access management of resources on Nuage. Nuage RBAC governs who has access to resources, what they can do with those resources, and what areas they have access to. Resource provider admins can also create custom roles for their resource providers (via the resource provider manifest) and configure access control based on those roles. By default, the following roles exist on the Nuage platform:

  • VIEWER
  • CONTRIBUTOR (user with this role has R/W access to the resource, but cannot change role assignments)
  • CREW_MEMBER (member of the resource owning crew)
  • RESOURCE_PROVIDER_ADMIN (member of resource provider team)

Operation                                            VIEWER  CONTRIBUTOR  CREW_MEMBER  RESOURCE_PROVIDER_ADMIN
Search & find a resource, view the resource details  X       X            X            X
Update the resource                                          X            X            X
Delete the resource                                          X            X            X
Transfer to another crew                                                  X            X
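
A role check against the default roles could be sketched as below. The role names come from the text; the permission matrix is an assumption (every role can view, CONTRIBUTOR and above can update and delete, and only CREW_MEMBER and RESOURCE_PROVIDER_ADMIN can transfer ownership), and the `is_allowed` function is hypothetical.

```python
from enum import Enum
from typing import Dict, FrozenSet, Iterable


class Role(Enum):
    VIEWER = "VIEWER"
    CONTRIBUTOR = "CONTRIBUTOR"
    CREW_MEMBER = "CREW_MEMBER"
    RESOURCE_PROVIDER_ADMIN = "RESOURCE_PROVIDER_ADMIN"


class Operation(Enum):
    VIEW = "VIEW"
    UPDATE = "UPDATE"
    DELETE = "DELETE"
    TRANSFER = "TRANSFER"


# Assumed matrix: which roles may perform each operation.
PERMISSIONS: Dict[Operation, FrozenSet[Role]] = {
    Operation.VIEW: frozenset(Role),
    Operation.UPDATE: frozenset(
        {Role.CONTRIBUTOR, Role.CREW_MEMBER, Role.RESOURCE_PROVIDER_ADMIN}
    ),
    Operation.DELETE: frozenset(
        {Role.CONTRIBUTOR, Role.CREW_MEMBER, Role.RESOURCE_PROVIDER_ADMIN}
    ),
    Operation.TRANSFER: frozenset({Role.CREW_MEMBER, Role.RESOURCE_PROVIDER_ADMIN}),
}


def is_allowed(roles: Iterable[Role], operation: Operation) -> bool:
    """A caller is allowed if any of their roles grants the operation."""
    return any(role in PERMISSIONS[operation] for role in roles)
```

Keeping the matrix as data rather than branching logic is one way custom roles from a provider manifest could be merged in without changing the check itself.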

Search

Nuage can efficiently search over resources across different scopes out of the box, without having to query all the RPs for all the resources in all scopes. Nuage Resource Manager maintains a persistent cache (MySQL) to provide faster search. Out-of-the-box search support is provided over common fields: name, environment, ownership, tags, created/updated timestamp, and status.

Resource providers also have the option to extend the search support over the platform specific attributes of a resource. This is done via a Data Model annotation. Nuage Resource Manager uses Resource Provider’s RestSpec to determine which fields to cache.

The cache is updated only on CRUD operations. Resource providers must therefore follow certain data and resource modeling guidelines to keep the cache consistent.
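
Multi-criteria search over a cached table can be sketched as follows. This is an illustration only: it uses an in-memory SQLite database as a stand-in for the MySQL cache, and the table layout, column names, and sample rows are assumptions.

```python
import sqlite3
from typing import Dict, List

# Assumed subset of the common searchable fields.
SEARCHABLE = ("name", "environment", "owner", "status")


def build_cache() -> sqlite3.Connection:
    """Create a small in-memory cache with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE resources (name TEXT, environment TEXT, owner TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO resources VALUES (?, ?, ?, ?)",
        [
            ("orders", "PROD", "crew-a", "ACTIVE"),
            ("orders-test", "EI", "crew-a", "ACTIVE"),
            ("clicks", "PROD", "crew-b", "DELETED"),
        ],
    )
    return conn


def search(conn: sqlite3.Connection, filters: Dict[str, str]) -> List[str]:
    """Return resource names matching every filter (logical AND)."""
    unknown = set(filters) - set(SEARCHABLE)
    if unknown:
        raise ValueError(f"unsupported search fields: {sorted(unknown)}")
    # Column names are checked against SEARCHABLE; values are bound parameters.
    clauses = [f"{field} = ?" for field in filters]
    sql = "SELECT name FROM resources"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY name"
    return [row[0] for row in conn.execute(sql, list(filters.values()))]
```

Answering the query from one table is what avoids fanning out to every resource provider; platform-specific attributes would add further whitelisted columns.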

Audit Log

NRM offers audit logging capabilities by default. All write operations performed through Nuage are meticulously logged for auditing purposes. These logs are conveniently accessible through a separate audit log application. Users can effortlessly filter through a multitude of criteria including resource name, resource URI, user, call trace id, request method, and date range, providing efficient and thorough monitoring of user activity.
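
The logging and filtering behavior described above might look like the sketch below. The `AuditLog` class is hypothetical, and treating POST, PUT, PATCH, and DELETE as the write operations is an assumption; the filter fields are a subset of those named in the text.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

# Assumption: these HTTP methods count as write operations.
WRITE_METHODS = frozenset({"POST", "PUT", "PATCH", "DELETE"})


@dataclass(frozen=True)
class AuditEntry:
    resource_uri: str
    user: str
    trace_id: str
    method: str
    timestamp: datetime


class AuditLog:
    def __init__(self) -> None:
        self._entries: List[AuditEntry] = []

    def record(self, resource_uri: str, user: str, trace_id: str, method: str,
               timestamp: Optional[datetime] = None) -> bool:
        """Log the request if it is a write; return whether it was logged."""
        method = method.upper()
        if method not in WRITE_METHODS:
            return False
        when = timestamp or datetime.now(timezone.utc)
        self._entries.append(AuditEntry(resource_uri, user, trace_id, method, when))
        return True

    def query(self, user: Optional[str] = None, method: Optional[str] = None,
              since: Optional[datetime] = None,
              until: Optional[datetime] = None) -> List[AuditEntry]:
        """Filter entries by user, request method, and inclusive date range."""
        def matches(entry: AuditEntry) -> bool:
            return (
                (user is None or entry.user == user)
                and (method is None or entry.method == method.upper())
                and (since is None or entry.timestamp >= since)
                and (until is None or entry.timestamp <= until)
            )
        return [entry for entry in self._entries if matches(entry)]
```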

Validations

Nuage offers multiple ways to validate resource schemas and APIs. Basic static validations can be implemented using built-in Rest.li annotations or custom logic in a dedicated validator class. These validations are enforced for all incoming and outgoing data at the schema level.

For more dynamic scenarios, where validation depends on multiple fields, Nuage provides a framework for dynamic validations through API contracts. Resource providers implement the validate method, which is called by the Nuage Resource Manager to collect and process validation errors. These validations can be applied at both field and method levels.
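
A dynamic, multi-field validation could be sketched like this. Only the idea of a provider-implemented `validate` method comes from the text; the `TopicRequest` fields, the quota limit, and the `collect_validation_errors` function are hypothetical and do not reflect the actual Nuage contract.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ValidationError:
    field: Optional[str]  # None marks a method-level (whole request) error
    message: str


@dataclass
class TopicRequest:
    name: str
    partitions: int
    quota_mb: int
    environment: str


class KafkaTopicProvider:
    MAX_PROD_QUOTA_MB = 1024  # hypothetical limit, for illustration only

    def validate(self, request: TopicRequest) -> List[ValidationError]:
        errors: List[ValidationError] = []
        # Field-level check on a single attribute.
        if request.partitions < 1:
            errors.append(ValidationError("partitions", "must be at least 1"))
        # Cross-field check: the quota limit depends on the environment.
        if request.environment == "PROD" and request.quota_mb > self.MAX_PROD_QUOTA_MB:
            errors.append(
                ValidationError(None, "quota exceeds the PROD limit for this request")
            )
        return errors


def collect_validation_errors(provider: KafkaTopicProvider,
                              request: TopicRequest) -> List[str]:
    """Stand-in for the manager side: call validate and format every error."""
    return [
        f"{error.field or '<request>'}: {error.message}"
        for error in provider.validate(request)
    ]
```

Returning all errors at once, instead of failing on the first, lets a client fix a request in a single round trip.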

Asynchronous operations

NRM provides a consistent framework for handling asynchronous operations by enforcing asynchronous API contracts. It offers a consistent way to trigger, monitor, and manage asynchronous operations via Nuage Resource Manager, irrespective of where the workflow is triggered (NRM or resource providers) or of the underlying workflow engine (Temporal, Airflow, Helix Task Framework, ParSeq, or any custom workflow orchestrator). In addition, it provides:

  • Ability to query asynchronous operations by attributes such as which user triggered them, which resource the operation is linked to, or which time period they ran in.
  • Execution history for all async operations, retained for six months for debuggability and auditing.
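
An engine-agnostic operation record with these query and retention properties could be sketched as follows. The `OperationStore` class and its field names are hypothetical, and "six months" is approximated as 183 days.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional

# Assumption: six months is approximated as 183 days.
RETENTION = timedelta(days=183)


@dataclass
class AsyncOperation:
    operation_id: str
    resource_urn: str
    triggered_by: str
    engine: str  # e.g. "Temporal" or "Airflow"; opaque to the store
    started_at: datetime
    status: str = "RUNNING"


class OperationStore:
    """In-memory stand-in for the operation history kept by a control plane."""

    def __init__(self) -> None:
        self._ops: Dict[str, AsyncOperation] = {}

    def register(self, op: AsyncOperation) -> None:
        self._ops[op.operation_id] = op

    def complete(self, operation_id: str, succeeded: bool) -> None:
        self._ops[operation_id].status = "SUCCEEDED" if succeeded else "FAILED"

    def query(self, user: Optional[str] = None, resource_urn: Optional[str] = None,
              since: Optional[datetime] = None,
              until: Optional[datetime] = None) -> List[AsyncOperation]:
        """Filter by triggering user, linked resource, and inclusive time window."""
        results = [
            op for op in self._ops.values()
            if (user is None or op.triggered_by == user)
            and (resource_urn is None or op.resource_urn == resource_urn)
            and (since is None or op.started_at >= since)
            and (until is None or op.started_at <= until)
        ]
        return sorted(results, key=lambda op: op.started_at)

    def purge_expired(self, now: datetime) -> int:
        """Drop operations older than the retention window; return the count."""
        expired = [
            oid for oid, op in self._ops.items() if now - op.started_at > RETENTION
        ]
        for oid in expired:
            del self._ops[oid]
        return len(expired)
```

The store never inspects the `engine` field, which mirrors the idea that the tracking contract stays the same whichever orchestrator runs the workflow.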

Facilitating data governance & compliance

Data governance is a critical aspect of metadata management, where ensuring dataset discoverability and maintaining metadata integrity are essential prerequisites. At LinkedIn, data governance and compliance are enabled by DataHub.

Figure 5. MCE Pipeline via Nuage

As a control plane, Nuage ensures that metadata aspects such as schema, ownership, and status are up to date on DataHub at the time of provisioning, which also aligns with LinkedIn’s shift-left strategy. This is achieved by an automated Metadata Change Event (MCE) emission flow that runs as part of CRUD workflows.
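
Emitting metadata in the same workflow as provisioning can be sketched as below. The event layout here is a simplified, hypothetical shape and not the actual DataHub MCE schema; `build_change_event`, `create_resource`, and the `emit` callback are illustrative names.

```python
from typing import Callable, Dict, List

ChangeEvent = Dict[str, object]


def build_change_event(urn: str, operation: str, owner: str,
                       schema_fields: List[str], status: str) -> ChangeEvent:
    """Bundle the metadata aspects named in the text into one event."""
    return {
        "urn": urn,
        "operation": operation,
        "aspects": {
            "ownership": {"owner": owner},
            "schema": {"fields": list(schema_fields)},
            "status": {"value": status},
        },
    }


def create_resource(urn: str, owner: str, schema_fields: List[str],
                    emit: Callable[[ChangeEvent], None]) -> ChangeEvent:
    """Provision a resource and emit its metadata as part of the same flow."""
    # Provisioning against the underlying platform would happen here.
    event = build_change_event(urn, "CREATE", owner, schema_fields, "ACTIVE")
    emit(event)  # metadata is published as part of the workflow, not later
    return event
```

Passing the emitter in as a callback keeps the sketch testable: a list's `append` method can stand in for a real event producer.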

Control Plane Horizontal Services

ACL Management Service

ACL Management Service manages access control for LinkedIn’s resources through a user-friendly interface for handling Datavault ACLs. Its key features include ACL deployment across environments, rule creation, management of temporary and permanent access, related ACL navigation, and support for ownership changes. ACL Management Service also enables easy comparison between different ACLs and offers seamless navigation between Nuage and ACLin, streamlining access control management across platforms.

Approval Workflow Service

Approval Workflow Service introduces manual intervention into the business logic by automating approval workflows where human oversight is required. It allows resource providers to assign designated reviewers for requests, ensuring critical decisions are made with appropriate checks. Reviewers can evaluate, approve, and notify clients, who can then take the necessary actions. The system also supports delegation when key approvers are absent, ensuring uninterrupted workflow management. Additionally, Approval Workflow Service offers real-time visibility into approval statuses, tracks request history, and integrates with communication channels to streamline notifications and updates.
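
Reviewer assignment with delegation could be modeled as in this sketch. The `ApprovalRequest` class and its rules (a delegate may act only while the original reviewer is absent) are assumptions for illustration, not the service's actual behavior.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class ApprovalRequest:
    request_id: str
    requester: str
    reviewers: Set[str]
    # reviewer -> delegate who may act while that reviewer is away
    delegates: Dict[str, str] = field(default_factory=dict)
    status: str = "PENDING"
    decided_by: str = ""

    def eligible_approvers(self, absent: Iterable[str] = ()) -> Set[str]:
        """Present reviewers, plus delegates standing in for absent ones."""
        away = set(absent)
        eligible: Set[str] = set()
        for reviewer in self.reviewers:
            if reviewer not in away:
                eligible.add(reviewer)
            elif reviewer in self.delegates:
                eligible.add(self.delegates[reviewer])
        return eligible

    def decide(self, actor: str, approve: bool, absent: Iterable[str] = ()) -> None:
        if self.status != "PENDING":
            raise ValueError(f"request {self.request_id} is already {self.status}")
        if actor not in self.eligible_approvers(absent):
            raise PermissionError(f"{actor} may not decide request {self.request_id}")
        self.status = "APPROVED" if approve else "REJECTED"
        self.decided_by = actor
```

Recording `decided_by` alongside the status is what makes request history and real-time status views possible later.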

Resource monitoring and alerting

The resource monitoring and alerting service enables resource providers to create resource-specific monitoring dashboards with preconfigured alerts. Platform teams can define dashboard templates with key metrics like quota usage and read/write QPS, and set up alerts with resource owners automatically added as recipients. This process is fully automated during resource provisioning, providing users with a ready-to-use, auto-generated monitoring dashboard once provisioning is complete.
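
Template-driven dashboard generation can be sketched as follows. The template classes, metric names, and output layout are hypothetical; the only behavior taken from the text is that a platform-defined template is rendered per resource and that resource owners become alert recipients.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class AlertTemplate:
    metric: str
    threshold: float


@dataclass(frozen=True)
class DashboardTemplate:
    platform: str
    metrics: List[str]
    alerts: List[AlertTemplate]


def render_dashboard(template: DashboardTemplate, resource_name: str,
                     owners: Iterable[str]) -> Dict[str, object]:
    """Instantiate a platform template for one newly provisioned resource."""
    recipients = sorted(set(owners))
    return {
        "title": f"{template.platform}/{resource_name}",
        "panels": [
            {"metric": metric, "resource": resource_name}
            for metric in template.metrics
        ],
        # Resource owners are added as recipients of every alert.
        "alerts": [
            {"metric": alert.metric, "threshold": alert.threshold,
             "recipients": recipients}
            for alert in template.alerts
        ],
    }
```

A provisioning workflow could call `render_dashboard` as its final step, so the dashboard exists by the time the resource is handed to its owner.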

Client Interfaces

As infrastructure services evolve, teams require diverse ways to interact with the control plane, ensuring they can easily manage, provision, and modify resources. A comprehensive client interface simplifies the process for both infrastructure providers and users, reducing friction and minimizing the need for manual intervention. By offering intuitive UIs, public APIs, and CLIs, Nuage enables seamless automation, faster onboarding, and consistent operations, catering to a variety of technical and automation-centric use cases.

Nuage portal

We have improved the user experience in Nuage 3.0 by delivering more intuitive and user-friendly interfaces. Compared to previous iterations, Nuage 3.0 features:

  • Consistent design language: Nuage 3.0 offers standardized layouts backed by clear contracts for our partners. The CRUD pages follow a unified design across all platform applications, ensuring consistency. In addition to the page layouts, enhanced navigation simplifies interaction with control plane services. Users can easily view pending approvals on a resource, navigate directly to the ACL page from the resource page, track asynchronous workflows on the detail page, and much more.
  • Metrics driven UX: Nuage 3.0 has substantially improved performance and reduced error rates, driven by our observability strategy. Each Nuage 3.0 application is equipped with a Live Metrics dashboard that gives further insight into key performance and resiliency numbers. Recent UX improvements have led to a ~40% reduction in Nuage landing page load time. The new UX provides clear, concise, and actionable error messages with appropriate resolution details (next steps, who to reach out to, JIRA).
  • Low code UI onboarding: The Nuage team streamlined UX onboarding for partners by developing an in-house tool called ZenX. ZenX is a self-serve, low-code/no-code UI generation tool that simplifies the process of onboarding new applications with the Nuage 3.0 theme. It allows for easy UI customizations, even for partner teams without dedicated UI engineers, significantly reducing the workload on the Nuage team. ZenX has reduced the onboarding time for creating vanilla CRUD UIs from two weeks down to just three to four days, accelerating development and reducing toil.


Public APIs

All Nuage APIs are public APIs which LinkedIn applications can use to integrate directly with Nuage functionalities. Applications can programmatically create, update, and delete resources using the Java Rest.li client. These APIs enable seamless automation and customization, allowing teams to embed infrastructure management directly into their application workflows.

Command Line Interfaces (CLIs)

Teams are increasingly leveraging Nuage APIs to build custom CLIs for technical and automation-focused use cases. For example, Espresso, LinkedIn’s document database solution, has developed an SRE CLI tool called esretool, which allows users to efficiently create, update, promote, and delete Espresso databases.

Infrastructure as Code (IaC)

We plan to add IaC as a future client integrated with Nuage APIs, offering greater flexibility and a code-driven provisioning experience in which developers simply declare infrastructure resources in their source code.

Benefits

Agility in Partner Onboarding

A key success metric for Nuage is how easily platform teams can onboard and develop resource providers. With Nuage 3.0, zero support is required from the Nuage team to create a new resource provider. Resource provider development is now guided by Data Model and API contracts, eliminating the need to extend nuage-sdk interfaces. Additionally, a self-serve UI generator, a one-click, low-code UI tool, simplifies building UI applications without prior experience. This reduced MySQL onboarding effort by over 70%, from 12 to 2 developer months.

Clear Ownership

Infrastructure teams can now fully own their resource providers, a significant improvement over the previous architecture where the Nuage team had to manage and maintain many resource providers. Nuage 3.0 introduces a clear separation between the control plane and infrastructure-specific business logic. Nuage resource manager handles platform logic, while resource providers are dedicated solely to infrastructure-specific business logic.

Performance 

The previous architecture struggled with high UI page load times due to improper resource modeling. Nuage 3.0 resolves this by supporting parent-child relationships between resources, allowing clients to retrieve only the necessary data. This has significantly improved performance and reduced network overhead. For instance, Espresso saw over a 3X improvement in P90 latency for Read flows, dropping from 10 seconds to under 3 seconds, while Kafka experienced a nearly 2X improvement in read latency (8 to 4 seconds) and a 6X improvement in search latency (17 to 3 seconds).

Security

Nuage 3.0 manages metadata through a centralized service, Nuage Resource Manager, which has helped resolve the security issues related to shared metadata storage that existed in the previous architecture. The new architecture also helps ensure secure and reliable resource management across all environments for all clients, as the resource provider operates within each environment, preventing requests from crossing security domains. As a result, the number of firewall exceptions required for the Espresso application has been reduced from 15 to just 2.

Future Considerations

Today, Nuage supports 30+ applications, serving 4.5K monthly unique users and over 35K monthly active users, and handles 100K write operations each month, streamlining resource management at LinkedIn's scale.

Self serve onboarding

We are introducing a new Self-Serve Onboarding feature to streamline the onboarding process for Resource Providers in Nuage. Currently, setting up a basic "Hello World" application that enables CRUD flows requires manual work, including development, deployment, and infrastructure setup, which takes approximately two weeks. With the new Self-Serve onboarding feature, users can complete the entire onboarding process through an intuitive interface that collects all necessary information upfront. Once submitted, the system automatically triggers an onboarding workflow that handles everything, from infrastructure setup to deployment in EI, eliminating manual intervention and significantly reducing the time required.

Infrastructure as code

Infrastructure as Code (IaC) offers standardized, code-driven infrastructure management that simplifies resource provisioning and boosts consistency. We aim to cut developer toil by over 50%, especially in complex, multi-tool environments where users have to navigate multiple interfaces. By adopting an IaC approach, we aim to align our infrastructure development with application code development, including source code management, change review, artifact build, and versioning.

Nuage AI Assistant

Nuage assistant will be LinkedIn’s AI assistant for data infrastructure management, powered by Large Language Models (LLMs). It will leverage AI to reduce operational toil, provide developers with quick, accurate responses in under 10 seconds, and deliver insights on optimizing costs, ensuring compliance, and managing resources. By using LLM-driven reasoning to select and execute tasks, it will enhance efficiency across over 30 applications.

Acknowledgements

We extend our gratitude to the many colleagues: Adrish Banerjee, Prateek Singh, Khushboo Sangal, Arnava Agrawal, Allabakash, Amandeep Srivastava, Anshul Sharma, Dhirendra Kumar, Vivek Subramaniam, Lohitaksh Trehan, Abhijit Yadav and the entire team for their contribution, dedication and collaborative spirit all along.

Special thanks to Nishant Lakshmikanth and Ramnik Singh for their exceptional role as reviewers, offering unwavering support and insightful feedback over the years on Nuage’s architecture and its evolution, which has been crucial in shaping our work, realizing the team’s vision and mission, fostering customer adoption, keeping pace with industry trends, and acting as a pillar of strategic guidance since the team’s inception.

We would like to thank the contributions from all the individuals including our LinkedIn alumni, (names in no particular order) Mohamed Battisha, Vishal Gupta, Terry Fu, Ji Ma, Yifang Liu, Changran Wei, Yinlong Su, Darby Perez, Tyler Corley, and Micah Stubbs.

Special thanks to those who gave their time to review and edit this blog: Bhupendra Kumar Jain, Prateek Singh, and Nishant Lakshmikanth.

Huge thanks to our management, Sandeep Singhal, Kartik Paramasivam, and Arun Mahapatro, for their strong support as exemplary engineering leaders. Finally, we are grateful to our fellow data infra platform teams for all the ways in which they have supported this work.