Navigating the transition: adopting Azure Linux as LinkedIn’s operating system

As of April 2024, Azure Linux is the operating system running nearly all of LinkedIn’s servers, virtual machines, and containers. We migrated most of our fleet to Azure Linux as a key part of LinkedIn's evolution toward a modern compute stack, workload orchestration, and ML workload platforms. 

The move to Azure Linux supported two critical goals: providing a modern, secure operating system to reliably serve over 1 billion LinkedIn members worldwide; and delivering innovative new AI-powered features to members faster. Beyond these goals, other critical factors in our decision were cost-effectiveness, customization, scalability, community support, and compliance.  

In this post, we’ll detail our migration journey, including our goals, challenges, key steps in the process, and performance monitoring strategies. 

Assessing the need for change 

The end of life for our CentOS 7 operating system (OS) was a primary driving force for our move to the more modern Azure Linux distribution.  

Adopting Azure Linux fleetwide helped us address numerous technical challenges, including reducing friction for teams when migrating their apps. We could also benefit from modern security and performance features in the kernel on the latest hardware. 

Technical challenges 

As we considered a new OS, we felt it was important to continue running the same Linux distribution across most of the LinkedIn platform, as we had with CentOS 7, to provide a stable and predictable environment. However, the new operating system needed to address some key technical challenges, including:  

User space: Legacy distributions like CentOS 7 suffered from an outdated user space. Many modern apps require modern system libraries, the latest systemd features, better package management, and, most notably, better performance for certain workloads. 

Bootstrap time: Bootstrapping a large OS image took a long time, while putting pressure on network infrastructure through the repeated installation of many default packages. Adopting Azure Linux gave us the opportunity to rethink and implement a completely different approach to building and bootstrapping the OS on bare metal servers and containers. As part of the project, OS images are now pre-built with most of the key components already present in the image. Host-specific configuration is then converged by configuration management tooling after the OS image is applied (see the sketch after this list). Bootstrap time went down from over an hour to 10-30 minutes.  

Security updates: Internal compliance policies require security updates to be backported and integrated into OS images within a 30-day period. This requirement sets a high bar for how fast OS updates can be onboarded, verified, released, and deployed. An OS Upgrade Automation project helped orchestrate OS upgrades through LinkedIn MaaS (Metal-as-a-Service) with minimal developer engagement. 

Modern capabilities: The impending end of life for CentOS 7 prompted many vendors to cease support, necessitating a shift to a more modern operating system. This move aligns with various organizational needs, including cloud-native applications, containerization, and specific features that newer distributions provide out of the box. 

Vendor support: The lack of adequate community support was another reason we looked at other distributions. CentOS 8 was originally planned to receive updates until 2029, following the traditional model. However, with the shift to CentOS Stream, users felt uncertain about the project's direction and the timeline for updates. This uncertainty created some concerns about the reliability and support of CentOS as an operating system. 

Firmware updates: Modern hardware often requires firmware updates for optimal performance. The storage team, among others, sought a contemporary OS to leverage these updates. Azure Linux's adoption facilitated this, allowing for a more modern and efficient infrastructure. 
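
To make the bootstrap approach above concrete, here is a minimal sketch of the image-then-converge flow. The `apply_image` command is a hypothetical stand-in for LinkedIn's internal MaaS tooling; the Puppet flags are standard, but the sequence as a whole is an illustration, not the actual pipeline.

```python
import subprocess
import time

# Hypothetical sketch: write a pre-baked OS image to a host, then converge
# only host-specific configuration. "apply_image" stands in for internal
# MaaS tooling and is not a real CLI.

def provision(host: str, image_url: str) -> float:
    start = time.monotonic()
    # Step 1: apply the pre-built image. Most packages are already baked in,
    # so there is no per-host package installation traffic on the network.
    subprocess.run(["apply_image", "--host", host, "--image", image_url],
                   check=True)
    # Step 2: converge host-specific state (hostname, mounts, secrets) with
    # configuration management after the image is applied.
    subprocess.run(["puppet", "agent", "--onetime", "--no-daemonize"],
                   check=True)
    return time.monotonic() - start  # target: 10-30 minutes, not over an hour
```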

Business requirements 

The business requirements for the transition to Azure Linux centered around several key factors:   

Compliance: LinkedIn is subject to regulatory compliance requirements that mandate the use of secure and supported operating systems. Maintaining OS support helps LinkedIn adhere to industry standards and regulations, avoiding potential legal and financial consequences. 

Strong vendor support: Having support from the OS vendor or a reliable support provider ensures that businesses have access to expert assistance when facing technical issues. Vendor support includes help with troubleshooting, bug fixes, and general inquiries related to the operating system. 

Cost efficiencies: The cost of periodic license renewals and third-party vendor support gradually increased, while LinkedIn saw more value in its partnership with Microsoft. Cost savings was one of the key objectives. 

Robust security: Regular security updates and patches are essential to address vulnerabilities and protect systems from potential threats. OS support ensures that security patches are promptly released and applied to keep the infrastructure secure. Security considerations were important to protect member data and prevent harmful business disruption. 

Future proofing: Staying competitive means having a modern platform and tools to respond to the latest trends. Platforms like LinkedIn should be among the drivers of change, adopting modern innovations in the tech field. Making LinkedIn’s OS platform future-proof is an important investment in our ability to quickly adapt to new technology developments. 

Planning the transition 

For the migration to be a success, we needed to create strong alignment across the involved teams. Strong stakeholder collaboration is one of the cornerstones of LinkedIn’s success. We took the time to identify common trends in requirements across teams and to align work performed by partner organizations. This gave us diverse perspectives on how teams were using Linux, enabled migration ownership and commitment, aligned organizational goals, and established communication channels, among many other benefits. 

Pilot programs 

Our Infrastructure team was one of the first teams to fully migrate certain services to Azure Linux. The team oversees Core services, making it an easy choice to onboard the new OS. Additional teams were pulled into the pilot project to ensure Azure Linux would be usable on LinkedIn servers, including: 

  • The Systems Software Engineering team spearheaded Azure Linux adoption by providing package management infrastructure running on Azure Linux while allowing for a safe transition by other teams.  
  • The Information Security team assessed and made necessary updates to the security stack. 
  • The Services Infrastructure team used an experimental Azure Linux build pool to bootstrap most of the key packages.  
  • The Configuration Management team made custom configuration management modules more flexible for configuring multiple OSes. 
  • The Productivity Engineering team made Developer VMs available as an early pilot for users. 

These pilot programs identified areas for improvement, helped us document necessary changes, and allowed the team to follow in each other's footsteps so the migration could move much faster. The expertise our core team accrued helped us develop a centralized approach to address most of the challenges the teams encountered. 

Implementation 

The pilot programs kicked off our implementation process and helped us establish stability with Azure Linux. This phase involved replicating package repositories and preparing the hardware auto-provisioning systems. We pinpointed essential packages and configurations for host initialization, with most configurations transitioning to the FireBird Image Build and post-imaging Configuration Management bootstrap processes. 

The Tools team facilitated Azure Linux variant builds for applications, while the Security team updated tools for proper host initialization. This collaborative effort ensured that the necessary tooling and security measures were in place to support the new operating system across LinkedIn's infrastructure. 

The CI/CD OS Testing team played a crucial role in testing and validating Azure Linux, leading to its general availability. The final step was the mass re-imaging of servers by various teams, marking the completion of the migration process. The high-level phase representation is shown in Figure 1 below.

Figure 1. High level Azure Linux migration phases

Infrastructure preparation 

The first host was born out of a container built on a laptop. While we managed to get an Azure Linux VM internally, it was not flexible enough to let us break things over and over. A typical container runs one process at a time; enabling systemd inside the container and validating the various components required by MaaS helped us prepare an essential set of internal and external packages, such as most UCM (Unified Configuration Management) packages, a customized Puppet, SSL key management packages, and others. 
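
As an illustration of those container experiments, the sketch below shows one way to run systemd as PID 1 inside a container and check that a unit comes up. Podman's `--systemd` flag is real, but the image tag and unit name are placeholders, not the tooling we actually used.

```python
import subprocess

# Sketch of validating systemd-managed services inside a container.

IMAGE = "azurelinux-base"  # hypothetical local image that ships systemd

def start_systemd_container(name: str) -> None:
    # Run /sbin/init as PID 1 so systemd manages units as on a full host.
    subprocess.run(
        ["podman", "run", "-d", "--name", name, "--systemd=always",
         IMAGE, "/sbin/init"],
        check=True,
    )

def assert_unit_active(name: str, unit: str) -> None:
    # systemctl is-active exits non-zero when the unit is not running.
    subprocess.run(["podman", "exec", name, "systemctl", "is-active", unit],
                   check=True)

if __name__ == "__main__":
    start_systemd_container("azl-validate")
    assert_unit_active("azl-validate", "sshd.service")
```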

Our MaaS team integrated this essential set of packages into LinkedIn's MaaS automation. A few of the notable changes included moving to the systemd network configuration daemon and performing network discovery and registration. 
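
For flavor, here is a minimal sketch of what moving to systemd-networkd involves: dropping a `.network` unit into `/etc/systemd/network/`. The match rule and DHCP setting are standard networkd syntax, but generic examples rather than LinkedIn's actual network configuration.

```python
from pathlib import Path

# Generic systemd-networkd unit: DHCP on the primary interface.
NETWORK_UNIT = """\
[Match]
Name=eth0

[Network]
DHCP=yes
"""

def write_network_config(root: str = "/") -> Path:
    path = Path(root, "etc/systemd/network/10-primary.network")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(NETWORK_UNIT)
    return path  # systemd-networkd picks this up on restart
```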

Choosing the XFS filesystem was an interesting challenge: it was not originally native to Azure Linux, and neither was configuring software RAID. Based on our system tests, XFS performed better for most of our applications, with one notable exception: Hadoop. It also proved more stable: comparing the number of issues that affected LinkedIn, XFS came out ahead of EXT4.  
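
A hedged sketch of the kind of benchmarking that can inform such a decision: run the same fio workload against mount points formatted with each filesystem. The mount paths and workload shape are illustrative; a real evaluation would sweep sequential/random mixes, block sizes, and Hadoop-like patterns.

```python
import subprocess

# Run an identical fio job against two mount points (one XFS, one EXT4)
# and compare the reported throughput and latency.

def run_fio(directory: str, name: str) -> None:
    subprocess.run(
        ["fio", f"--name={name}", f"--directory={directory}",
         "--rw=randrw", "--bs=4k", "--size=1g", "--numjobs=4",
         "--time_based", "--runtime=60", "--group_reporting"],
        check=True,
    )

for fs, mount in [("xfs", "/mnt/xfs-test"), ("ext4", "/mnt/ext4-test")]:
    run_fio(mount, f"{fs}-randrw")
```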

Our MaaS team was also moving toward a modern image-based installation process, in which the bare metal OS is baked into an image file and written to disk. The Microsoft bare metal team was developing the Azure Linux Image Customizer at the time. This gave us the opportunity to align LinkedIn's distribution versions with the Microsoft release process. Using Image Customizer, we could automatically produce LinkedIn releases from Microsoft releases, layering LinkedIn customizations on top. 

LinkedIn uses various approaches for stateful and stateless applications. One of the easiest targets for migration was stateless applications. The Tools team leveraged our existing Multi-Product Variant concept to create additional Azure Linux topologies and implement deployment workflows. 

Figure 2. Teams engaged in early migration phases

Engaging teams during onboarding 

Azure Linux offered our teams a sense of familiarity mixed with novelty. Our core team delivered a series of prototype hosts, which came with a pre-set operating system, to our pilot teams. These hosts helped the teams get accustomed to the new OS, experiment with it, and enjoy the experience of discovering a modern operating system. 

The core team also extended personalized, in-depth assistance to help internal partner teams develop compatible software packages and set up operating system components according to the unique needs of different applications. To prepare engineers for the transition to Azure Linux OS, we shared insights from the pilot programs during technical talks, team meetings and casual office conversations. 

The transition significantly improved our deployment speed and system reliability, directly enhancing our ability to innovate and respond to market demands. The seamless integration with familiar tools boosted productivity, while extensive help from the Azure Linux support team allowed us to minimize downtime. As a result, we’ve strengthened trust and confidence in our engineering capabilities across our organization, which helps us make the case for future technological advancements and gives us a competitive edge in our operations. 

Data migration 

A sizable portion of the applications at LinkedIn are stateless, and our deployment platform supports seamlessly migrating them to pools of hosts running Azure Linux. Once an application supported the new OS, OS upgrade automation moved it to Azure Linux hosts and subsequently reimaged hosts that no longer had any applications deployed. 
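
That stateless flow reduces to a simple loop, sketched below with illustrative types and callbacks rather than LinkedIn's actual deployment APIs: move supported applications onto Azure Linux pools, then reimage the hosts left empty.

```python
from dataclasses import dataclass, field

# Illustrative sketch; Host and the move/reimage callables are stand-ins
# for internal deployment and MaaS APIs.

@dataclass
class Host:
    name: str
    os: str
    deployed_apps: list[str] = field(default_factory=list)

def migrate_stateless(apps, hosts, supports_azure_linux, move, reimage):
    # Redeploy every application that already supports the new OS.
    for app in apps:
        if supports_azure_linux(app):
            move(app, target_os="azure-linux")
    # Reimage legacy hosts that no longer run anything.
    for host in hosts:
        if host.os != "azure-linux" and not host.deployed_apps:
            reimage(host, os="azure-linux")
```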

Stateful applications typically have data storage partitions separate from OS partitions. The MaaS team implemented a feature to preserve data partitions, while application teams ensured all the required components were present in Azure Linux. The MySQL database migration is one example where many packages had to be rebuilt to support the new OS. 

In some cases, teams had to refactor their application design to minimize downtime. For example, a DNS-based failover method had to be replaced with a better-suited one based on application topology. 

Overcoming technical implementation challenges 

Encountering technical obstacles is a natural aspect of any migration process. Below, we’ll discuss our strategies for addressing some of these hurdles. For instance, when data compatibility issues arose, we tailored custom scripts to ensure seamless data integration. Proactive communication with our Azure Linux support team allowed for swift identification and resolution of system discrepancies.  

Change management 

Azure Linux onboarding significantly increased the velocity and number of changes introduced to our production infrastructure. The migration highlighted a challenging aspect of our existing change management process: the need to reduce changes that span multiple data centers or multiple groups of application services, which we call “global changes.” These were identified as one of the main causes of service disruptions. 

To ensure the quality of code and application deployment processes, LinkedIn engineering teams used diverse types of integration testing for applications and regular load testing to capture the behavior of critical user-facing components under stress. To mitigate accidental global changes, we introduced an initiative that strictly rejected changes that might result in global impact. This forced teams to plan gradual transitions, rolling out to a percentage of the deployment fleet or limiting the scope to individual data centers. 
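
The sketch below shows what such a guard might look like, assuming a change request records the data centers it touches and the fraction of the fleet affected. The threshold and data center inventory are illustrative, not LinkedIn's actual policy engine.

```python
from dataclasses import dataclass

ALL_DATACENTERS = {"dc1", "dc2", "dc3", "dc4"}  # example inventory
MAX_FLEET_FRACTION_PER_STEP = 0.10              # e.g., 10% canary steps

@dataclass
class ChangeRequest:
    datacenters: set[str]   # data centers the change touches
    fleet_fraction: float   # share of all hosts affected, 0.0-1.0

def validate(change: ChangeRequest) -> None:
    # Reject anything that would land everywhere at once.
    if change.datacenters >= ALL_DATACENTERS:
        raise ValueError("rejected: change spans every data center (global)")
    if change.fleet_fraction > MAX_FLEET_FRACTION_PER_STEP:
        raise ValueError("rejected: change exceeds per-step fleet percentage")
```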

Containers 

Even after getting Azure Linux running on hardware machines, building containers that fulfilled LinkedIn’s container security requirements proved challenging. 

We used the Microsoft-provided base image in the early adoption stages to build required packages, but LinkedIn production containers cannot use base images created outside of LinkedIn.  

We used the Container Image Builder tool, running on RHEL7 hosts, to create CentOS and RHEL images. Since the tool relied on the OS package repository database for container builds, it was incompatible with the Azure Linux repository. To build the first container base image with Azure Linux, we converted the package repository database during image creation. 
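
One plausible shape of that conversion, assuming the older builder expected yum's sqlite databases while the Azure Linux repository published XML-only repodata, is to generate the sqlite metadata at image-creation time with `sqliterepo_c` (from the createrepo_c project). The repo path is a placeholder.

```python
import subprocess

# Generate yum-style sqlite databases from existing XML repodata so an older
# image builder can consume the Azure Linux repository.

def add_sqlite_metadata(repo_path: str) -> None:
    # Reads <repo_path>/repodata/repomd.xml and writes *.sqlite databases
    # alongside the existing XML metadata.
    subprocess.run(["sqliterepo_c", repo_path], check=True)
```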

Once the base image was created, automation was able to build different container flavors based on the OS variant defined for a given application. 

Hardware drivers 

DKMS (Dynamic Kernel Module Support) was commonly used with the legacy distributions at LinkedIn, providing several advantages such as automatic rebuilds for new kernel versions, customization, and distribution independence.  

However, since the Azure Linux kernel requires drivers to be signed by Microsoft, DKMS would not work. Because signed drivers improve security, we worked with Microsoft to make drivers available for all the hardware SKUs used at LinkedIn. Now, LinkedIn relies on upstream Microsoft drivers built for Azure Linux. 
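
A small sketch of how one might audit this property, using `modinfo`'s field query to read the signer of each loaded module; signed modules report a signer string, unsigned ones report nothing. The audit logic is illustrative, not LinkedIn's tooling.

```python
import subprocess

# Flag any loaded kernel module that carries no signer field.

def module_signer(module: str) -> str:
    result = subprocess.run(
        ["modinfo", "-F", "signer", module],
        capture_output=True, text=True,
    )
    return result.stdout.strip()

def unsigned_loaded_modules() -> list[str]:
    # /proc/modules lists loaded modules; the first column is the name.
    with open("/proc/modules") as f:
        loaded = [line.split()[0] for line in f]
    return [m for m in loaded if not module_signer(m)]

if __name__ == "__main__":
    for module in unsigned_loaded_modules():
        print(f"unsigned module loaded: {module}")
```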

Azure Linux developer VMs 

The Productivity Engineering (PE) and Dev team did an outstanding job building the dev pipeline for Azure Linux.  

Traditionally, LinkedIn hosted developer VMs with the OS at parity with production. This used to be a full-fledged CentOS desktop VM with a window manager, accessible over RDP and SSH.  

Transitioning to Azure Linux meant we would lose GUI access, since Azure Linux does not currently support a window manager. The solution was to remotely connect IDEs (integrated development environments), such as IntelliJ and Visual Studio Code, to the Azure Linux Developer VMs. The VMs are deployed, via a self-service tool, across four geographical regions closest to the users requesting them. VMs with GPUs are provisioned on custom request. 

Since Azure Linux Marketplace images are owned by Microsoft, we could use them directly, eliminating the need to maintain an OS image build pipeline. All LinkedIn customizations are then applied on top using Puppet. 

We also used this opportunity to leverage Systems Software Engineering’s yum repositories for patches and the LinkedIn Developer Tools (GULL) distribution. This eliminated the duplicate RPM package repositories previously maintained by PE. 

The Azure Linux Developer VMs are patched automatically within a 30-day patch cycle. 

During the pilot phase, early adopters could run Azure Linux Developer VMs, and we gathered and addressed feedback before going to general availability (GA). Today, all VMs with legacy OSes have been gracefully shut down and, after a delay, deleted. We have fully transitioned out of CentOS Developer Desktop VMs and are currently running a fleet of over 1.5k (and growing) Azure Linux Developer VMs. 

Remote development 

Remote development (RDev) is a service that enables engineers to use containers for their application development tasks, giving developers a production-like development experience. Providing an Azure Linux RDev experience equivalent to CentOS/RHEL was paramount for developer confidence in migrating to Azure Linux.  

The RDev container is self-contained, with all the components required for an application to function in a dev environment. The RDev service takes a container base image and applies customizations to make it suitable for developer workflows. We had to make the Azure Linux base image compatible with this workflow, and we hit several issues building the Azure Linux RDev image: missing RPM packages, an older container runtime that prevented the RDev container from running, and build tools that did not work inside the image. 

Providing a functional Azure Linux RDev early, before mass migration started, helped maintain developer velocity through the migration and preserve the quality of applications running on Azure Linux. 

Technical assistance 

Introducing Azure Linux into LinkedIn was a big endeavor that would have been impossible without high quality technical guidance for the new OS. We embraced Azure Linux support in various ways ranging from teaming up with Microsoft OS groups, to initiating support forums. The core team provided comprehensive documentation that we incorporated into our internal knowledge base. This allowed our engineers to quickly familiarize themselves with the new system and utilize its full potential, ensuring a smooth and efficient integration into production use. 

The core team also engaged subject matter experts in different fields to provide necessary support on issues such as rebuilding vendor specific libraries and applications, tuning memory consumption, filesystem performance, kernel tuning and automated configuration of a completely new network stack. 

We increased automation efforts to address the growing volume of application updates. Python-based applications, being the most prevalent, required substantial module upgrades, necessitating migration to newer versions of Python and upgraded SSL libraries. We put automation in place to mass-upgrade repositories for the most common cases; for more complex situations, the tools support team offered exemplary assistance. 
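
As an illustration of what mass-upgrade automation for the common case might look like, the sketch below bumps a pinned dependency across many repositories' `requirements.txt` files. The package name `ssl-lib` is hypothetical, and real automation would open code reviews rather than edit files in place.

```python
import re
from pathlib import Path

# Bump every exact pin of a (hypothetical) "ssl-lib" package to a new version
# across all requirements.txt files under a checkout root.
PIN = re.compile(r"^(ssl-lib)==[\d.]+$", re.MULTILINE)

def bump_pin(repo_root: str, new_version: str) -> int:
    changed = 0
    for req in Path(repo_root).rglob("requirements.txt"):
        text = req.read_text()
        updated = PIN.sub(rf"\1=={new_version}", text)
        if updated != text:
            req.write_text(updated)
            changed += 1
    return changed  # number of files rewritten
```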

Monitoring and continuous improvement 

Performance monitoring 

While transitioning to Azure Linux, LinkedIn saw an unprecedented transformation of internal monitoring tooling. Outdated monitoring infrastructure was phased out in favor of a more contemporary setup that united both software and hardware monitoring into a single, streamlined interface. 

The new monitoring and performance observability stack improved the quality of our application performance analysis. Routine evaluations indicated improvements in some areas while revealing reduced performance in others, underscoring the need for tuning. For instance, we found that XFS’s efficiency on RAID setups took a hit due to the default storage allocation configuration. By conducting thorough storage benchmarks and tweaking the XFS settings, we were able to fine-tune and enhance our storage performance. 
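
The RAID fix comes down to stripe alignment. A hedged sketch, with illustrative geometry rather than LinkedIn's production values: tell mkfs.xfs the RAID chunk size (stripe unit, `su`) and the number of data disks (stripe width, `sw`) so allocation aligns with the underlying stripes.

```python
import subprocess

# Format a device with XFS aligned to software-RAID geometry.

def make_aligned_xfs(device: str, chunk_kib: int, data_disks: int) -> None:
    subprocess.run(
        ["mkfs.xfs", "-d", f"su={chunk_kib}k,sw={data_disks}", device],
        check=True,
    )

# Example: a 10-disk RAID6 (8 data disks) with 256 KiB chunks.
# make_aligned_xfs("/dev/md0", chunk_kib=256, data_disks=8)
```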

The adoption of new tools for monitoring, such as the inQuery and inLogs platforms, proved invaluable during the migration period by providing additional data points and logs that facilitated the troubleshooting and fine-tuning of different operating system components. 

Feedback loop 

We established several channels to capture feedback from our Azure Linux users. Teams could discuss their challenges during periodic meetings with our core team SMEs, though the liveliest discussions unfolded in the channel we dedicated to the Azure Linux migration. 

Every topic of discussion was captured in Jira and categorized. We performed periodic trend analysis; the data results helped us adjust our migration pace. After 95% of the migration was complete, we performed a post-migration analysis. We conducted “lessons learned” sessions to help us formulate additional requirements for major OS upgrades in the future. 

Migration in numbers 

To monitor our migration, we used a common metric: the number of hosts running Azure Linux. Thanks to the OS Upgrade Automation implemented before the move to Azure Linux, the migration itself proved straightforward.  
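
The metric itself is simple, as the minimal sketch below shows; the inventory mapping is a stand-in for whatever source of truth tracks host OS versions.

```python
# Share of the fleet already running Azure Linux, as plotted in Figure 3.

def migration_progress(inventory: dict[str, str]) -> float:
    """inventory maps hostname -> OS name; returns percent migrated."""
    if not inventory:
        return 0.0
    migrated = sum(1 for os_name in inventory.values()
                   if os_name == "azure-linux")
    return 100.0 * migrated / len(inventory)
```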

Figure 3. Azure Linux migration trend, in % of the fleet

Conclusion 

The migration of LinkedIn’s fleet to Azure Linux was a strategic decision that entailed numerous considerations and challenges. Its successful execution yielded substantial benefits ranging from cost savings to enhanced security and flexibility. We achieved both critical goals: providing a modern, secure operating system to reliably serve LinkedIn members worldwide, and delivering innovative new AI-powered features to members faster. 

By embracing open-source solutions, LinkedIn, in partnership with Microsoft, harnessed the power of community-driven innovation and unlocked new levels of efficiency, agility, and competitiveness. Nevertheless, careful planning, comprehensive training and ongoing support were essential to making the transition smooth and maximizing the long-term value of the migration. 

Acknowledgements 

None of this work would have been possible without the contributions of many individuals. Below are the top collaborators on the project that brought Azure Linux from Microsoft to LinkedIn. 

The seeds for choosing Azure Linux were planted when LinkedIn's Tim Crofts and Franck Martin were invited to discuss Linux fleet management with Kay Williams, a member of Microsoft's Linux Systems Group. 

LinkedIn Team: 

Executive Sponsors: Bruno Connelly, Neil Pinto, Milind Talekar 
Linux Strategy Leaders: Nick Berry, Franck Martin 
Linux Build Leader: Andreas Zaugg 
Systems Infrastructure: Zaheer Shaikh, Tim Crofts, Cliff McIntire, Ievgen Priadka, Maanas Alungh, Harish Shetty, Riya Agarwal, Vaibhav Singh Gour, Harpreet Lalwani, Rishika Wadhera, Sreegopal P, Pawan Pandey 
Hardware Provisioning: Nitin Sonawane, Rohit Jamuar, Bubby Rayber, Phincy Leo Pious 
Configuration Management: Martin Minkus, Lovell Felix 
Testing/Deployment: Ramadass Venkadasamy, Adam Debus, Rob West 
Dev and Tools: Dan Hicks, Samir Tata, Sweekar Pinto 
Hardware Certification: Kyle Reid, David Mackey 
TPM: Sameer Makada, Padmaja Mummaneni, Sean Patrick 

Microsoft Team: 

LinkedIn would like to thank the Azure Linux team at Microsoft, notably the following individuals (in no particular order): Chris Co, Allen Pais, Rachel Menge, Deepu Thomas, Adelaida Amatangelo, Daniel McIlvaney, Henry Li, Adithya Jayachandran, Frank Swiderski, Ravi Rao, Brian Telfer, Jon Slobodzian, Jim Perrin, Kang Su Gatlin, Suresh Babu Chalamalasetty, Tyler Hicks, Roaa Sakr, Chris Gunn, Neha Agarwal, Krishna Ganugapati.