Running Operations: The NOC Way vs. the SRE Way
Venky Chennapragada
Principal DevOps, Cloud, Agile, TOGAF 9 Certified Enterprise Architect, Capgemini
Anyone with some exposure to Operations knows what a NOC is, but for everyone else: it stands for Network Operations Center. It is the old way of putting all operations staff in one room: server, storage, network, database, and application teams (my apologies if I left anyone out). There could be tens or even hundreds of people in that room, and there could be NOCs in several geographies or centers. I have seen these folks organized into tiers of support, and even the seating follows the chain of command! Tier-1 sits at the bottom, close to the large monitors displaying colored alerts (green, yellow, and red) on the state of every element of the IT landscape: network, storage, server, database, and application. They are the first responders. They are supposed to open a ticket in an ITSM tool (Tivoli or Remedy in the old days, perhaps ServiceNow now) and attempt a fix within an SLA, based on their own knowledge or the knowledge base. Most of these folks are freshers whom an IT service provider uses to fill the room in its offshore center. When a ticket gets close to the upper end of the SLA, they toss it to Tier-2, the folks sitting a step above them, who are supposed to resolve it within their own SLA timelines or toss it to Tier-3, sitting a few steps above them. The Service Management Managers sit on top of everyone and are the most anxious people in the room, running an incident bridge, running from step to step to get updates, and filling in their bosses on status and ETA.
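The tier hand-offs described above can be sketched as a small function. This is only an illustration: the SLA windows, the `Ticket` fields, and the name `owning_tier` are assumptions of mine, and the sketch escalates exactly at the SLA boundary rather than "close to" it.

```python
from dataclasses import dataclass

# Hypothetical per-tier SLA windows in minutes; real contracts will differ.
TIER_SLA_MINUTES = {1: 30, 2: 60, 3: 120}


@dataclass
class Ticket:
    ticket_id: str
    minutes_open: int  # time elapsed since the ticket was opened


def owning_tier(ticket: Ticket) -> int:
    """Return the tier currently holding the ticket.

    Each tier keeps the ticket for its own SLA window and then tosses it
    up one level. The highest tier is the last stop, so it keeps the
    ticket no matter how long it has been open.
    """
    elapsed = ticket.minutes_open
    tiers = sorted(TIER_SLA_MINUTES)
    for tier in tiers[:-1]:
        if elapsed < TIER_SLA_MINUTES[tier]:
            return tier
        # This tier's window is used up; pass the remainder to the next tier.
        elapsed -= TIER_SLA_MINUTES[tier]
    return tiers[-1]
```

With these assumed windows, a ticket open for 10 minutes is still with Tier-1, one open for 45 minutes has moved to Tier-2, and anything past 90 minutes sits with Tier-3.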
After an incident is resolved there is a sigh of relief, but then comes the investigating team that ITSM calls problem management. They are tasked with finding the root cause of the incident. Any incident investigation ends up blaming hardware or human error and is closed accordingly. If it is human error, I pity the person and pray that he or she lives for another day on the project or client, since in most cases he or she is made the scapegoat and the root cause is closed with one more verification step added to a checklist.
In the days of digital IT, does setting up a NOC make sense for running operations? For running your cloud operations, who do you need to put in a room 24x7? And can you find people willing to sit in a room across three shifts? Will SRE teams solve this problem? What is the future of operations?
The Site Reliability Engineering (SRE) team concept was born at Google for running its operations and has caught on at other large organizations, especially those with mature DevOps practices. So, is the new face of Operations teams the SRE team? Is the SRE team concept a silver bullet for all operations issues? Will these teams ensure high availability of your apps and environments? What is the difference between DevOps and SRE? Too many questions.
DevOps names a lot of "continuous" terms: Continuous Integration, Continuous Testing, Continuous Delivery, and Continuous Deployment; I would call it Continuous *.*. And then there is Continuous Operations, which is where I place the SRE role. When we go through the rigor of DevOps and test thoroughly from unit tests to UAT, we are theoretically supposed to pass zero defects into production. That is the ideal case. Most production incidents arise from defects that slip into production or from changes implemented in production. With shift left, shift right, and zero defects carried into production, the DevOps pundits preach that we do not need Operations at all. Have you heard of NoOps? SREs, the newly introduced role, are the old operations guys with automation skills. They now have license to spend time in the development and testing phases in the spirit of collaboration and sharing. But please be aware that the old habit of tossing your app or service over to the SRE team once you are done is gone; that belonged to the days when a big Berlin Wall stood between development and operations. SRE teams work constantly on the operations side to bring automation to the IaaS and PaaS layers and to support stable apps and services.
Here is how a handoff happens between a good development team and an SRE team. After its service is put into production, the development team is responsible for managing it there for a certain period of time. How long depends on how stable the app or service is; it could be one month or six. SRE teams take over stable apps or services, for which monitoring and operations are fully automated. It is almost as if these apps or services are on cruise control. In a sense, SRE teams are the old Tier-3 teams, called in for high-level consulting and support, while Tier-1 and Tier-2 tasks are fully automated.
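One way to picture that handoff is as an explicit readiness check. The sketch below is an assumption on my part, not a standard: the thresholds, the `Service` fields, and the name `ready_for_sre_handoff` are all hypothetical, and each organization would pick its own definition of "stable".

```python
from dataclasses import dataclass

# Hypothetical stability thresholds. The only guidance in the text is that
# the development team may own the service for one to six months.
MIN_DAYS_IN_PRODUCTION = 30
MAX_INCIDENTS_LAST_30_DAYS = 0


@dataclass
class Service:
    name: str
    days_in_production: int
    incidents_last_30_days: int
    monitoring_automated: bool  # are monitoring and operations automated?


def ready_for_sre_handoff(service: Service) -> bool:
    """Return True when the development team may hand the service to SRE.

    The service must have run in production long enough, been quiet
    recently, and have its monitoring and operations automated.
    """
    return (
        service.days_in_production >= MIN_DAYS_IN_PRODUCTION
        and service.incidents_last_30_days <= MAX_INCIDENTS_LAST_30_DAYS
        and service.monitoring_automated
    )
```

A service that is three months old but still generating incidents stays with its developers under this check, which is the point: stability, not the calendar, drives the handoff.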
Now what replaces the NOC? New-age collaboration tools like Slack and Teams let these individuals work from any location rather than being sequestered in one room. We do not need an army of Tier-1 folks poring over colored alerts on large monitors; instead, good monitoring tools such as AppDynamics, Dynatrace, New Relic, Splunk, or ELK can analyze the data and send the relevant alert, with the relevant information, to the person on call for that problem. It could be as specific as the line of code or the SQL statement that caused the incident. Only that DBA or programmer needs to wake up and fix the problem; we do not need the Service Management Managers waking up the whole village, inviting everyone to an incident bridge, and running it to eternity.
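The routing idea can be sketched in a few lines. This is a minimal illustration and not the API of any of the tools named above: the roster, the `Alert` fields, and `route_alert` are hypothetical names, and a real setup would pull the roster from an on-call scheduling system.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical on-call roster: component -> person currently on call.
ON_CALL: Dict[str, str] = {
    "database": "dba_on_call",
    "application": "dev_on_call",
    "network": "network_on_call",
}


@dataclass
class Alert:
    component: str  # which part of the landscape raised the alert
    severity: str   # "green", "yellow", or "red"
    detail: str     # e.g. the offending SQL statement or line of code


def route_alert(
    alert: Alert, roster: Dict[str, str] = ON_CALL
) -> Optional[Tuple[str, str]]:
    """Return (person, message) for the one person to page, or None.

    Only red alerts page anyone, and only the owner of the affected
    component is paged; nobody else is woken up.
    """
    if alert.severity != "red":
        return None
    person = roster.get(alert.component)
    if person is None:
        # No owner on the roster; a real system would need a fallback here.
        return None
    message = f"[{alert.severity.upper()}] {alert.component}: {alert.detail}"
    return person, message
```

For example, `route_alert(Alert("database", "red", "slow query on orders table"))` pages only the DBA on call, while a yellow alert pages no one.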
Of course, as purists of ITSM and ITIL expect, we will still open a ticket in ServiceNow, by machine or by hand, for any change in production, close it upon resolution, and break no rules.
We will still build aggregate dashboards of IT metrics for management to pore over, since they sign our paychecks and want to know what we have been doing, whether automated or manual.
So, in the days of DevOps, Cloud, SRE, Digital, and AI/ML, do we need NOCs?
Senior Manager, Technical Production
8 months ago: I think you are oversimplifying this a bit. The way you describe a system in production requires a lot of domain knowledge on the development side about how cloud systems work. I agree that SRE helps a lot with automating many of the alerts, and that using AWS or GCP native alerts and systems reduces the need for a complex NOC. But when you have a complex system of multiple applications running across the world (say, at a big game development company like EA) without a consistent level of software hygiene in development, having a NOC is a godsend for older projects as well as a safety net for new ones. The future is a mix of NOCs and SRE best practices.
DevOps Engineer | Immediate joiner | Docker, Kubernetes, Terraform, Jenkins, Git, Ansible, AWS, Azure | Specializing in CI/CD, Cloud Infrastructure, and Automation | Driving Efficiency and Scalability
1 year ago: I'm a NOC engineer, and the question you asked at the end of the article, "Do we need NOCs?", is a valid one. My answer is 80% yes and 20% no.