
Running Operations: NOC Vs SRE way

Venky Chennapragada

Principal DevOps, Cloud, Agile, TOGAF 9 Certified Enterprise Architect, Capgemini

Published: January 16, 2020

Anyone with some exposure to Operations knows what a NOC is, but for others let me expand: it is a Network Operations Center. It is the old way of putting all operations staff in one room: server, storage, network, database and application teams, and I apologize if I left anyone out. There could be tens or even hundreds of them in a room, and there could be NOCs in different geographies or centers. I have seen these folks organized into tiers of support, and even the seating follows the hierarchy of command! Tier-1 sits at the bottom, close to the large monitors displaying colored alerts (green, yellow and red) on the state of every element of the IT landscape: network, storage, server, database and application. They are the first responders. They are supposed to open a ticket in an ITSM tool (Tivoli or Remedy in the old days, maybe ServiceNow now) and attempt a fix from their own knowledge or the knowledge base within some SLA. Most of these folks are freshers with whom an IT service provider fills the room in its offshore center. When a ticket gets close to the upper end of the SLA, they toss it to Tier-2, the folks sitting a step above them, who are supposed to resolve it within their own SLA timelines or toss it to Tier-3, sitting a few steps above them. The Service Management Managers sit on top of all and are the most anxious people: running an incident bridge, running from step to step to get updates, and filling in their bosses on status and ETA.
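The tiered handoff described above can be sketched in a few lines of Python. This is only an illustration: the SLA budgets, the 90% escalation threshold, and the names `Ticket` and `tick` are assumptions made for the example, not values from any real ITSM tool or contract.

```python
from dataclasses import dataclass

# Hypothetical per-tier SLA budgets in minutes; real contracts will differ.
TIER_SLA_MINUTES = {1: 30, 2: 60, 3: 120}
TOP_TIER = 3


@dataclass
class Ticket:
    summary: str
    tier: int = 1              # every ticket starts with the Tier-1 first responders
    minutes_in_tier: int = 0   # time spent at the current tier


def tick(ticket: Ticket, elapsed_minutes: int, escalate_at: float = 0.9) -> Ticket:
    """Advance the clock and toss the ticket up a tier when it nears its SLA."""
    ticket.minutes_in_tier += elapsed_minutes
    budget = TIER_SLA_MINUTES[ticket.tier]
    near_sla = ticket.minutes_in_tier >= escalate_at * budget
    if near_sla and ticket.tier < TOP_TIER:
        ticket.tier += 1
        ticket.minutes_in_tier = 0  # the next tier gets its own SLA clock
    return ticket


ticket = Ticket("database not responding")
tick(ticket, 10)   # still well inside the Tier-1 budget
tick(ticket, 20)   # 30 minutes used, past 90% of the budget: goes to Tier-2
print(ticket.tier)
```

Note that nothing in this loop fixes the problem; it only moves the ticket, which is much of what the rest of this article complains about.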

After an incident is resolved there is a sigh of relief, but then comes the investigating team that ITSM calls problem management. They are tasked with finding the root cause of the incident. Almost every investigation ends up blaming hardware or human error and is closed accordingly. If it is human error, I pity the guy and pray that he or she lives another day on the project or client, as in most cases he or she is made the scapegoat and the root cause is closed with one more verification step added to a checklist.

In the days of digital IT, does setting up a NOC make sense for running operations? For running your Cloud operations, who do you need to put in a room 24x7? And do you get resources who are willing to sit in a room in 3 shifts? Will SRE teams solve this problem? What is the future of operations?

The Site Reliability Engineering (SRE) team concept was born at Google for running their operations and has caught on at other large organizations, especially those with mature DevOps practices. So, is the face of the new Operations team the SRE team? Is the SRE team concept a silver bullet for solving all operations issues? Will these teams ensure high availability of your apps and environments? What is the difference between DevOps and SRE? Too many questions.

DevOps names too many continuous terms: Continuous Integration, Continuous Testing, Continuous Delivery and Continuous Deployment, and I would say Continuous *.*. And there is Continuous Operations. I put the SRE role under Continuous Operations. When we go through the rigor of DevOps and test thoroughly from unit to UAT, we are theoretically supposed to pass zero defects into production. That is the ideal case. Most production incidents arise from defects that are passed into production or from changes that are implemented in production. With shift left, shift right and zero defects carried to production, the pundits of DevOps are preaching that we do not need Operations at all. Have you heard of NoOps? SREs, the newly introduced roles, are the old operations guys with automation skills. These guys now get a license to spend time in the development and testing phases in the spirit of collaboration and sharing. But please be aware that the old habit of tossing your app or service over to the SRE team once you are done is gone; that belonged to the old days, when there was a big Berlin Wall between development and operations. SRE teams are constantly working on the operations side to see how to bring automation to the IaaS and PaaS layers and to support stable apps and services.

Here is how a handoff happens between good development and SRE teams. After their service is put in production, the development team is responsible for managing it there for a certain period of time. How long depends on how stable the app or service is; it could be a month or six months. SRE teams take over stable apps or services, and monitoring and operations are fully automated for them. It is almost as if these apps or services are in cruise control mode. In some sense SRE teams are the old Tier-3 teams that are called in for high-level consulting and support, while Tier-1 and Tier-2 tasks are fully automated.
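One way to make "stable enough to hand over" concrete is a simple readiness check. The sketch below is an assumption throughout: the 30-day minimum, the zero-incident threshold, and the names `ServiceRecord` and `ready_for_sre_handoff` are invented for illustration, and each organization would agree on its own criteria.

```python
from dataclasses import dataclass


@dataclass
class ServiceRecord:
    name: str
    days_in_production: int        # how long the dev team has run it in production
    incidents_last_30_days: int    # recent production incidents


def ready_for_sre_handoff(
    svc: ServiceRecord,
    min_days: int = 30,        # hypothetical minimum soak period
    max_incidents: int = 0,    # hypothetical stability threshold
) -> bool:
    """Return True when the service looks stable enough for SRE to take over."""
    soaked = svc.days_in_production >= min_days
    stable = svc.incidents_last_30_days <= max_incidents
    return soaked and stable


print(ready_for_sre_handoff(ServiceRecord("billing", 45, 0)))
print(ready_for_sre_handoff(ServiceRecord("search", 10, 0)))
```

A service that fails the check simply stays with its development team for longer, which matches the "a month or six months" range above.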

Now what replaces the NOC? New-age collaboration tools like Slack and Teams let these individuals work from any location, so they do not have to be sequestered in one room. We do not need an army of Tier-1 folks poring over colored alerts on large monitors. Instead, good monitoring tools like AppDynamics, Dynatrace, New Relic, Splunk or ELK can analyze the data and send the relevant alert, with the relevant information, to the person on call for that problem. It could be as specific as the line of code or the SQL statement that caused the incident. Only that DBA or programmer needs to wake up and fix the problem; we do not need the Service Management Managers waking up the whole village, inviting them to an incident bridge and running it to eternity.
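The routing idea in the paragraph above, where only the owner of the failing component is woken up, might look roughly like this. It does not use the API of any of the monitoring tools named; the `Alert` shape, the `ON_CALL` roster and the `route` function are hypothetical names chosen for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    component: str   # e.g. "db", "app", "network"
    severity: str    # "green", "yellow" or "red"
    detail: str      # e.g. the SQL statement or code location that failed


# Hypothetical on-call roster: one owner per component.
ON_CALL = {"db": "dba-on-call", "app": "dev-on-call"}


def route(alert: Alert, default: str = "sre-on-call") -> Optional[str]:
    """Return who to page for this alert, or None if nobody needs waking."""
    if alert.severity != "red":
        return None  # green and yellow stay on the dashboard
    # Page only the owner of the failing component, not the whole village.
    return ON_CALL.get(alert.component, default)


print(route(Alert("db", "red", "SELECT * FROM orders is scanning the full table")))
print(route(Alert("app", "yellow", "latency slightly elevated")))
```

Components without a named owner fall back to the SRE on call, which mirrors the Tier-3 role described earlier.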

Of course, as purists of ITSM and ITIL expect, we will still open a ticket in ServiceNow, by machine or by man, for any change in production, close it upon resolution, and break no rules.

We will still build aggregate dashboards of IT metrics for management to pore over, since they sign our paychecks and need to know what we have been doing, the automated or the manual way.

So, in the days of DevOps/Cloud/SRE/Digital/AI-ML, do we need NOCs?

Clive Henrick

Senior Manager, Technical Production

8 months ago

I think you are simplifying this a bit much. How you describe a system in production requires a lot of domain knowledge on the development side of how the cloud systems work. I agree that SRE helps a lot with the automation of many of the alerts, and by using AWS or GCP native alerts and systems, the need for a complex NOC is reduced. But when you have a complex system of multiple applications running across the world (say, a big game development company like EA) without a consistent level of software hygiene in development, having a NOC is a godsend for older projects as well as a safety net for new projects. The future is a mix of NOCs and SRE best practices.

Harmeesh Singh A.

DevOps Engineer | Immediate joiner | Docker, Kubernetes, Terraform, Jenkins, Git, Ansible, AWS, Azure | Specializing in CI/CD, Cloud Infrastructure, and Automation | Driving Efficiency and Scalability

1 year ago

I'm a NOC engineer, and it's a fair question you asked at the end of the article: "Do we need NOC?" My answer is 80% yes and 20% no.

