#1 What's Site Reliability Engineering [SRE] | Roles & Responsibilities | Technologies involved

Tharun Moorthy

Site Reliability Engineer at PhonePe | Blog writer

Published: January 8, 2022

Site reliability engineering

Site Reliability Engineering, popularly known as SRE, is a role in computer science and engineering whose main purpose is to provision, maintain, monitor, and manage infrastructure so that applications achieve maximum uptime and reliability. SRE is an emerging role, but the tasks an SRE performs have existed ever since the first application was developed. The scope of the software developers ends once they have written the code for the application. Everything after that is the duty of an SRE: setting up the infrastructure and the services that run on it, providing the required network connectivity, providing a platform for the application to run on, and making sure every part of the application is up and running reliably 24x7. In fact, we can consider Site Reliability Engineers the strong bridge between the users and a reliable application.

Now, in order to explain the different responsibilities of an SRE, I have divided them into four categories. I have always seen SRE this way, and definitely not as some ad-hoc process. The four categories into which I would classify the tasks of a Site Reliability Engineer are:

1. Creation

2. Monitoring

3. Management

4. Destruction

Let's dive deep into each one of them.

Creation

1. Provision virtual machines / PXE bare-metal servers

SREs are responsible for provisioning virtual machines with the requested resources: CPU, memory, disks, network configuration, and operating system. When a bare-metal server needs to be set up, it is provisioned with the provided configuration as well. SREs use Linux commands and automation scripts to provision servers as quickly as possible. They are also responsible for being rack aware during provisioning, that is, for knowing which rack each machine sits in so that the instances of one service do not all end up in the same rack. Example operating systems include Ubuntu, CentOS, and Windows.
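To make rack awareness a little more concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not how any particular provisioning tool works: the rack names and the `pick_rack` and `place_replicas` helpers are hypothetical. Each new instance goes to the rack that currently holds the fewest instances of the same service.

```python
from collections import Counter


def pick_rack(racks, placed):
    """Return the rack that currently hosts the fewest instances.

    `racks` is the ordered list of available rack names (hypothetical),
    `placed` is the list of racks already chosen for earlier instances.
    """
    counts = Counter(placed)
    # Counter returns 0 for racks with no instances yet.
    # min() keeps the first rack on ties, so the choice is deterministic.
    return min(racks, key=lambda rack: counts[rack])


def place_replicas(racks, replicas):
    """Spread `replicas` instances of one service across the racks."""
    placed = []
    for _ in range(replicas):
        placed.append(pick_rack(racks, placed))
    return placed


if __name__ == "__main__":
    # Four instances over three racks: no rack gets a second
    # instance before every rack has one.
    print(place_replicas(["rack-a", "rack-b", "rack-c"], 4))
```

With this kind of placement, losing one rack (a power or switch failure, for example) takes down only a part of the instances of a service instead of all of them.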

2. Set up services

Once the machines are provisioned, the SRE also takes care of setting up the services on them. These can be networking services, proxy or load-balancing services, container or orchestration services, message queues, databases, caching systems, big data services, and more, along with the disk setup. In this way, SREs are exposed to a variety of technologies and play an important role in the components involved in an application. Example technologies include NGINX, Apache, RabbitMQ, Kafka, Hadoop, Traefik, MySQL, PostgreSQL, Aerospike, MongoDB, Redis, MinIO, Kubernetes, Apache Mesos, Marathon, MariaDB, and Galera.

3. Optimize the infrastructure

Since several components and services are used in the infrastructure, there is scope for improvement in performance, efficiency, and security. The SRE optimizes the components by keeping them up to date, choosing the right service for the right job, and patching the servers.

4. Write monitoring scripts

When SREs maintain an infrastructure of any size, they never underestimate any component of it: they write monitoring scripts that track every component and its metrics. This provides real-time alerts when a component malfunctions, and also a better view of the infrastructure. SREs use programming languages like Bash, Python, Golang, and Perl, and tools like daemon processes, Riemann, InfluxDB, OpenTSDB, Kafka, Grafana, Prometheus, and APIs to monitor the infrastructure.
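As an illustration of what such a monitoring script could look like, here is a minimal sketch in Python that checks disk usage against two thresholds. The 80% and 90% thresholds and the `classify` and `check_disk` names are assumptions made for this example; a real script would typically ship the value to one of the tools above instead of printing it.

```python
import shutil


def classify(value, warn, crit):
    """Map a metric value to an alert level using two thresholds."""
    if value >= crit:
        return "critical"
    if value >= warn:
        return "warning"
    return "ok"


def check_disk(path=".", warn=80.0, crit=90.0):
    """Return (status, percent_used) for the filesystem holding `path`.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    percent = 100.0 * usage.used / usage.total if usage.total else 0.0
    return classify(percent, warn, crit), percent


if __name__ == "__main__":
    status, percent = check_disk(".")
    print(f"disk usage {percent:.1f}% -> {status}")
```

Run from cron or a small daemon, a check like this can raise an alert whenever the status is not "ok".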

5. Write automation scripts

If a task has more than 10 steps and is likely to be performed more than once, the SRE never hesitates to automate it. This saves time and also prevents human error. SREs use programming languages like Bash, Python, Golang, and Perl, and tools like Ansible, to automate tasks.
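A multi-step task can be automated in many ways. The sketch below, in Python, shows one simple shape for it: the steps are listed once, run in order, and the run stops at the first failure so that a half-finished task is easy to spot. The `run_steps` helper and the step names in the usage example are hypothetical.

```python
def run_steps(steps):
    """Run (name, action) pairs in order and stop at the first failure.

    Returns (completed_names, error_message); error_message is None
    when every step succeeded.
    """
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            # Report which step failed and skip everything after it.
            return completed, f"{name} failed: {exc}"
        completed.append(name)
    return completed, None


if __name__ == "__main__":
    log = []

    def fail():
        raise RuntimeError("service did not start")

    # Hypothetical steps; real ones would call scripts or APIs.
    steps = [
        ("create-user", lambda: log.append("user created")),
        ("start-service", fail),
        ("register-monitoring", lambda: log.append("monitoring registered")),
    ]
    print(run_steps(steps))
```

Because the steps live in one list, the same task runs the same way every time, which is where the saving in time and human error comes from.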

6. Manage users on the machines

One of the main security precautions that SREs take is to restrict user access to the components in the infrastructure. They use technologies like VPN (Virtual Private Network), firewalls, configuration files, user management on machines, LDAP, sudoers configuration, PAM, OTP, two-factor authentication, SSH keys, and more to prevent unauthorized access to any component of the infrastructure.

These are the Creation aspects of a Site Reliability Engineer. In the next article, we will read about the Monitoring aspect of a Site Reliability Engineer.
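One small piece of user management on a machine is knowing who can log in at all. As a sketch, the Python below reads text in the standard `/etc/passwd` format (seven colon-separated fields, the last one being the login shell) and reports accounts with a login shell that are not on an allowlist. The sample accounts, the allowlist, and the helper names are made up for the example, and the set of non-login shells is an assumption that may need adjusting per distribution.

```python
# Shells commonly used to disable interactive login (assumed list).
NON_LOGIN_SHELLS = {"/usr/sbin/nologin", "/sbin/nologin", "/bin/false"}


def login_accounts(passwd_text):
    """Return the names of accounts whose shell allows interactive login."""
    accounts = []
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) != 7:
            continue  # skip malformed entries
        name, shell = fields[0], fields[6]
        if shell not in NON_LOGIN_SHELLS:
            accounts.append(name)
    return accounts


def unexpected_accounts(passwd_text, allowed):
    """Return login-capable accounts that are not in the allowlist."""
    return [name for name in login_accounts(passwd_text) if name not in allowed]


if __name__ == "__main__":
    sample = """\
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
alice:x:1000:1000:Alice:/home/alice:/bin/bash
mallory:x:1001:1001::/home/mallory:/bin/sh
"""
    print(unexpected_accounts(sample, allowed={"root", "alice"}))
```

On a real machine the same functions could be fed the contents of `/etc/passwd`; accounts managed through LDAP would need a separate lookup.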

You can find more articles here: https://www.tharunshiv.com

Check out my YouTube Channel here: Developer Tharun

Thank you for reading!
