Description
In this course, you will:
- Learn about site reliability engineering's roles and responsibilities, as well as how they differ from those of other teams.
- Learn how the role helps an enterprise improve, what costs it carries, which types of team members it involves, and which tools a team may use.
- Cover monitoring, high availability (HA) and disaster recovery (DR), infrastructure as code, and database recovery and availability.
- Learn the fundamentals of SLOs and SLIs, as well as how to convert them into queries and, finally, graphs.
- Learn how to create highly available databases and deploy them to AWS.
- Learn how to deploy microservices or cloud architectures that are resilient enough to withstand failures and predictable enough that issues can be resolved through automation, without human intervention.
- Understand the fundamentals of self-healing system design, deployment strategies, implementation steps, and use cases.
- Learn cloud automation to improve system resiliency.
- Work through the incident management process, learn how to run on-call effectively, and learn how to develop processes and frameworks that drive workplaces toward putting reliability first.
- Learn how to conduct reliability reviews at different phases of your system, how to manage system capacity effectively, and how to reduce toil.
Syllabus:
- Foundations of Observability
- Planning for High Availability and Incident Response
- Self-Healing Architecture
- Establishing a Culture of Reliability