As a Senior Site Reliability Engineer you will drive adoption of SRE best practice across our cloud estate. Utilising both your soft skills and technical experience, you will work with teams to ensure our standards and governance requirements are met by onboarding our services into the cloud through a dedicated assessment stage-gate process. This in turn ensures that our citizen-facing applications satisfy all the operational and security requirements for running in production.
You will execute deployments using runbooks, investigate production incidents and provide dedicated support to teams to determine the root cause. You will provide an on-call service to help restore services, using dedicated runbooks or your technical experience. You will help to reduce toil and increase automation, improving reliability while reducing the time to live and the cost of repetitive tasks. You will provide guidance to our development teams and influence their adoption of best practice.
Terraform, Ansible, Python, Bash, GitLab CI/CD, AWS or Azure managed services, monitoring services (CloudWatch/Prometheus/Azure Monitor), containers
Design and develop techniques for improving application reliability, runbooks, knowledge transfer to the UXCC, and ongoing SRE strategy within your Functional and Professional Communities
Act as the focal point for the investigation and resolution of major or complex incidents for the service, ensuring people with the right skills and expertise are proactively available to respond effectively
Assess the impact of change requests in consultation with stakeholders, providing technical expertise and authorising the implementation of subsequent changes
Be on-call for applications that require out-of-hours SRE coverage
Undertake comprehensive analysis of performance trends to identify root causes, progressing opportunities to improve the reliability, security and capability of infrastructure, application and site services
Actively engage with senior stakeholders and provide clear communication of incident resolution and service improvements
Assure critical changes to the applications and supporting infrastructure
Develop and maintain relevant knowledge such that it can be easily annotated, updated, referenced, and consumed
Conduct code assessments, with a view to correcting errors and providing recommendations for reliability improvements
Manage the team backlog for the applications for which you are accountable
Coach and mentor application development and operations engineers in the practice and techniques of SRE
Conduct reflectives for all high-priority and major incidents, ensuring they are completed promptly and published
Routinely seek views and capture ideas from stakeholders and team members for improvements and encourage collaboration and innovation
Take part in interdepartmental discussions and meetings with a wide variety of external bodies and organisations on a local, regional, national or international basis, and lead community discussions about SRE best practice within Engineering
Desirable – Kubernetes, system administration (RHEL, Windows Server, etc.), network configuration (DNS, routing, load balancing, etc.)