Lead Site Reliability Engineer

Job Description

As a Senior Site Reliability Engineer you will drive adoption of SRE best practice across our cloud estate. Using both your soft skills and technical experience, you will work with teams to ensure our standards and governance are met by onboarding our services into the cloud through a dedicated assessment stage-gate process, so that our citizen-facing applications satisfy all the operational and security requirements for running in production.

  • Responsible for contributing authoritative advice and guidance to others in the organisation and externally.

  • Design and develop the techniques for improving application reliability, run books, knowledge transfer to the UXCC, and ongoing SRE strategy within your Functional and Professional Communities

  • Act as the focal point for the investigation and resolution of major or complex incidents for the service, ensuring people with the right skills and expertise are proactively available to respond effectively

  • Assess the impact of change requests in consultation with stakeholders, providing technical expertise and authorising the implementation of subsequent changes

  • Undertake comprehensive analysis of performance trends to identify root causes, progressing opportunities to improve the reliability, security and capability of infrastructure, application and site services

  • Actively engage with senior stakeholders and provide clear communication of incident resolution and service improvements.

  • Assure critical changes to the applications and supporting infrastructure

  • Conduct code assessments, with a view to correcting errors and providing recommendations for reliability improvements

  • Take part in interdepartmental discussions and meetings with a wide variety of external bodies and organisations on a local, regional, national or international basis, leading community discussions about SRE best practice within Engineering.


Your skills and experience:

  • Essential - Terraform, Ansible, Python, Bash, GitLab CI/CD, AWS or Azure managed services, Monitoring services (CloudWatch/Prometheus/Azure Monitor), Containers

  • Desirable - Kubernetes, System administration (RHEL, Windows Server, etc.), Network configuration (DNS, routing, load balancing, etc.)

  • Understanding of security engineering and security best practice

  • Ability to architect and administer scalable, cloud-native and on-premises applications

  • Strong time management and change management skills.

  • Strong communication skills across multiple stakeholder types

  • Strong skills in setting, communicating, implementing, and achieving business objectives and goals through direct management

  • Proven ability to modify and maintain systems and code developed by other engineers

  • The ability to lead engineers in a complex, multi-disciplinary environment, delivering products within specific timescales