This is an exciting opportunity for an experienced Site Reliability Engineer (SRE) to join a leading and well-established eCommerce company within the food industry. We are looking for an experienced and passionate Senior Site Reliability Engineer to help us move our Platform to the next level. You would be joining the SRE team, which is currently being formed, so you will be able to have a massive impact on the shape of the team and how it delivers to Tech.
The SRE team will achieve this by providing Platform tooling such as monitoring, alerting, and testing capabilities, as well as by improving operational knowledge and enabling operational best practices across the development teams. The team will own processes around incident management and encourage a learning culture. Tech has a DevOps mentality: our development squads are responsible for the complete lifecycle of the software they build, including operational responsibilities in production.
- A depth of knowledge of Site Reliability Engineering. You are experienced in enabling platform stability through tooling, operational processes and sharing operational knowledge.
- Ability to influence and introduce change into development teams. This might be introducing new tooling, new processes, or ways of working to improve operational stability.
- Excellent communication and presentation skills, whether to the engineering team, to business stakeholders or to our leadership team.
- You are curious and are always looking to learn. You encourage a culture of learning, emphasising the importance of breadth as well as depth of knowledge.
- You deliver rapidly in small batches, reducing risk and creating a fast feedback loop. You have a continuous improvement mindset, constantly seeking to reduce waste and avoid re-work.
- Experience of working in an SRE team, with a proven track record of working with multiple development teams to improve the operational quality of the software they build.
- Operational experience of building, running, and monitoring microservices on a cloud platform.
- Experience of using and implementing application monitoring tools (PagerDuty, ELK, New Relic).
- Experience of owning processes to improve operational best practices (Production incident process).
- Experience in working in a lean and agile environment.
- Proven knowledge and experience of using AWS in production.
- Experience in a development language (Python).
Skills: DevOps, Site Reliability Engineer, AWS, Azure, CI/CD, Cloud, Jenkins, Docker, Kubernetes (K8S), Deployment, Splunk, Ansible, Python, Ruby, Shell.