About the Role
We’re looking for a Senior SRE / DevOps Engineer to join our Platform Tribe, a lean, senior team where ownership is high and expectations are even higher. This is a deeply hands-on role at the core of a high-traffic system, where you’ll be directly responsible for maintaining reliability, performance, and stability in a fast-paced environment.
You’ll be working on real-time production challenges, handling incidents, managing alerts, and being part of a critical on-call rotation. This role requires resilience, strong decision-making under pressure, and a proactive mindset to continuously improve systems operating at scale.
If you thrive in high-load environments, enjoy solving complex production issues, and want to have a direct impact on systems used by millions, this is the place for you.
Key Responsibilities
- Own system reliability by actively monitoring platform health, managing alerts, and responding to incidents in real time
- Participate in 24/7 on-call rotations, taking full ownership of production stability in a high-traffic (5–7k RPS) environment
- Investigate incidents, perform root cause analysis, and implement long-term fixes to prevent recurrence
- Build and continuously improve monitoring, alerting, and observability across the Kubernetes (EKS) ecosystem
- Deploy, manage, and optimise infrastructure using Terraform, Helm, and GitOps tools (Flux/ArgoCD)
- Drive automation and proactively improve system resilience, reducing manual intervention and recurring issues
- Maintain and evolve CI/CD pipelines and infrastructure-as-code practices
- Collaborate closely with engineering teams to support deployments and minimise user impact in a live environment
- Introduce and integrate new tools and technologies to enhance scalability, reliability, and performance
- Handle environment-specific requests and ensure smooth day-to-day platform operations under constant load
Requirements
- Strong hands-on experience with Kubernetes (deployment, scaling, troubleshooting) in high-load environments
- Experience with GitOps tools such as FluxCD or ArgoCD
- Proven experience in incident response, root cause analysis, and postmortems in production systems
- Solid experience with AWS, Terraform, Docker, and CI/CD pipelines
- Experience with monitoring and observability tools such as Datadog, Prometheus, Grafana, and logging stacks like ELK or CloudWatch
- Strong understanding of networking concepts and protocols
- Proficiency in at least one scripting or programming language (e.g. Python, Go, JavaScript/Node.js)
- Experience working with version control systems (Git)
- Familiarity with incident management tools like PagerDuty, Opsgenie, or similar
- Ability to operate effectively in a fast-paced, high-pressure environment with strong ownership and accountability
- Proactive, resilient mindset with a focus on continuous improvement and system stability
What We Offer
- Competitive Salary
- Quarterly Bonuses
- Unlimited Paid Time Off
- Unlimited Paid Sick Leave
- Remote & Flexible Working
- Private Medical Insurance
- Financial Support for Life Events
- Professional Development Budget
- International Exposure
- Regular Company Events
*Benefits may vary depending on location and contractual agreement.
Recruitment Process
1. HR Interview (30–45 min)
2. Technical Interview (90 min)