About the role
Join the Future of Technology with ZILO™
At ZILO™, we're redefining what’s possible in technology. ZILO™ is a UK-based FinTech specialising in global asset and wealth management software, designed to scale and transform businesses of all types using our in-house AI technology. Our mission is to digitalise the future of the global asset management industry.
We are a team of experts with decades of combined experience at leading firms globally, who thrive in fast-paced environments and want to shape the future of technology. Every individual plays a key role in driving progress and making a real impact. We continuously strive to innovate and improve.
Why work with us? At ZILO™, you'll be part of a dynamic and inclusive environment where creativity thrives. We offer the opportunity to work on cutting-edge technology, collaborate with talented individuals, and contribute to projects that have a real-world impact. We value continuous learning, personal growth, and providing our team with the resources they need to succeed.
Ready to shape the future? Let’s talk.
Role Details
As Lead Site Reliability Engineer, you'll provide technical and people leadership for the SRE team. You'll be responsible for the team's day-to-day operation, technical direction and engineering delivery, whilst working closely with the Director of Global Cloud Infrastructure to execute the wider platform strategy.
Reporting to the Director of Global Cloud Infrastructure and working in close partnership with them, you'll translate strategic objectives into operational delivery, ensuring the team consistently delivers a resilient, secure and highly available cloud platform.
You'll work closely with Platform Engineering, Software Engineering, Product, Security and Client Operations teams to improve platform reliability through automation, engineering excellence and operational best practices.
You'll also play a key role in shaping the future of AI-assisted operations at ZILO, identifying opportunities to use AI to improve reliability engineering, incident response, automation and engineering productivity.
This is a hands-on engineering leadership role where approximately 70% of your time will be spent working alongside the team as a Site Reliability Engineer, designing solutions, improving automation and supporting production platforms. The remaining 30% will focus on leading the team, including coaching and mentoring engineers, managing performance, workload planning, holidays, on-call rotas and career development. You'll be expected to lead from the front, setting the technical standard through your own engineering contributions.
This role is not suited to candidates looking to move away from hands-on engineering into full-time management. We're looking for someone who enjoys leading people whilst remaining an active Site Reliability Engineer, spending the majority of their time solving technical challenges alongside the team.
Requirements
Key Responsibilities
Technical Leadership
- Lead the Site Reliability Engineering team by example, remaining hands-on with the technology and setting the standard for engineering excellence.
- Provide technical guidance and mentorship to SREs, fostering a culture of collaboration, continuous learning and operational excellence.
- Work closely with Platform Engineering and Software Engineering teams to improve the reliability, scalability and operability of our services.
- Champion SRE principles and best practices across the engineering organisation.
- Encourage the adoption of AI-assisted engineering practices, enabling the team to deliver more effectively while maintaining high standards of quality, security and reliability.
People Leadership
- Lead, coach and develop a high-performing team of Site Reliability Engineers.
- Conduct regular one-to-one meetings, performance reviews and career development discussions.
- Manage workload, priorities and sprint commitments across the team.
- Set clear objectives and support engineers in achieving individual and team goals.
- Manage holidays, leave requests, on-call rotas and resource planning to ensure effective operational coverage.
- Recruit, onboard and mentor new team members as the team grows.
- Foster a collaborative, accountable and high-performing engineering culture.
Production Operations
- Own the reliability, availability and performance of ZILO's production platforms.
- Work alongside Platform Engineering and Software Engineering teams to support, maintain and patch production environments.
- Define, measure and continually improve Service Level Indicators (SLIs), Service Level Objectives (SLOs) and error budgets.
- Drive continuous improvements to system resilience, fault tolerance and operational readiness.
- Lead platform capacity planning and performance optimisation activities.
Incident Management
- Lead the technical response to high-severity production incidents, providing calm and effective technical leadership during major outages.
- Coordinate post-incident reviews and Root Cause Analysis (RCA), ensuring corrective and preventative actions are identified, prioritised and delivered.
- Drive continuous improvement of incident management processes, reducing Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR).
Automation & Platform Engineering
- Eliminate operational toil through automation and engineering-led improvements.
- Design, develop and maintain tooling that improves engineering productivity and operational efficiency.
- Identify opportunities to simplify operational processes through automation and self-service capabilities.
- Work with Platform Engineering to ensure the platform remains scalable, resilient and capable of supporting future business growth.
- Identify opportunities to leverage AI to improve operational efficiency, automate repetitive tasks and enhance engineering productivity.
- Develop and adopt AI-assisted tooling to support troubleshooting, platform operations and continuous improvement.
- Champion the responsible adoption of AI across the engineering organisation.
Observability
- Continuously improve monitoring, alerting and operational dashboards across the platform.
- Enhance platform observability through effective use of metrics, logs and distributed tracing.
- Design meaningful alerts that reduce noise and alert fatigue whilst improving the speed and accuracy of incident detection.
- Promote a data-driven approach to platform reliability using telemetry and operational insights.
Operational Responsibilities
- Participate as an active member of the SRE team, contributing to the day-to-day support and operation of production platforms.
- Participate in the team's on-call rota, responding to production incidents when required.
- Participate in a rotational UK shift pattern to provide operational coverage and enable close collaboration with UK engineering teams.
- Support planned maintenance activities, production releases and out-of-hours changes where required.
- Ensure the SRE team provides effective operational coverage across business-critical services.
Continuous Improvement
- Promote a culture of continuous improvement, engineering excellence and operational ownership.
- Collaborate with Development teams to embed reliability into the software development lifecycle.
- Identify opportunities to improve platform stability, deployment processes and operational efficiency through engineering best practices.
Required Skills & Experience
Essential
- Significant experience in Infrastructure, DevOps, Platform Engineering or Site Reliability Engineering, including several years in a technical leadership or team leadership role.
- Proven experience leading an SRE team within a production environment.
- Demonstrable experience balancing line management responsibilities with hands-on technical engineering.
- Experience managing engineers through performance reviews, coaching, mentoring and career development.
- Experience managing operational teams, including workload planning, on-call rotas and resource management.
- Strong AWS experience.
- Production Kubernetes experience (Amazon EKS preferred).
- Experience with Infrastructure as Code (Terraform preferred).
- Experience building and maintaining CI/CD pipelines using GitHub Actions or similar.
- Experience implementing and maintaining observability platforms (Grafana, Prometheus, OpenTelemetry and ClickHouse preferred).
- Experience operating highly available, customer-facing SaaS platforms.
- Strong understanding of networking, DNS, TLS, load balancing and cloud infrastructure.
- Experience leading major production incidents and conducting Root Cause Analysis.
- Experience using modern AI-assisted engineering tools and agents (such as Claude Code, GitHub Copilot, ChatGPT or similar) to improve engineering productivity and software delivery.
- Excellent communication and stakeholder management skills.
- Willingness to participate in an on-call rota and rotational UK shift pattern.
Desirable
- Financial Services or FinTech experience.
- Experience supporting regulated environments.
- Experience with Karpenter.
- Experience managing container security.
- Experience with OpenTelemetry.
- AWS Professional or Specialty certifications.
- Kubernetes certifications (CKA preferred).
- Experience with Chaos Engineering and resilience testing.
- Experience building or working with AI agents, AI-assisted automation or agentic engineering workflows.
Technical Stack
You'll ideally have experience with many of the following technologies:
- AWS
- Kubernetes (Amazon EKS)
- Terraform
- GitHub
- GitHub Actions
- Grafana
- Prometheus
- OpenTelemetry
- ClickHouse
- Linux
- Docker
- Python, Go or Bash
- Networking & DNS
- AI-assisted engineering tools (Claude Code, GitHub Copilot, ChatGPT or equivalent)
What Success Looks Like
Within your first 12 months you will:
- Successfully establish yourself as the technical and people leader for the Bangkok SRE team.
- Improve platform availability, resilience and reliability through engineering excellence and Chaos Engineering practices.
- Reduce Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR).
- Improve observability across all production services through meaningful telemetry and actionable alerting.
- Reduce operational toil through automation and AI-assisted engineering.
- Introduce AI-assisted engineering practices that improve operational efficiency and increase team productivity.
- Develop and mentor a high-performing SRE team with clear objectives and strong engineering practices.
- Build strong working relationships with Platform Engineering, Software Engineering and UK-based teams.
- Help foster a culture of ownership, operational excellence and continuous improvement across engineering.
Leadership Competencies
- Hands-on technical leadership
- Systems thinking
- Calm under pressure
- Data-driven decision making
- Coaching and mentoring
- Strong stakeholder management
- Excellent communication skills
- Pragmatic problem solving
- Continuous improvement mindset
- Collaborative approach across engineering, operations and product teams
Candidate Profile
The ideal candidate is a proven engineering leader who has successfully led an SRE team before. You're passionate about building highly reliable cloud platforms, enjoy solving complex distributed systems problems and believe the best leaders remain close to the technology.
You thrive in production environments, are comfortable leading major incidents, and enjoy mentoring engineers through example rather than hierarchy. You understand that reliability is built through automation, observability and engineering excellence—not manual processes—and you're motivated by helping both people and platforms perform at their best.
At ZILO, we believe AI is transforming software engineering. We're building an AI-native engineering organisation where engineers use AI as a force multiplier—not a replacement—to automate repetitive work, accelerate problem solving and focus on delivering greater value. We're looking for someone who shares that mindset and is excited to help shape the future of Site Reliability Engineering through the practical application of AI.
Benefits
- 23 days' annual holiday (starting and fixed at 23 days)
- 15 Public Holidays
- Provident Fund
- Health insurance (including immediate family)