Join Our Team
Oowlish, one of Latin America's rapidly expanding software development companies, is seeking experienced technology professionals to enhance our diverse and vibrant team.
As a valued member of Oowlish, you will collaborate with premier clients from the United States and Europe, contributing to pioneering digital solutions. Our commitment to creating a nurturing work environment is recognized by our certification as a Great Place to Work. Here you will have opportunities for professional development and growth, and a chance to make a significant international impact.
We offer the convenience of remote work, allowing you to craft a work-life balance that suits your personal and professional needs. We're looking for candidates who are passionate about technology, proficient in English, and excited to collaborate remotely with teams around the world.
About the Role:
We are looking for an experienced Senior Site Reliability Engineer (SRE) to own the reliability, availability, and operational excellence of business-critical production systems.
This is a dedicated Site Reliability Engineering role, not a general DevOps or Infrastructure position. You will define how reliability is measured, lead incident response during production outages, drive observability strategy, and continuously improve operational practices across high-availability environments.
The ideal candidate has hands-on experience managing SLOs, leading major incidents, improving on-call operations, and building a strong reliability culture through automation, observability, and continuous improvement.
Responsibilities:
Define, implement, and continuously improve Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets.
Develop and maintain observability strategies, including monitoring, logging, tracing, and alerting.
Own observability configuration, instrumentation, and alert optimization.
Lead Incident Command during production incidents and coordinate cross-functional response efforts.
Drive blameless postmortems and ensure corrective actions are completed.
Own and continuously improve the on-call program, including rotations, escalation policies, runbooks, and alert tuning.
Establish production readiness standards for new services.
Partner with engineering teams on capacity planning, scalability, and disaster recovery initiatives.
Automate operational processes and reliability improvements using software engineering best practices.
Continuously improve system reliability, availability, and operational efficiency.
Requirements:
5+ years of experience in Site Reliability Engineering, Production Engineering, Reliability Engineering, or similar roles.
Proven experience operating production systems in high-availability environments.
Hands-on experience defining and managing SLOs, SLIs, and Error Budgets.
Experience leading production incident response and Incident Command.
Strong observability and monitoring experience.
Strong software engineering skills using Python, Go, or TypeScript.
Experience working with cloud platforms.
Strong written and verbal English communication skills.
Must have:
Proven Site Reliability Engineering experience.
Experience defining and managing:
- Service Level Indicators (SLIs)
- Service Level Objectives (SLOs)
- Error Budgets
Experience leading Incident Command during major production incidents.
Experience conducting blameless postmortems and driving follow-up actions.
Experience designing, maintaining, and improving on-call programs.
Experience developing runbooks and escalation policies.
Strong observability experience, including:
- Monitoring
- Logging
- Alerting
- Distributed Tracing
Experience tuning alerts to reduce operational noise.
Strong automation skills using Python, Go, or TypeScript.
Experience supporting mission-critical production systems.
Experience working in high-availability production environments.
Nice to have:
Experience with Datadog.
Experience with AWS.
Experience with Heroku.
Experience working in regulated industries (Healthcare, HIPAA, Financial Services, etc.).
Experience establishing or maturing an SRE practice.
Capacity planning experience.
Disaster recovery planning and execution.
Experience with Kubernetes.
Experience with PostgreSQL or SQL Server.
Experience supporting modern TypeScript-based applications.
Benefits & Perks:
Home office;
Competitive compensation based on experience;
Career plans to allow for extensive growth in the company;
International Projects;
Oowlish English Program (Technical and Conversational);
Oowlish Fitness with Total Pass;
Games and Competitions.
You can also apply here:
Website: https://www.oowlish.com/work-with-us/
LinkedIn: https://www.linkedin.com/company/oowlish/jobs/
Instagram: https://www.instagram.com/oowlishtechnology/