At Shakudo, we're building the world's first operating system for data and AI. We use the term "operating system" in the truest sense: just like iOS, Windows, or Linux, Shakudo's end-to-end OS provides ever-evolving, fully automated, best-in-class open-source components tailored to each business's unique needs.
We are seeking an Infrastructure Engineer to join our Business Automation team to own and operate the internal systems, infrastructure, and AI Gateway product that power Shakudo at scale. This is a hands-on role for someone who thrives on keeping production systems reliable, secure, and fast. You will be responsible for everything from physical servers and DGX machines to CI/CD pipelines and customer-facing AI Gateway infrastructure. You will also contribute directly to product hardening, security, and DevOps practices across the platform.
At Shakudo, our culture is proactive, collaborative, and supportive: we succeed together by building strong partnerships and solving complex challenges. We expect high ownership: you will be hands-on, driving outcomes directly rather than delegating or waiting for direction. Individual contribution matters here, and your work will have a visible, measurable impact on the company's operations and product.
Key Responsibilities
Maintain and operate internal services used by Shakudo employees, including proprietary applications for sales and ETL pipelines
Maintain and operate DGX machines that host LLMs for the team's use
Maintain and operate Shakudo's product for internal use, and contribute to product hardening, security, and DevOps practices
Maintain and operate physical servers for Kubernetes clusters and ensure uptime
Create CI/CD pipelines for internal deployments
Maintain and operate the AI Gateway product for customers, ensure uptime, and contribute to the product roadmap
Qualifications
8+ years of experience across software, data, platform, or AI engineering roles
5+ years of hands-on experience with Kubernetes cluster operations, DevOps, and bare-metal server operations
Experience operating production infrastructure at scale, including physical servers, GPU clusters, and CI/CD systems
Strong background in security hardening, observability, and reliability engineering
Proficiency in Rust is preferred
Experience with AI/ML infrastructure, including LLM hosting and inference serving, is preferred
Why Shakudo Stands Out
Work with cutting-edge technologies in machine learning and high-performance computing. Contribute to a platform that transforms how organizations leverage data and AI. Join a dynamic team that values innovation, efficiency, and diversity.
Shakudo offers a high-impact package: competitive salary, meaningful equity so you share in the upside of transformational technology, and comprehensive health benefits. We provide a flexible vacation policy because building this technology requires supporting the people who build it. More importantly, you'll work on technology that matters.
This role is based onsite in Toronto to support the high security requirements of our clients and enable effective collaboration. We have a welcoming office environment with a very focused and passionate team, doing meaningful, impactful work together.
Shakudo is an equal opportunity employer. We foster diversity and inclusivity and welcome applications from candidates of all backgrounds and experiences.