All active MLOps roles based in Singapore.
About Nebius:
Nebius is leading a new era in cloud infrastructure for the global AI economy. We are building a full-stack AI cloud platform that supports developers and enterprises from data and model training through to production deployment, without the cost and complexity of building large in-house AI/ML infrastructure.
Built by engineers, for engineers. From large-scale GPU orchestration to inference optimization, we own the hard problems across compute, storage, networking and applied AI.
Listed on Nasdaq (NBIS) and headquartered in Amsterdam, we have a global footprint with R&D hubs across Europe, the UK, North America and Israel. Our team of 1,500+ includes hundreds of engineers with deep expertise across hardware, software and AI R&D.
We seek an experienced Senior ML Solutions Architect to support customers leveraging Nebius Token Factory's serverless inference platform for open-source LLMs across multiple modalities. In this role, you will collaborate with clients to design and implement customized LLM-based solutions, architect scalable AI applications using our served models, and work with our backend team to improve our platform to match clients' needs.
You’re welcome to work remotely from Singapore.
Your responsibilities will include:
We expect you to have:
It would be an added bonus if you have:
Preferred technical stack:
Benefits & Perks:
What's it like to work at Nebius:
Fast moving - Bold thinking - Constant growth - Meaningful impact - Trust and real ownership - Opportunity to shape the future of AI
Equal Opportunity Statement:
Nebius is an equal opportunity employer. We are committed to fostering an inclusive and diverse workplace and to providing equal employment opportunities in all aspects of employment. We do not discriminate on the basis of race, color, religion, sex (including pregnancy), national origin, ancestry, age, disability, genetic information, marital status, veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by applicable law.
Applicants must be authorized to work in the country in which they apply and will be required to provide proof of employment eligibility as a condition of hire.
If you need accommodations during the application process, please let us know.
Ready to apply?
Apply to Nebius
HPC & Cloud Infrastructure Engineer
Important Information
Location: Singapore
12 months contract
Job Summary
We’re hiring an HPC & Cloud Infrastructure Engineer to design, deploy, and optimize high-performance computing environments across on-prem and cloud. You’ll manage HPC clusters, interconnects, and job schedulers, and enable AI/ML workloads at scale while driving automation and cost efficiency.
Job Description
Architect, deploy, and manage HPC clusters with job schedulers, parallel file systems, and cluster management tools
Design, configure, and troubleshoot InfiniBand high-throughput, low-latency interconnects for HPC/distributed workloads
Own PBS Professional scheduling: deployment, queue optimization, custom job submission scripts, workload management
Administer RHEL-based systems: performance tuning, package management, security hardening, patching via Red Hat Satellite and Ansible
Build and maintain cloud HPC environments on AWS, Azure, and GCP – provisioning, hybrid setups, migrations, and cost optimization
Implement Infrastructure as Code using Terraform/Ansible and integrate with CI/CD pipelines for reproducible infrastructure
Enable GPU & AI/ML workloads: containers, TensorFlow, PyTorch, scikit-learn, Keras, MXNet; support MLOps pipelines for training and deployment
Optimize parallel applications using MPI and OpenMP; debug and scale distributed/shared memory workloads
Drive monitoring, logging, and alerting for cluster health, job efficiency, and resource utilization
Required Skills and Experience
High-Performance Computing
Hands-on experience managing HPC clusters with job schedulers, cluster management tools, parallel programming libraries, and parallel filesystems.
Knowledge of resource scheduling and job optimization for efficient workload management
InfiniBand (Networking)
Hands-on experience with high-throughput, low-latency interconnect technologies such as InfiniBand.
Ability to design, configure, and troubleshoot interconnects in HPC or distributed environments.
Operating Systems and Environments
Administration and configuration of RHEL-based systems.
Performance tuning, package management, and security hardening.
Knowledge of Red Hat Satellite and Ansible for automation.
Job Scheduling with PBS Professional
Experience in deploying and managing PBS Professional for scheduling and workload management in HPC environments.
Customizing job submission scripts and optimizing job queues.
Parallel Programming Libraries
MPI (Message Passing Interface) and OpenMP (Open Multi-Processing):
Proficiency in writing, debugging, and optimizing parallelized code.
Experience with scaling applications across HPC systems.
Understanding of distributed memory (MPI) and shared memory (OpenMP) paradigms.
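The two paradigms named above can be mirrored with Python's standard library, purely as an illustrative sketch (not part of the posting's stack): threads share one address space and coordinate through a lock, as in OpenMP, while separate processes hold no shared state and exchange partial results as messages, as in MPI.

```python
# Illustrative sketch: shared-memory vs. message-passing parallel sums.
import threading
import multiprocessing as mp

def shared_memory_sum(values, n_threads=4):
    """Threads mutate one shared accumulator under a lock (OpenMP-style)."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)      # independent local work
        with lock:                # critical section, like '#pragma omp critical'
            total += partial

    chunks = [values[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

def _rank_sum(chunk, queue):
    # Each "rank" sends its partial result back, like an MPI reduce.
    queue.put(sum(chunk))

def message_passing_sum(values, n_procs=4):
    """Processes share nothing; partial sums travel over a queue (MPI-style)."""
    queue = mp.Queue()
    chunks = [values[i::n_procs] for i in range(n_procs)]
    procs = [mp.Process(target=_rank_sum, args=(c, queue)) for c in chunks]
    for p in procs:
        p.start()
    total = sum(queue.get() for _ in procs)  # gather and reduce at the root
    for p in procs:
        p.join()
    return total

if __name__ == "__main__":
    data = list(range(1, 101))
    print(shared_memory_sum(data))    # 5050
    print(message_passing_sum(data))  # 5050
```

The trade-off the bullet points at: shared memory makes data access cheap but requires synchronization around mutable state, while message passing avoids shared mutable state at the cost of explicit communication.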
Cloud Platforms
AWS, Azure, Google Cloud:
Expertise in provisioning, configuring, and managing services on all three platforms.
Cross-platform migration and hybrid cloud solutions knowledge.
Proficiency in managing high-performance computing (HPC) clusters on the cloud.
Deep understanding of cost optimization, security, and cloud native development tools (e.g., Kubernetes, Terraform).
Infrastructure as Code (IaC)
Ability to design, deploy, and maintain infrastructure using automation and configuration management tools.
CI/CD pipeline integration for IaC workflows.
GPU & AI Libraries and Tools
Hands-on experience with container technologies.
Hands-on experience with TensorFlow, PyTorch, scikit-learn, Keras, or MXNet.
Familiarity with AI/ML pipelines, model training, and optimization.
Knowledge of MLOps tools for deploying and monitoring models
About Encora
Encora is a global company that offers Software and Digital Engineering solutions. Our practices include Cloud Services, Product Engineering & Application Modernization, Data & Analytics, Digital Experience & Design Services, DevSecOps, Cybersecurity, Quality Engineering, AI & LLM Engineering, among others.
At Encora, we hire professionals based solely on their skills and do not discriminate based on age, disability, religion, gender, sexual orientation, socioeconomic status, or nationality.
Ready to apply?
Apply to Encora
Workato delivers enterprise infrastructure for the agentic era, redefining iPaaS and helping enterprises unify data, applications, processes, and AI into a single, governed platform. A leader in Enterprise MCP and trusted by 50% of the Fortune 500, Workato’s cloud-native architecture connects every application, data source, and process to power real-time orchestration at scale. With enterprise-grade security and continuous innovation at its core, Workato provides the trusted foundation for organizations to automate with confidence and operationalize AI across the business. To learn more, visit www.workato.com.
Ultimately, Workato believes in fostering a flexible, trust-oriented culture that empowers everyone to take full ownership of their roles. We are driven by innovation and looking for team players who want to actively build our company.
But, we also believe in balancing productivity with self-care. That’s why we offer all of our employees a vibrant and dynamic work environment along with a multitude of benefits they can enjoy inside and outside of their work lives.
If this sounds right up your alley, please submit an application. We look forward to getting to know you!
Also, feel free to check out why:
Business Insider named us an “enterprise startup to bet your career on”
Forbes’ Cloud 100 recognized us as one of the top 100 private cloud companies in the world
Deloitte Tech Fast 500 ranked us as the 17th fastest growing tech company in the Bay Area, and 96th in North America
Quartz ranked us the #1 best company for remote workers
We are looking for an exceptional AI Researcher to join our growing AI team. In this role, you will design, build, deploy, and improve ML/LLM-powered services and features that power intelligent automation and AI-driven product experiences across the Workato platform. You will work closely with our Engineering, Product, and Design teams to define and track product metrics and evaluation strategies, design customer-facing experiments, and dive deep to provide actionable insights. This role is ideal for someone who combines strong ML/LLM intuition, software engineering skills, and a practical mindset for shipping reliable, scalable AI systems.
Build and improve AI services using LLMs and custom machine learning models for production use cases.
Design, develop, and operate ML/LLM systems end-to-end, from prototyping to deployment and monitoring.
Write high-quality Python code that is testable, maintainable, and efficient.
Improve validation, observability, and performance monitoring for ML services (quality, latency, reliability, cost).
Partner cross-functionally with product managers, platform engineers, and other stakeholders to ship AI-powered product capabilities.
Evaluate and improve existing implementations by identifying bottlenecks, bugs, and opportunities for optimization.
Design controlled experiments to test features for our AI-based products and analyze the results deeply to surface actionable insights.
Contribute to technical design and code reviews, helping raise engineering quality across the team.
Experiment and iterate on model behavior, prompting, retrieval, tool use, or orchestration strategies to improve user outcomes.
Experience with tool-use agents or workflow-aware AI systems.
Experience building AI products in enterprise SaaS environments.
Experience with A/B testing and statistical significance techniques.
Experience with LLMOps/MLOps tooling and practices (monitoring, evaluation pipelines, model rollout, CI/CD).
Experience working with modern data warehouses such as Amazon Redshift or Snowflake.
Job Req ID: 2724
Ready to apply?
Apply to Workato
Mission Summary:
We are seeking an experienced and visionary Principal-level Tech Lead Manager to build and lead our new Machine Learning (ML) Acceleration team. This pivotal role will drive the strategy, development, and execution of initiatives aimed at significantly accelerating ML model training. The ultimate goal is to drastically reduce the development cycle for new ML models and enable rapid hot-patching for issues within our deployed autonomous vehicle services.
You will be a hands-on leader, blending deep technical expertise in ML systems and performance optimization with strong leadership and people management skills. You will recruit, mentor, and grow a high-performing team of engineers, fostering a culture of innovation, collaboration, and continuous improvement.
What you'll be doing:
What we're looking for:
Motional is a driverless technology company making autonomous vehicles a safe, reliable, and accessible reality. We’re driven by something more.
Our journey is always people first.
We aren't just developing driverless cars; we're creating safer roadways, more equitable transportation options, and making our communities better places to live, work, and connect. Our team is made up of engineers, researchers, innovators, dreamers and doers, who are creating a technology with the potential to transform the way we move.
Higher purpose, greater impact.
We’re creating first-of-its-kind technology that will transform transportation. To do so successfully, we must design for everyone in our cities and on our roads. We believe in building a great place to work through a progressive, global culture that is diverse, inclusive, and ensures people feel valued at every level of the organization. Diversity helps us to see the world differently; it’s not only good for our business, it’s the right thing to do.
Scale up, not starting up.
Our team is behind some of the industry's largest leaps forward, including the first fully-autonomous cross-country drive in the U.S., the launch of the world's first robotaxi pilot, and operation of the world's longest-standing public robotaxi fleet. We’re driven to scale; we’re moving towards commercialization of our technology, and we need team members who are ready to embrace change and challenges.
Formed as a joint venture between Hyundai Motor Group and Aptiv, Motional is fundamentally changing how people move through their lives. Headquartered in Boston, Motional has operations in the U.S. and Asia. For more information, visit www.Motional.com and follow us on Twitter, LinkedIn, Instagram and YouTube.
Motional AD Inc. is an EOE. We celebrate diversity and are committed to creating an inclusive environment for all employees. To comply with Federal Law, we participate in E-Verify. All newly-hired employees are queried through this electronic system established by the DHS and the SSA to verify their identity and employment eligibility.
Ready to apply?
Apply to Motional
About WPP Media
WPP is the trusted growth partner for the world’s leading brands. With exceptional talent, trusted data and intelligence, and world-class partnerships – all united by our pioneering agentic marketing platform, WPP Open – we help clients navigate change, capture opportunity, and deliver transformational growth.
WPP Media is WPP's AI-driven media operating unit, bringing together media, data, and partnerships to deliver creative personalisation at scale. Connected through WPP Open and powered by Open Intelligence, clients see exactly where, how, and why their media investment is working.
For more information, visit wppmedia.com.
Role Summary and Impact
The Media Futures Group AI Squad is a small, agile, and autonomous innovation team dedicated to charting the course for AI transformation across APAC for one of WPP’s most important and strategic global technology clients.
By seamlessly integrating AI across media, creative, and production, this cross-functional squad delivers cutting-edge innovation and pioneering work to deliver more effective and efficient business outcomes.
The team’s mission is to solve specific business challenges by delivering demonstrable proof of AI-enabled marketing improvements, providing strategic guideposts for the broader organization.
Utilizing best-in-class tools like WPP Open and Gemini, you will innovate and deliver award-winning creative ideas while building, testing, and scaling solutions that enhance effectiveness, efficiency, and execution for our clients.
The AI Solutions Engineer is the master builder who executes that construction. This highly technical role provides the dedicated, hands-on development support required to move architectural visions out of isolated testing environments and into secure, Google-approved, enterprise-grade production infrastructure. The AI Solutions Engineer bridges the critical gap between a fragile, locally hosted prototype and a scalable, robust organizational tool capable of withstanding massive global traffic.
Responsibilities
Skills and Experience
Key Competencies
Life at WPP Media & Benefits
Our passion for shaping the next era of media is powered by our commitment to Be Extraordinary, investing in our employees to inspire transformational creativity. We also Lead Optimistically, firmly believing in and Championing Growth and Development for every individual. This commitment allows WPP Media employees to leverage the extensive global WPP Media & WPP networks to pursue their passions, build vital professional connections, and learn at the cutting edge of marketing and advertising.
We Create an Open environment built on trust and respect, where everyone feels they belong and has opportunities to progress. This inclusive culture is fostered through a variety of employee resource groups and frequent in-office events showcasing team wins, sharing thought leadership, and celebrating holidays and milestone events. Our comprehensive benefits package reflects this commitment, including competitive medical, vision, and dental insurance, significant paid time off, preferential partner discounts, and employee mental health awareness days.
WPP Media is an equal opportunity employer and considers applicants for all positions without discrimination or regard to characteristics. We believe the best work happens when we're together, fostering creativity, collaboration, and connection in this open and supportive environment. That's why we’ve adopted a hybrid approach, with teams in the office around four days a week. If you require accommodations or flexibility, please discuss this with the hiring team during the interview process.
Please note that while our philosophy is the same across WPP, benefits may vary by office/country.
Please read our Privacy Notice for more information on how we process the information you provide.
Ready to apply?
Apply to WPP Media
SimplifyNext is a fast-growing consulting and technology firm founded by veterans from top-tier consulting companies, focused on AI, Automation, and Application Platforms. Our mission is to drive business transformation across industries by combining strategic insight with deep technical expertise.
We work with leading enterprises and public sector organisations across Singapore and the Asia Pacific region to design, build, and operate scalable digital and automation platforms — delivering impactful transformations for global and local organisations alike.
Built as an agile practice, we mentor and grow the next generation of consulting and technology experts. We invest heavily in structured training and enablement programmes that help our teams expand across Intelligent Automation, Test Automation, AI-powered workflows, and Agentic AI solutions.
Recognised as one of the fastest-growing companies in Singapore and Asia Pacific, SimplifyNext is positioned as one of the most credible and ambitious digital transformation teams in the region.
We’re not hiring someone to run models. We’re hiring someone who builds systems that think.
At SimplifyNext, our AI Engineers are core to how we deliver transformation — designing and deploying intelligent systems that genuinely change how organisations operate. You won’t be a supporting act to another team. You’ll be the one building the agents, pipelines, and infrastructure that make our AI products real.
We work across public sector and enterprise, at the intersection of AI, automation, and product-led transformation. If you’re energised by hard engineering problems, care about production outcomes - not just research benchmarks - and want your work to reach real users at scale, read on.
Must-Have
Good to Have
This role is not for you if…
We partner with governments and enterprises to shift from project delivery to product thinking. That means working on problems that genuinely matter — healthcare access, business licensing, workforce development — and being held accountable for outcomes, not just deliverables.
High-impact problem spaces: Public sector and enterprise transformation, AI, and automation at scale across ASEAN and Asia Pacific.
Engineering-first culture: You’ll work alongside world-class architects, developers, and AI practitioners who set a high bar.
End-to-end ownership: You own problems fully — from architecture decisions to production operations — not just one slice.
Learning environment: Full certification sponsorship, structured learning paths, and direct mentorship from day one.
At SimplifyNext, we’re committed to building a team of curious, driven, and forward-thinking individuals who care deeply about creating meaningful impact through technology. If you’re excited by the opportunity to grow, collaborate, and shape the future of digital transformation across the region, we’d be happy to hear from you.
Ready to apply?
Apply to SimplifyNext
We are seeking a skilled MLOps & Agentic Platform Engineer. This role involves managing model registries, developing continuous training loops, and implementing A/B testing infrastructure. The ideal candidate will have a strong DevOps/MLOps background and be adept at deploying scalable microservices and building observability dashboards.
Responsibilities:
Qualifications:
Ready to apply?
Apply to Hyphen Connect Limited
Step into a career with ASM, where cutting edge technology meets collaborative culture.
For over 55 years ASM has been ahead of what’s next, at the forefront of innovation and what’s technologically possible. With more than 4,500 ASMers representing 70 nationalities, our people and our advanced semiconductor devices are playing a crucial role in trends such as 5G, cloud computing, AI, and autonomous driving. But we’re more than just a tech company. We value diversity, inclusion and sustainability as we strive to make a positive impact on the world. Our development programs help support your growth, shaping your future and pushing the boundaries of innovation to unleash potential.
Job's mission
As a Senior Specialist in AI/ML within ASM’s Operations Intelligence function, you will play a pivotal role in reimagining how data, artificial intelligence, and intelligent automation transform global operations. You will design, build, and deploy scalable AI and machine learning solutions that optimize semiconductor supply chain, manufacturing, and logistics performance. By translating complex operational challenges into impactful AI-driven solutions, you will help demonstrate the power of agentic AI, large language models, and advanced analytics—driving measurable business outcomes and shaping the future of smart manufacturing at scale.
What you will be working on
What we are looking for
What sets you apart
Apply today to be part of what’s next.
We make the tech that enables the chips in devices which improve lives around the world. We do this with an eye to the future, pushing the boundaries of what’s possible through cutting-edge innovation, and driving the next wave of technological breakthroughs that shape how we live, work, and connect.
To learn more about ASM, find us at asm.com and on LinkedIn, Facebook, Instagram, X and YouTube.
ASM is an equal opportunity employer and considers qualified applicants for employment without regard to race, color, religion, age, nationality, social or ethnic origin, sexual orientation, gender, gender identity or expression, marital status, pregnancy, political affiliation, disability, genetic information, veteran status, or any other characteristic protected by law.
Ready to apply?
Apply to ASM
Few compliance analytics roles offer this combination: genuinely novel problems, global scale, and the freedom to build rather than maintain. At a global crypto exchange, the financial crime data landscape is more varied, more real-time, and more analytically rich than almost anywhere in traditional finance. The regulatory environment is evolving quickly, the typologies are new territory, and the analytical work has direct impact on how the organisation detects and responds to financial crime. This role sits within the Product organisation and is dedicated entirely to that space, covering AML, sanctions, KYC/KYB, transaction monitoring, and beyond.
You will join a collaborative team of data scientists, data engineers, and business analysts, working closely with compliance stakeholders across the full range of financial crime domains. The role sits at the point where product engineering culture meets compliance depth, and you will have room to contribute across both. It is a role well suited to someone who enjoys varied, substantive work and values being part of a team that takes the quality of its output seriously.
The team is actively building toward an AI-native way of working. LLM-assisted coding, automated analytical pipelines, and AI-augmented investigation tools are already part of how the team operates or in active development. For someone who has wanted to apply AI seriously in a compliance context without cutting corners on rigour or auditability, this is an environment where that work is already underway and genuinely valued.
Ready to apply?
Apply to OKX
Building ML systems for compliance at a crypto exchange is a different kind of problem from most ML engineering work. The data spans on-chain transactions, fiat flows, KYC records, and behavioural signals that very few organisations have in one place. The problems are genuinely unsolved, the stakes are high, and the work has direct bearing on how a global exchange detects and responds to financial crime. For someone who wants their engineering work to matter beyond model accuracy metrics, this is an interesting place to be.
This role sits within a team of data scientists, analytics engineers, and compliance specialists who are building the analytical and AI infrastructure that powers the compliance function. You will work across the full ML lifecycle, from feature pipelines and model development through to deployment and monitoring, with close involvement from the domain experts who understand what the models need to do in practice.
AI-assisted development is how this team works. LLM-assisted coding, automated analytical pipelines, and AI-powered investigation tooling are part of the daily workflow. We are looking for engineers who already operate this way and who can raise the bar for what that looks like in a production compliance environment.
Ready to apply?
Apply to OKX
About the role
We are seeking an experienced Machine Learning Lead to helm our Machine Learning team.
In this pivotal role, you will be the engineering architect behind Vulcan’s core AI capabilities. You will act as the nexus between Research, Platform, and Product. Your mission is to translate cutting-edge findings on GenAI threats into robust, production-ready machine learning models that power our GenAI Security Guardrails (Blue Team) and Automated Vulnerability Assessment (Red Team).
Crucially, you will serve as the bridge between deep tech and business strategy, articulating technical constraints (like FLOPS and latency) to leadership and clients while guiding the engineering direction.
2. MLOps & Data Infrastructure:
3. Cross-Functional Implementation & Leadership:
4. Technical Strategy & Stakeholder Management:
Qualifications
Ready to apply?
Apply to AIFT
At HeyGen, our mission is to make visual storytelling accessible to all. Over the last decade, visual content has become the preferred method of information creation, consumption, and retention. But the ability to create such content, in particular videos, continues to be costly and challenging to scale. Our ambition is to build technology that equips more people with the power to reach, captivate, and inspire audiences.
Learn more at www.heygen.com. Visit our Mission and Culture doc here.
We are seeking a seasoned Technical Leader to build and scale the foundational compute infrastructure that powers our state-of-the-art AI models—from multimodal training data pipelines to high-throughput, low-latency video generation.
You will be the core engineer responsible for building the robust, efficient, and scalable platform that enables our research and production teams to rapidly iterate on HeyGen's generative video models. Your contributions will directly impact model performance, developer productivity, and the final quality of every AI-generated video.
Optimize GPU Utilization: Design and implement mechanisms to aggressively optimize GPU and cluster utilization across thousands of devices for inference, training, data processing, and large-scale deployment of our state-of-the-art video generation models.
Develop Large-Scale AI Job Framework: Build highly scalable, reliable frameworks for launching and managing massive, heterogeneous compute jobs, including multi-modal high-volume data ingestion/processing, distributed model training, and continuous evaluation/benchmarking.
Enhance Observability: Develop world-class observability, tracing, and visualization tools for our compute cluster to ensure reliability, diagnose performance bottlenecks (e.g., memory, bandwidth, communication).
Accelerate Pipelines: Collaborate closely with AI researchers and AI engineers to integrate innovative acceleration techniques (e.g., custom CUDA kernels, distributed training libraries) into production-ready, scalable training and inference pipelines.
Infrastructure Management: Champion the adoption and optimization of modern cloud and container technologies (Kubernetes, Ray) for elastic, cost-efficient scaling of our distributed systems.
We are looking for a highly motivated engineer with deep experience operating and optimizing AI infrastructure at scale.
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
5+ years of full-time industry experience in large-scale MLOps, AI infrastructure, or HPC systems.
Experience with data frameworks and standards such as Ray, Apache Spark, and LanceDB.
Strong proficiency in Python and a high-performance language such as C++ for developing core infrastructure components.
Deep understanding and hands-on experience with modern orchestration and distributed computing frameworks such as Kubernetes and Ray.
Experience with core ML frameworks such as PyTorch, TensorFlow, or JAX.
Master's or PhD in Computer Science or a related technical field.
Demonstrated Tech Lead experience, driving projects from conceptual design through to production deployment across cross-functional teams.
Prior experience building infrastructure specifically for Generative AI models (e.g., diffusion models, GANs, or large language models) where cost and latency are critical.
Proven background in building and operating large-scale data infrastructure (e.g., Ray, Apache Spark) to manage petabytes of multi-modal data (video, audio, text).
HeyGen is an Equal Opportunity Employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.
Ready to apply?
Apply to HeyGen
At HeyGen, our mission is to make visual storytelling accessible to all. Over the last decade, visual content has become the preferred method of information creation, consumption, and retention. But the ability to create such content, in particular videos, continues to be costly and challenging to scale. Our ambition is to build technology that equips more people with the power to reach, captivate, and inspire audiences.
Learn more at www.heygen.com. Visit our Mission and Culture doc here.
We are seeking a seasoned Software Engineer to build and scale the foundational compute infrastructure that powers our state-of-the-art AI models—from multimodal training data pipelines to high-throughput, low-latency video generation.
You will be the core engineer responsible for building the robust, efficient, and scalable platform that enables our research and production teams to rapidly iterate on HeyGen's generative video models. Your contributions will directly impact model performance, developer productivity, and the final quality of every AI-generated video.
Optimize GPU Utilization: Design and implement mechanisms to aggressively optimize GPU and cluster utilization across thousands of devices for inference, training, data processing, and large-scale deployment of our state-of-the-art video generation models.
Develop Large-Scale AI Job Framework: Build highly scalable, reliable frameworks for launching and managing massive, heterogeneous compute jobs, including multi-modal high-volume data ingestion/processing, distributed model training, and continuous evaluation/benchmarking.
Enhance Observability: Develop world-class observability, tracing, and visualization tools for our compute cluster to ensure reliability, diagnose performance bottlenecks (e.g., memory, bandwidth, communication).
Accelerate Pipelines: Collaborate closely with AI researchers and AI engineers to integrate innovative acceleration techniques (e.g., custom CUDA kernels, distributed training libraries) into production-ready, scalable training and inference pipelines.
Infrastructure Management: Champion the adoption and optimization of modern cloud and container technologies (Kubernetes, Ray) for elastic, cost-efficient scaling of our distributed systems.
We are looking for a highly motivated engineer with deep experience operating and optimizing AI infrastructure at scale.
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
5+ years of full-time industry experience in large-scale MLOps, AI infrastructure, or HPC systems.
Experience with data frameworks and standards such as Ray, Apache Spark, and LanceDB.
Strong proficiency in Python and a high-performance language such as C++ for developing core infrastructure components.
Deep understanding and hands-on experience with modern orchestration and distributed computing frameworks such as Kubernetes and Ray.
Experience with core ML frameworks such as PyTorch, TensorFlow, or JAX.
Master's or PhD in Computer Science or a related technical field.
Demonstrated Tech Lead experience, driving projects from conceptual design through to production deployment across cross-functional teams.
Prior experience building infrastructure specifically for Generative AI models (e.g., diffusion models, GANs, or large language models) where cost and latency are critical.
Proven background in building and operating large-scale data infrastructure (e.g., Ray, Apache Spark) to manage petabytes of multi-modal data (video, audio, text).
HeyGen is an Equal Opportunity Employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.
Ready to apply?
Apply to HeyGen
We are seeking highly motivated and curious individuals to join our Machine Learning team at Kronos Research. In this role, you will bridge the gap between advanced deep learning and financial markets, designing robust models for medium and high-frequency systematic trading strategies. You will manage the full ML lifecycle, from researching novel architectures to deploying scalable, low-latency models that directly drive trading revenue.
Key Responsibilities
Qualifications
Ready to apply?
Apply to Kronos Research