About the role

Deeproute.ai

Focus

Multimodal Foundation Models · Representation Learning · Method Innovation

We are looking for strong technical builders and researchers whose understanding of foundation models and representation learning goes beyond applying existing frameworks.

Ideal candidates should have:

Strong experimental rigor
Solid systems and modeling intuition
Hands-on engineering ability
Interest in scalable multimodal AI systems for real-world autonomy

We value people who can bridge research and production, and who care about robustness, scalability, efficiency, and practical deployment in large-scale autonomous driving systems.

Responsibilities

1. Large-Scale Foundation Model Pretraining

Develop scalable pretraining pipelines for large-scale multimodal driving data
Design and optimize training strategies for:

Vision-language-action models
Video foundation models
Long-context temporal modeling
Multimodal representation alignment

Improve:

Training stability
Data efficiency
Scaling efficiency
Representation robustness

Work on distributed training systems and large-scale model optimization using frameworks such as:

PyTorch Distributed
DeepSpeed
Megatron-LM

2. Representation Learning & Method Innovation

Design and improve self-supervised and multimodal learning methods for real-world autonomous driving systems
Conduct architecture-level research on:

Vision Transformers (ViT)
Video / temporal architectures
Multimodal fusion and alignment
Embedding and retrieval systems
Long-context and memory-efficient architectures

Explore and improve:

Pretraining objectives
Loss functions
Training paradigms
Generalization and robustness

Analyze model behavior through:

Rigorous ablation studies
Failure case analysis
Representation probing and evaluation

3. Efficient Foundation Models & Scalable Deployment

Improve the efficiency, scalability, and deployability of large multimodal foundation models for real-world autonomous driving systems
Work on areas such as:

Model quantization
Knowledge distillation
Efficient attention mechanisms
Sparse architectures and Mixture-of-Experts (MoE)
Long-context and memory-efficient modeling
Inference acceleration and serving optimization
Training and inference system efficiency

Optimize model throughput, latency, memory usage, and deployment performance for large-scale production environments

Requirements

MS or PhD in:

Computer Vision
Machine Learning
Robotics
Computer Science
Related fields

Strong understanding of:

Foundation models
Self-supervised learning
Representation learning
Multimodal learning
Large-scale pretraining

Hands-on experience with methods such as:

CLIP
DINO / DINOv2
MAE
Contrastive learning
Masked modeling
MoE or scalable transformer architectures

Experience with one or more of the following is highly valued:

Video foundation models
Long-context modeling
Retrieval systems
Efficient inference
Distributed training
Model compression and deployment optimization

A strong publication record at top-tier venues is preferred:

CVPR
ICCV
ECCV
NeurIPS
ICLR
ICML
