Join a cross-functional SRE community embedded in a product-oriented Data Platform team of 40–50 engineers, analysts, and data scientists across France and Spain. You'll drive the reliability and scalability of a next-generation Lakehouse platform, anchored on Trino, Iceberg, and on-prem object storage, while leading the transition from public cloud to a resilient hybrid/on-prem architecture.
🎯 TASKS
Platform Reliability & SRE foundations
Own reliability of core data services: Trino, Iceberg, S3 / Ceph, Kafka, Kafka Connect, Schema Registry
Define and enforce SLIs/SLOs, error budgets, and on-call runbooks — solid SRE foundations are non-negotiable
Build full-stack observability with Prometheus and Grafana: metrics, dashboards, alerting pipelines, and anomaly detection
Manage and harden PostgreSQL clusters via Patroni for high-availability control-plane services
Kafka ecosystem — Connect & Schema governance
Operate and scale Kafka Connect clusters: connector lifecycle, offset management, dead-letter queues, and task rebalancing
Maintain the Schema Registry as the single source of truth for Avro/Protobuf/JSON schemas — enforce compatibility rules and schema evolution policies
Monitor consumer lag, connector throughput, and broker health via Prometheus JMX exporters and Grafana dashboards
Ensure end-to-end data contract integrity between producers and Iceberg/S3 consumers
Kubernetes, Kube-in-Kube & Crossplane
Operate production Kubernetes clusters (GKE/EKS + on-prem) — capacity planning, upgrades, PodDisruptionBudgets, resource quotas
Architect and manage Kube-in-Kube topologies to provide strong tenant isolation for data platform workloads — each team gets a dedicated virtual cluster without the overhead of a full physical cluster
Automate infrastructure and resource provisioning with Crossplane: define composite resources (XRDs) so data teams can self-serve Kafka topics, Trino namespaces, and S3 buckets through Kubernetes-native APIs
Maintain GitOps pipelines for platform deployment and configuration drift detection
Lakehouse architecture & cloud migration
Migrate from a public cloud data warehouse to the VeepeeCloud Iceberg-based lakehouse, managing coexistence, schema evolution, and time travel
Architect resilient ingestion, transformation, and serving layers around Trino + S3
Optimize Trino query performance: memory limits, spilling, cost-based optimizer tuning
Agentic & developer enablement
Build agentic self-service tooling so data teams can provision Trino/Iceberg resources and Kafka Connect pipelines autonomously via Crossplane — reducing toil and ops bottlenecks
Develop FinOps dashboards (compute, storage, query cost) with Grafana and Prometheus-based cost exporters
Write clear technical documentation, runbooks, and internal ADRs
Multi-DC resilience & DRP
Design and implement multi-datacenter strategies across FR1 / NL1 — active-active and active-passive topologies
Leverage Fast Erasure Coding on object storage (Ceph/S3) to maximize durability with minimal replication overhead
Ensure data replication consistency across sites for Iceberg table metadata, Trino catalogs, and Schema Registry subjects
Lead DRP exercises: failover playbooks, RTO/RPO validation, postmortems
👉 MUST HAVE skills
Strong experience with Kubernetes in production environments
Experience with Kube-in-Kube technologies (vCluster or similar)
Solid understanding of SRE principles (SLIs/SLOs, error budgets)
Experience with Prometheus and Grafana
Experience with Infrastructure as Code (Terraform or similar)
Experience with Crossplane
Familiarity with GitOps workflows
Experience with S3 and object storage technologies
Experience with PostgreSQL and Patroni
Experience with Kafka, Kafka Connect, and Schema Registry
Fluent in English
👉 NICE TO HAVE skills
Experience with multi-datacenter architectures (FR1/NL1)
Experience designing disaster recovery plans and failover playbooks
Experience with Fast Erasure Coding (Ceph/S3)
Experience with Trino, Iceberg, and Lakehouse technologies
Experience with Airflow
Experience building agentic self-service platforms
Knowledge of FinOps and cost optimization practices
Programming experience in Python, Java, or Go
✅ BENEFITS
Variable bonus
E-learning platform (self-education courses)
Meetups & conferences (local and international)
Flexible office — up to 2 days remote
International teams (France & Spain)
⚙️ RECRUITMENT PROCESS
1️⃣ 30-minute HR Screen with a Veepeeᵀᵉᶜʰ Recruiter
2️⃣ General Technical exchange
3️⃣ Technical exchange with the manager
4️⃣ Team Interview
We are convinced that it is up to you to define the way you work, to develop yourself and to progress.
At Veepee we guarantee that you can just be yourself!
In support of diversity and inclusion, Veepee is committed to reviewing all applications received on an equal basis.
🔗 COMPANY
For more information about our ecosystem: https://careers.veepee.com/en/home-page-en/