About the role

Veepee · Hybrid

Join a cross-functional SRE community embedded in a product-oriented Data Platform team of 40–50 engineers, analysts, and data scientists across France and Spain. You'll drive the reliability and scalability of a next-generation Lakehouse platform (anchored on Trino, Iceberg, and on-prem object storage) while leading the transition from public cloud to a resilient hybrid/on-prem architecture.

🎯 TASKS

Platform Reliability & SRE foundations

Own reliability of core data services: Trino, Iceberg, S3 / Ceph, Kafka, Kafka Connect, Schema Registry

Define and enforce SLIs/SLOs, error budgets, and on-call runbooks — solid SRE foundations are non-negotiable

Build full-stack observability with Prometheus and Grafana: metrics, dashboards, alerting pipelines, and anomaly detection

Manage and harden PostgreSQL clusters via Patroni for high-availability control-plane services

Kafka ecosystem — Connect & Schema governance

Operate and scale Kafka Connect clusters: connector lifecycle, offset management, dead-letter queues, and task rebalancing

Maintain the Schema Registry as the single source of truth for Avro/Protobuf/JSON schemas — enforce compatibility rules and schema evolution policies

Monitor consumer lag, connector throughput, and broker health via Prometheus JMX exporters and Grafana dashboards

Ensure end-to-end data contract integrity between producers and Iceberg/S3 consumers

Kubernetes, Kube-in-Kube & Crossplane

Operate production Kubernetes clusters (GKE/EKS + on-prem) — capacity planning, upgrades, PodDisruptionBudgets, resource quotas

Architect and manage Kube-in-Kube topologies to provide strong tenant isolation for data platform workloads — each team gets a dedicated virtual cluster without the overhead of a full physical cluster

Automate infrastructure and resource provisioning with Crossplane: define CompositeResourceDefinitions (XRDs) so data teams can self-serve Kafka topics, Trino namespaces, and S3 buckets through Kubernetes-native APIs

Maintain GitOps pipelines for platform deployment and configuration drift detection

Lakehouse architecture & cloud migration

Migrate from a public cloud data warehouse to the VeepeeCloud Iceberg-based lakehouse, managing coexistence, schema evolution, and time-travel

Architect resilient ingestion, transformation, and serving layers around Trino + S3

Optimize Trino query performance: memory limits, spilling, cost-based optimizer tuning

Agentic & developer enablement

Build agentic self-service tooling so data teams can provision Trino/Iceberg resources and Kafka Connect pipelines autonomously via Crossplane — reducing toil and ops bottlenecks

Develop FinOps dashboards (compute, storage, query cost) with Grafana and Prometheus-based cost exporters

Write clear technical documentation, runbooks, and internal ADRs

Multi-DC resilience & DRP

Design and implement multi-datacenter strategies across FR1 / NL1 — active-active and active-passive topologies

Leverage Fast Erasure Coding on object storage (Ceph/S3) to maximize durability with minimal replication overhead

Ensure data replication consistency across sites for Iceberg table metadata, Trino catalogs, and Schema Registry subjects

Lead DRP exercises: failover playbooks, RTO/RPO validation, postmortems

👉 MUST HAVE skills

Strong experience with Kubernetes in production environments
Experience with Kube-in-Kube technologies (vCluster or similar)
Solid understanding of SRE principles (SLIs/SLOs, error budgets)
Experience with Prometheus and Grafana
Experience with Infrastructure as Code (Terraform or similar)
Experience with Crossplane
Familiarity with GitOps workflows
Experience with S3 and object storage technologies
Experience with PostgreSQL and Patroni
Experience with Kafka, Kafka Connect, and Schema Registry
Fluent in English

👉 NICE TO HAVE skills

Experience with multi-datacenter architectures (FR1/NL1)

Experience designing disaster recovery plans and failover playbooks

Experience with Fast Erasure Coding (Ceph/S3)

Experience with Trino, Iceberg, and Lakehouse technologies

Experience with Airflow

Experience building agentic self-service platforms

Knowledge of FinOps and cost optimization practices

Programming experience in Python, Java, or Go

✅ BENEFITS

Variable bonus

E-learning platform (self-education courses)

Meetups & conferences (local and international)

Flexible office — up to 2 days remote

International teams (France & Spain)

⚙️ RECRUITMENT PROCESS

1️⃣ 30-minute HR Screen with a Veepeeᵀᵉᶜʰ Recruiter

2️⃣ General Technical exchange

3️⃣ Technical exchange with the manager

4️⃣ Team Interview

We are convinced that it is up to you to define the way you work, to develop yourself and to progress.

At Veepee we guarantee that you can just be yourself!

In support of diversity and inclusion, Veepee is committed to reviewing all applications received on an equal basis.

🔗 COMPANY

For more information about our ecosystem: https://careers.veepee.com/en/home-page-en/
