Embedding VC

Founding Platform & Reliability Engineer

🎨 About OpenArt

OpenArt is an AI Storytelling and Visual Creation Platform used by millions worldwide. We’re building the next generation of creative tools powered by cutting-edge AI, enabling anyone to create videos, visuals, characters, and stories with unprecedented speed and imagination. We believe the future of creativity is AI-native, and we're shaping that future.

🚀 Why Join OpenArt

  • Small team, massive surface area: senior engineers own real systems, not slices.

  • Ship at real scale: your work goes to millions of users, fast.

  • Founder-led engineering culture: both founders are technical and deeply involved in product and architecture.

  • AI-native product: you’ll design how cutting-edge AI models are exposed as real user experiences.

  • High ownership, low process: we value judgment, clarity, and speed over bureaucracy.

  • 7-10X growth in revenue for the past 2 years. Now you’ll play a critical role in helping the company scale to the next stage.

🎯 About the Role

We’re looking for a Founding Platform & Reliability Engineer who can own the design, scalability, and reliability of our entire infrastructure stack end-to-end, from high-level architecture decisions to hands-on implementation, observability, and cost optimization.

This is NOT a role for traditional operators or narrow DevOps specialists. You should be comfortable working across cloud infrastructure, distributed systems, backend services, and developer tooling, making pragmatic decisions that balance product velocity, system reliability, and cost efficiency, especially in a fast-evolving, AI-native environment.

You will work closely with the founders and product engineers to design and evolve the platform that powers OpenArt, shaping key decisions such as serverless vs. containerized architecture, multi-provider AI reliability, and scaling systems to millions of users, while acting as a force multiplier for the entire engineering team.

🛠 What You’ll Do

  • Define and operationalize SLOs/SLIs across critical user journeys (generation, editing, payments/credits, uploads, etc.), and use them to drive prioritization (including error budgets)

  • Participate in an on-call rotation and lead incident response improvements (alert quality, runbooks, escalation paths). Establish blameless postmortems and ensure action items are implemented.

  • Implement reliability patterns at external boundaries, and build mechanisms for per-vendor “health” measurement and routing/fallback policies.

  • Stand up end-to-end observability: structured logs, metrics, traces, and dashboards that let engineers answer “what broke” and “why now” quickly.

  • Build deploy safety practices: automated rollbacks, canarying, feature-flag patterns, and reliable CI/CD gates.

  • Own the direction of our infrastructure architecture, including defining when serverless is the right approach versus when we should evolve toward containerized or more managed systems, and guiding the team through those transitions as we scale.

  • Build cost observability and cost-control primitives: per-request cost attribution, caching strategies, capacity planning, and budget alerts.

  • Act as a senior technical voice, influencing architecture, tooling, engineering best practices, and raising the overall engineering bar.

🧑‍💻 What We’re Looking For

Core Requirements

  • 5+ years building and operating production systems where reliability and scaling are core.

  • Strong software engineering skills (you can ship production code, not just configure tools).

  • Cloud-native experience (AWS or GCP), ideally with serverless/event-driven systems and at least one container path (Fargate/ECS/Cloud Run/Kubernetes).

  • Deep knowledge of observability practices: dashboards, alerting, distributed tracing, and incident response maturity.

  • Ability to design resilient interactions with external dependencies (timeouts, retries/backoff/jitter, circuit breakers, idempotency).

  • Ability to communicate tradeoffs clearly to non-infra peers.

  • Ability to operate with ambiguity and define problems before solving them.

Nice to Have

  • Have designed an internal platform abstraction (e.g., API gateway / workflow engine / job orchestration) that enabled multiple product teams to ship faster with fewer incidents.

  • Have shipped concrete reliability outcomes, e.g., reduced MTTR, improved SLO attainment, lowered p95 latency, or reduced infra/unit costs.

  • Prior startup experience or experience owning large surface-area features.

⚙️ Tech Stack You’ll Work With

GCP, Cloud Run, Modal, Upstash, Sentry, Amplitude, Firebase, Redis, React / Next.js, Node.js, TypeScript, Python, etc.

💰 Compensation

  • Competitive base salary and bonus program

  • Equity - meaningful ownership in what you build

  • High autonomy, high growth environment

🌍 Work Setup

  • Bay Area preferred (hybrid allowed)

  • Visa sponsorship available

  • We’ll consider remote
