Build — infrastructure & cloud
Work mode: Remote

Site Reliability Engineer (SRE)

Once AI is in production, it has to stay up and behave. You own reliability: observability, scaling, incident response and the operational practices that keep client AI systems dependable under real load.

What you’ll do

  • Build observability for AI systems: metrics, logs, traces and alerting.
  • Define SLOs and own scaling, reliability and incident response.
  • Run postmortems and turn incidents into durable fixes.
  • Monitor model behaviour, latency and cost in production.

What we look for

  • SRE or production operations experience on real systems.
  • Strong observability, debugging and incident-handling skills.
  • Comfort with cloud, containers and orchestration.
  • Calm judgement under pressure.

Nice to have

  • Experience operating ML or LLM systems in production.
  • Professional French and English.

Apply for this role

Send your CV and a short note on why this role fits. You will hear back from a person, not a tracker.

Work on AI that ships

Real client systems in production, not demos. If that is the work you want, we should talk.