Site Reliability Engineer (SRE)

Once AI is in production it has to stay up and behave. You own reliability: observability, scaling, incident response and the operational practices that keep client AI systems dependable under real load.

What you’ll do

Build observability for AI systems: metrics, logs, traces and alerting.
Define SLOs and own scaling, reliability and incident response.
Run postmortems and turn incidents into durable fixes.
Monitor model behaviour, latency and cost in production.

What we look for

SRE or production operations experience on real systems.
Strong observability, debugging and incident-handling skills.
Comfort with cloud, containers and orchestration.
Calm judgement under pressure.

Nice to have

Experience operating ML or LLM systems in production.
Professional proficiency in French and English.

Apply for this role

Send your CV and a short note on why this role fits. You will hear back from a person, not a tracker.

Apply by email: [email protected]

Work on AI that ships

Real client systems in production, not demos. If that is the work you want, we should talk.
