Build — infrastructure & cloud
Work mode: Remote
Site Reliability Engineer (SRE)
Once AI is in production it has to stay up and behave. You own reliability: observability, scaling, incident response and the operational practices that keep client AI systems dependable under real load.
What you’ll do
- Build observability for AI systems: metrics, logs, traces and alerting.
- Define SLOs and own scaling, reliability and incident response.
- Run postmortems and turn incidents into durable fixes.
- Monitor model behaviour, latency and cost in production.
What we look for
- SRE or production operations experience on real systems.
- Strong observability, debugging and incident-handling skills.
- Comfort with cloud, containers and orchestration.
- Calm judgement under pressure.
Nice to have
- Experience operating ML or LLM systems in production.
- Professional French and English.
Apply for this role
Send your CV and a short note on why this role fits. You will hear back from a person, not a tracker.
Work on AI that ships
Real client systems in production, not demos. If that is the work you want, we should talk.