Senior / Staff Site Reliability Engineer, Storage


About Fluidstack
Fluidstack is building GPU supercomputers for top AI labs, governments, and enterprises. Our customers include Mistral, Poolside, Black Forest Labs, Meta, and more.

Our team is small, highly motivated, and focused on providing a world-class supercomputing experience. We put our customers first in everything we do, working hard not just to win the sale, but to win repeat business and customer referrals. We hold ourselves and each other to high standards. We expect you to care deeply about the work you do, the products you build, and the experience our customers have in every interaction with us. You must work hard, take ownership from inception to delivery, and approach every problem with an open mind and a positive attitude. We value effectiveness, competence, and a growth mindset.

About the Role
Our Senior / Staff Site Reliability Engineer (Storage) is the connective tissue of Fluidstack’s platform. As part of a small, senior team you’ll own the availability, performance and cost-efficiency of our storage, compute and networking layers. You’ll combine software engineering, systems thinking and a relentless customer focus to keep our SLIs and SLOs razor-sharp, and to raise the bar every quarter.

Focus
Automate everything. Replace repetitive ops with Python / Go tooling, Kubernetes operators and GitOps workflows.
Tune the stack. Profile and optimise storage I/O paths, hypervisors and kernel parameters to crush tail-latency.
Harden for scale. Design failure-tolerant architectures, run game-days and embed chaos engineering to validate them.
Own incidents. Lead 24×7 on-call rotations, drive blameless post-mortems and turn lessons into lasting fixes.
Partner with engineers. Review designs, instrument new services and evangelise reliability patterns across product teams.
Measure what matters. Define SLIs/SLOs that map directly to customer experience and build dashboards/alerts to track them.
Drive continuous improvement. Identify tech debt, propose roadmap items and mentor engineers on reliability best practice.

About you
10+ yrs professional SRE / production-engineering experience, including large-scale architecture & design.
Proficiency in Python, Go or similar; able to write clean, tested, maintainable code.
Deep hands-on knowledge of Docker, Kubernetes, Terraform/Ansible, and modern CI/CD (GitLab, GitHub Actions, etc.).
Expertise in observability stacks (Prometheus, Grafana, OpenTelemetry) and incident-management workflows.
Strong grasp of Linux internals, TCP/IP networking and security best practices.
Excellent written & verbal communication; comfortable leading cross-functional deep-dives.

Benefits
Competitive total compensation package (cash + equity).
Retirement or pension plan, in line with local norms.
Health, dental, and vision insurance.
Generous PTO policy, in line with local norms.

Location: San Francisco, CA
Salary: $200
Category: Engineering
