Senior DevOps Engineer
Job Description
Aldea is a next-generation AI company focused on voice-based clinical and expert applications. Our flagship product, Advisor, uses proprietary AI to scale the impact of world-class minds across personal development, finance, parenting, relationships, and more. We’re on a mission to bring the best expert guidance in the world to people navigating real-life challenges — whether that’s parenting, relationships, health, or personal growth. Our consumer products are voice-first, AI-native, and designed to meet people where they are.
As a multidisciplinary team of builders, researchers, and product thinkers, we value clear thinking, sharp writing, and strong user-first intuition.
This is a rare opportunity to join an early-stage startup that will help define a new category.
Why This Role Matters
We're building AI infrastructure that scales. We run 5 distinct environments with complex multi-cluster Kubernetes deployments, and we need infrastructure experts who can architect systems for production readiness while maintaining security and operational excellence. This isn't just maintaining servers: you'll be designing the backbone that powers our AI platform across development, staging, and production environments.
What You'll Own
Multi-Environment Kubernetes Architecture
- Manage 5 distinct environments (NMS, Sandbox, Development, Staging, Production) with different security and access requirements
- Design redundancy and failover mechanisms for our centralized NMS hub that manages all environments
Infrastructure as Code Excellence
- Develop and maintain Pulumi-based infrastructure using Python
- Manage complex cross-environment dependencies and VPC peering relationships
- Automate resource provisioning and configuration management
Zero-Trust Security Implementation
- Implement and maintain certificate-based VPN access with internal DNS resolution
- Configure WAF, security groups, and network policies for VPN-only access
- Manage HashiCorp Vault integration for secure credential management across environments
Comprehensive Observability
- Deploy and configure Prometheus, Grafana, Loki, Jaeger, and CloudWatch
- Implement unified monitoring across distributed infrastructure
- Design alerting and incident response procedures
API Platform Management
- Deploy and maintain the centralized API that manages all environments from the NMS hub
- Implement automation for managing training jobs and inference across multiple Kubernetes clusters
- Optimize GPU and CPU resource utilization across node groups
Experience & Technical Depth
- 5+ years in DevOps, SRE, or infrastructure engineering
- Expert-level Kubernetes experience with EKS and multi-cluster management
- Strong Python programming skills for infrastructure automation and API development
Infrastructure & Cloud Expertise
- Infrastructure as Code expertise with Pulumi, Terraform, or similar tools
- Deep AWS knowledge: VPC, EKS, ECR, S3, CloudWatch, IAM, and networking
- Linux system administration and containerization with Docker
Monitoring & Security
- Hands-on experience with Prometheus, Grafana, and centralized logging systems
- Network security experience including VPN, firewalls, and certificate management
- Understanding of zero-trust architecture principles
- Machine Learning infrastructure experience (GPU clusters, model serving, ML pipelines)
- HashiCorp Vault administration and integration
- GitOps experience with ArgoCD or similar tools
- Service mesh experience (Istio, Linkerd)
- Database administration (PostgreSQL, Redis, Elasticsearch)
- CI/CD pipeline design and multi-cloud infrastructure experience
Infrastructure Stability
- Zero unplanned downtime across production environments
- Successfully implement disaster recovery procedures with tested failover mechanisms
- Achieve 99.9% uptime SLA across all critical services
Security & Compliance
- Complete VPN-only access implementation with certificate-based authentication
- Successfully integrate HashiCorp Vault across all environments
- Pass security audit with comprehensive logging and monitoring in place
Operational Excellence
- Reduce infrastructure provisioning time by 50% through automation
- Implement comprehensive monitoring with a mean time to detection under 5 minutes
- Raise GPU utilization above 80% across training workloads
Architectural Complexity
- The NMS hub is a single point of failure—you'll architect redundancy without compromising centralized management
- Balance VPN-only security requirements with operational efficiency for remote team access
- Manage complex service discovery across 5 interconnected environments
Scale & Performance
- Optimize GPU resources across competing training and inference workloads
- Implement cost optimization strategies while maintaining performance requirements
- Design monitoring systems that scale with our infrastructure growth
Benefits
Compensation & Benefits
We are a well-funded, seed-stage company preparing for launch. We offer:
- Competitive base salary
- Performance-based bonus tied to goal achievement
- Equity participation
- Comprehensive benefits, including health, dental, vision, and paid time off
- Flexible work environment: based in Miami, hybrid OK, remote considered
Location: Miami
Category: Technology