Site Reliability Engineer, Infrastructure and Assurance Services - USDS
The Infra SRE-Infrastructure-Assurance team extends the operability, observability, visibility, and automation of TikTok's infrastructure. We aim to provide holistic insights and solutions for TikTok infrastructure with minimal manual intervention. We're young and fast-growing. Our team values transparency, collaboration, hard work, and innovation. We believe in planning and long-term strategy rather than short-term gains. Join us in solving large-scale, complex issues on a hyper-growth team and embracing challenges with a fearless, curious mind.
In order to enhance collaboration and cross-functional partnerships, among other things, at this time, our organization follows a hybrid work schedule that requires employees to work in the office 3 days a week, or as directed by their manager/department. We regularly review our hybrid work model, and the specific requirements may change at any time.
Responsibilities:
- Perform SRE duties and operations on supported services in production, including but not limited to: on-call rotations, maintenance, change management, monitoring, incident response, capacity planning, disaster recovery.
- Maximize system uptime, availability, and stability to ensure functional and performance SLAs are met.
- Contribute to existing documentation and build effective new documentation such as operational runbooks, SOPs, and SLA/SLO definitions.
- Initiate and lead scripting/tooling/automation to streamline processes and minimize manual effort.
- Work cross-functionally and across regions with SRE/Dev/QA/PM teams to handle incidents and improve processes.
- Manage and prioritize tasks/projects for high productivity and precise deliveries.
Minimum Qualifications:
- Bachelor's degree in Computer Science, a related field, or equivalent practical experience.
- Demonstrated experience in software development with one or more programming languages.
- Experience with Linux operating systems, networking, database concepts, monitoring, and shell scripting.
- Superb analytical, problem-solving, and critical-thinking skills.
- Excellent communicator, team player, self-starter, and fast learner.
Preferred Qualifications:
- Master's degree in Computer Science, Engineering, or a related field.
- Proficient in any of the following languages: Python, Go, C++.
- Expertise in any of the following: SRE philosophy, AIOps, APM, Disaster Recovery.
- Expertise in any of these tech stacks: Kubernetes, Elasticsearch, ClickHouse, Message Queue, OpenTSDB, Service Mesh.
Candidates for this position must be legally authorized to work in the United States. This position is not eligible for visa sponsorship or support.
Location: San Jose