Site Reliability Engineer, Infrastructure and Assurance Services - USDS

The Infra SRE-Infrastructure-Assurance team extends the operability, observability, visibility, and automation of TikTok's infrastructure. We aim to provide holistic insights and solutions for TikTok infrastructure with minimal manual intervention. We're young and fast-growing. Our team values transparency, collaboration, hard work, and innovation. We believe in planning and long-term strategy rather than short-term gains. Join us in solving large-scale, complex issues in a hyper-growth team, embracing challenges with a fearless, curious mind.

In order to enhance collaboration and cross-functional partnerships, among other things, at this time, our organization follows a hybrid work schedule that requires employees to work in the office 3 days a week, or as directed by their manager/department. We regularly review our hybrid work model, and the specific requirements may change at any time.

Responsibilities:
- Perform SRE duties and operations on supported services in production, including but not limited to: on-call rotations, maintenance, change management, monitoring, incident response, capacity planning, and disaster recovery.
- Maximize system uptime, availability, and stability to ensure functional and performance SLAs are met.
- Contribute to existing documentation and build effective new documentation such as operational runbooks, SOPs, and SLAs/SLOs.
- Initiate and lead scripting, tooling, and automation efforts to streamline processes and reduce manual effort.
- Work cross-functionally and across regions with SRE/Dev/QA/PM teams to handle incidents and improve processes.
- Manage and prioritize tasks and projects for high productivity and precise delivery.

Minimum Qualifications:
- Bachelor's degree in Computer Science, a related field, or equivalent practical experience.
- Demonstrated experience in software development with one or more programming languages.
- Experience with Linux operating systems, networking, database concepts, monitoring, and shell scripting.
- Superb analytical, problem-solving, and critical-thinking skills.
- Excellent communicator, team player, self-starter, and fast learner.

Preferred Qualifications:
- Master's degree in Computer Science, Engineering, or a related field.
- Proficiency in any of the following languages: Python, GoLang, C++.
- Expertise in any of the following: SRE philosophy, AIOps, APM, Disaster Recovery.
- Expertise in any of these tech stacks: Kubernetes, ElasticSearch, ClickHouse, Message Queue, OpenTSDB, Service Mesh.

Candidates for this position must be legally authorized to work in the United States. This position is not eligible for visa sponsorship or support.

Location:
San Jose