Site Reliability Engineer, Infrastructure and Assurance Services

Site Reliability Engineer, Infrastructure and Assurance Services - USDS


The Infra SRE-Infrastructure-Assurance team extends the operability, observability, visibility, and automation of TikTok's infrastructure. We aim to provide holistic insights and solutions for TikTok infrastructure with minimal manual intervention. We're young and fast-growing. Our team values transparency, collaboration, hard work, and innovation. We believe in planning and long-term strategy rather than short-term gains. Join us in solving large-scale, complex issues on a hyper-growth team, embracing challenges with a fearless, curious mind.

To enhance collaboration and cross-functional partnerships, among other things, our organization currently follows a hybrid work schedule that requires employees to work in the office 3 days a week, or as directed by their manager/department. We regularly review our hybrid work model, and the specific requirements may change at any time.

Responsibilities:
- Perform SRE duties and operations on supported production services, including but not limited to on-call rotations, maintenance, change management, monitoring, incident response, capacity planning, and disaster recovery.
- Maximize system uptime, availability, and stability to meet functional and performance SLAs.
- Contribute to existing documentation and build effective new documentation, such as operational runbooks, SOPs, and SLAs/SLOs.
- Initiate and lead scripting, tooling, and automation efforts to streamline processes and minimize manual effort.
- Work cross-functionally and across regions with SRE/Dev/QA/PM teams to handle incidents and improve processes.
- Manage and prioritize tasks and projects for high productivity and precise delivery.

Minimum Qualifications:
- Bachelor's degree in Computer Science, a related field, or equivalent practical experience.
- Demonstrated experience in software development with one or more programming languages.
- Experience with Linux operating systems, networking, database concepts, monitoring, and shell scripting.
- Superb analytical, problem-solving, and critical-thinking skills.
- Excellent communicator, team player, self-starter, and fast learner.

Preferred Qualifications:
- Master's degree in Computer Science, Engineering, or a related field.
- Proficiency in any of the following languages: Python, Go, C++.
- Expertise in any of the following: SRE philosophy, AIOps, APM, disaster recovery.
- Expertise in any of these tech stacks: Kubernetes, Elasticsearch, ClickHouse, message queues, OpenTSDB, service mesh.

Candidates for this position must be legally authorized to work in the United States. This position is not eligible for visa sponsorship or support.


Location: San Jose