Senior Site Reliability Engineer, TikTok Embedded SRE
The team supports various businesses within TikTok, ensuring that their services perform optimally at a reasonable operational cost, using both in-house and external tooling. The team takes a data-driven approach to improving the observability and operability of these services, ensuring business stability around the clock.
- Lead and mentor members of the team
- Participate in the on-call rotation for the service you support, identify common issues across services, and find systematic solutions that resolve them at scale
- Participate in the construction of operation and maintenance tools and platforms, and identify strategies to tackle common problems
- Ensure stability of the service through proactive risk monitoring and identification of new risks
- Identify key system risks through comprehensive data operations to review system health
- Accumulate best practices in operations, and propose new approaches to improve operations at TikTok
Minimum Qualifications
- Bachelor's degree or above, majoring in Computer Science or a related field
- Solid basic knowledge of computer software; understanding of the relevant principles of the Linux operating system, storage, network I/O, etc.
- Familiar with one or more programming languages, such as Python/Go/Java/PHP/C/C++
- Have the ability to solve problems systematically, good communication skills, and a strong sense of responsibility
- 3 years of experience in designing, analyzing, and troubleshooting large-scale distributed systems
- 2 years of experience leading projects and providing technical leadership
Preferred Qualifications
- 5+ years of relevant work experience at a large-scale internet business
- Solid understanding of Docker, Kubernetes, or other container orchestration systems
- Experience developing, deploying, and/or maintaining a microservices architecture with Kubernetes
Location: San Jose