Senior Site Reliability Engineer, TikTok Embedded SRE

The team is responsible for supporting various businesses within TikTok, ensuring that their services perform optimally at a reasonable operational cost, using in-house and external tooling. The team takes a data-driven approach to improving the observability and operability of these services, ensuring the stability of the business around the clock.
- Lead and mentor members of the team
- Participate in on-call for the service you are supporting, identifying common issues across services and finding systematic solutions that resolve them at scale
- Participate in the construction of operation and maintenance tools and platforms, and identify strategies to tackle common problems
- Ensure stability of the service through proactive risk monitoring and identification of new risks
- Identify key system risks through comprehensive data operations to review system health
- Accumulate best practices in operations, and propose new approaches to improve operations in TikTok
Minimum Qualifications
- Bachelor's degree or above in Computer Science or a related field
- Solid basic knowledge of computer software; understanding of the relevant principles of the Linux operating system, storage, network I/O, etc.
- Familiar with one or more programming languages, such as Python/Go/Java/PHP/C/C++
- Ability to solve problems systematically, good communication skills, and a strong sense of responsibility
- 3 years of experience in designing, analyzing, and troubleshooting large-scale distributed systems
- 2 years of experience leading projects and providing technical leadership

Preferred Qualifications
- 5+ years of relevant work experience at a large-scale internet business
- Solid understanding of Docker, Kubernetes, or other container orchestration systems
- Experience with developing, deploying, and/or maintaining a microservices architecture with Kubernetes
Location:
San Jose
