Lead Monitoring Systems Engineer - Remote

Description and Requirements
"At BMC trust is not just a word - it's a way of life!"

We are an award-winning, equal opportunity, culturally diverse, fun place to be. Giving back to the community drives us to be better every single day. Our work environment allows you to balance your priorities, because we know you will bring your best every day. We will champion your wins and shout them from the rooftops. Your peers will inspire, drive, support you, and make you laugh out loud!

We help our customers free up time and space to become an Autonomous Digital Enterprise that conquers the opportunities ahead, and we are relentless in the pursuit of innovation!

Our SaaS Ops department focuses on delivering SaaS excellence and a great SaaS experience for our customers. We continuously grow by adopting cutting-edge technologies and investing in innovation. Our team is a global and versatile group of professionals, so if you're looking for a place where your ideas will be heard, this is the place for you!

We are seeking a highly skilled and experienced Lead Monitoring Administrator to join our team. The ideal candidate will have extensive expertise in BMC monitoring applications such as TrueSight Operations Management, BMC Helix Operations Management, and BMC Patrol, and will be responsible for developing, implementing, and maintaining robust monitoring solutions to ensure the health, performance, and reliability of our applications and infrastructure. This role requires a blend of software development, system administration, and operational support skills.

Key Responsibilities:

Monitoring Strategy Development: Develop and implement comprehensive monitoring strategies for our infrastructure and applications. Lead the design, deployment, and maintenance of monitoring solutions using TrueSight, BMC Helix Operations Management, and Prometheus.

System Administration: Administer and maintain monitoring tools to ensure optimal performance and availability. Ensure coverage of all critical systems and applications with appropriate alerting thresholds.

Performance Analysis and Optimization: Enhance AIOps within the ecosystem for better observability. Optimize monitoring configurations to reduce false positives and improve the accuracy of alerts.

Leadership and Collaboration: Lead a team of monitoring administrators, providing guidance, mentorship, and training. Collaborate with cross-functional teams to integrate monitoring solutions with existing IT and development workflows.

Documentation and Reporting: Develop and maintain comprehensive documentation for monitoring configurations, processes, and procedures. Generate and present regular reports on system performance, incidents, and resolutions to senior management.

Required Skills and Qualifications:

Technical Expertise: Extensive experience with TrueSight, BMC Helix Operations Management, and Prometheus. Strong knowledge of monitoring principles, practices, and tools. Proficiency in a scripting language such as Python. Knowledge of any relational database.

Experience: Minimum of 8 years of experience in infrastructure monitoring and administration. Proven experience in leading and managing monitoring teams. Familiarity with log monitoring and log analytics solutions such as ClickHouse and Kibana. Experience implementing network monitoring with any NPM tool.

Analytical Skills: Strong analytical and problem-solving skills with the ability to diagnose and resolve complex technical issues. Ability to interpret and analyse system performance data to drive continuous improvement.

Communication Skills: Excellent verbal and written communication skills. Ability to convey complex technical information to both technical and non-technical stakeholders.

Nice to Have:

Hands-on experience with cloud environments (AWS, OCI, GCP) and containerization technologies (Docker, Kubernetes) is a plus. Relevant certifications such as BMC Certified Professional, Certified Kubernetes Administrator (CKA), or similar are preferred. Experience with PromQL.
Location: United States