Lead System Reliability Engineer
|Qualifications:||3+ years of experience in IT operations, IT monitoring, or DevOps.|
Experience as a Team Lead.
Knowledge of and familiarity with alerting and monitoring tools and system management tools (such as, but not limited to, Grafana and Prometheus).
Knowledge of and familiarity with log collection and analysis systems such as Splunk and ELK.
Experience monitoring virtual and on-premises infrastructure.
Ability to handle a high volume of interactions and troubleshoot highly complex error conditions.
Desire to work in a global company with an HL distributed product.
Development of the SRE team
|Tasks:||Develop and manage multiple teams of Reliability Engineers, including capacity planning and scheduling.|
Lead the development of department culture, processes, procedures, and technologies.
Lead global initiatives and develop a reliability strategy to achieve excellence in system availability and the product lifecycle management process.
Contribute to the development of all global projects and product features before they go live through system design consulting, capacity planning, monitoring development, and launch reviews.
Lead lifecycle management for services once they are live by measuring and monitoring availability, latency, and overall system health across our cloud-hosted infrastructure.
Help scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
Design and build systems to provide real-time operational insight for development and management teams.
Partner with development and engineering teams and their leadership to promote best practices and advise on how to implement features that are instrumented and observable.
Generate, manage, and report the application performance data captured by the monitoring tools, and proactively work with DevOps teams to resolve performance issues.