The Centre of Expertise (CoE) for Site Reliability Engineering (SRE) supports the organisation’s strategy by enabling SRE capabilities with a continuous focus on system health, reliability, availability, capacity, performance, continuity, and management of IT services.

This is an excellent opportunity for a Platform Operations Lead to join our SRE CoE team in this newly created position, reporting to the Site Reliability Engineering Lead. You’ll play a critical role in managing and maintaining our organisation’s observability and incident response technology infrastructure and services. You’ll oversee the operation and performance of our Observability Platforms (Splunk, Grafana stack, PagerDuty) to ensure reliability, scalability, and security.

This role combines technical expertise, leadership, and strategic thinking to drive operational excellence. It requires collaboration with a large set of stakeholders across infrastructure, security, platform engineering and SRE to ensure applications run smoothly and are scalable. It is a hands-on, multi-skilled role that touches application lifecycle management, technical design, technical testing and infrastructure.

What you’ll do…
- Lead the operation, maintenance, and optimisation of our current and future Observability Platforms to ensure 24/7 availability and reliability.
- Lead incident response efforts, including root cause analysis, resolution tracking and post-incident reviews.
- Develop, track and report on SLIs/SLOs and key performance indicators for the Observability Platforms.
- Mentor and lead a team of Platform Operations DevOps/Engineers.
- Collaborate with security teams to enforce best practices, maintain platform security, and address vulnerabilities.
- Plan and manage a backlog of support work, including but not limited to incident response, defects, vulnerabilities, and security/risk-related documentation.
- Solid knowledge of running and maintaining Splunk, Prometheus, and Grafana applications and infrastructure.
- Demonstrated experience running a small operations function within a broader technology organisation.
- Good working knowledge of container technologies (Docker, OpenShift, Kubernetes).
- Good working knowledge of virtualisation technologies (VMware environments).
- Good working knowledge of Azure Cloud technologies (Azure DevOps, Azure Cloud, etc.).
- Experience in one or more scripting languages (PowerShell, Python, Bash).
- Discounted ING Health Insurance.
- An additional Rest Day to support your wellbeing.
- This role is NSW based. ING’s approach to flexible working (FlexING) means you can work 50/50 in office (Sydney) and from home.