SRE Operating Model Rollout

Context

TIDAL’s music streaming footprint spanned mobile, living room, and partner integrations, but SRE responsibilities were fragmented across 14 squads. Only a handful of services had meaningful SLOs, incident reviews, or ownership metadata, which made it difficult to prioritise resilience alongside feature work.

My Role

As the Engineering Manager for TIDAL’s SRE group, I was accountable for the roadmap, stakeholder alignment, and the coaching plan that helped every squad adopt SRE practices. I ran executive readouts on reliability metrics, secured budget for tooling, and paired with tech leads as they defined SLOs, runbooks, and automation guardrails.

What I Led

Partnered with Engineering and Product leadership to tie SLOs, SLIs, runbooks, and service-tier classifications to business objectives and OKRs.
Ran enablement workshops that walked teams through defining customer journeys, calculating error budgets, and building actionable dashboards.
Embedded with squads to co-author runbooks and automate incident escalation paths through the internal developer platform (IDP), so documentation lived where engineers already worked.
Introduced post-incident coaching sessions plus DORA metrics reporting to keep continuous improvement visible to executives.
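
As an illustration of the error-budget arithmetic covered in the enablement workshops, here is a minimal Python sketch. The class name, its fields, and the request-based availability SLI are assumptions made for this example; it is not the tooling the squads actually used.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical helper: error budget for a request-based availability SLO."""

    slo_target: float     # e.g. 0.999 for a 99.9% availability SLO
    total_requests: int   # requests observed in the SLO window
    failed_requests: int  # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched; a negative value means it is overspent.
        allowed = self.allowed_failures
        if allowed == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1.0 - self.failed_requests / allowed


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"allowed failures: {budget.allowed_failures:.0f}")
    print(f"budget remaining: {budget.remaining_fraction:.0%}")
```

With a 99.9% target over one million requests, roughly 1,000 failures are allowed, so 250 failures leave about three quarters of the budget. A squad could use a figure like this to decide whether to ship features or pause for reliability work.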

Results

All 14 product teams adopted the model; incident MTTR dropped by half while platform reliability scores rose 80%.
Service-tier clarity aligned investment with risk, enabling deliberate trade-offs during roadmap planning.
Runbooks and automated incident flows are now part of the standard definition of done, giving new teams a proven playbook from day one.

Technologies & Tools

Datadog, PagerDuty, Docker, AWS ECS, Terraform, Python, Go, AWS Secrets Manager, AWS, DORA Metrics, Error Budgets, SLO/SLI