Engineering Manager · TIDAL (Block Inc.)
SRE Operating Model Rollout
Reduced MTTR by 50% and lifted reliability scores by 80% across 14 product teams.
Context
TIDAL’s music streaming footprint spanned mobile, living room, and partner integrations, but SRE responsibilities were fragmented across 14 squads. Only a handful of services had meaningful SLOs, incident reviews, or ownership metadata, which made it difficult to prioritise resilience alongside feature work.
My Role
As the Engineering Manager for TIDAL’s SRE group, I was accountable for the roadmap, stakeholder alignment, and the coaching plan that helped every squad adopt SRE practices. I facilitated executive readouts on reliability metrics, unblocked budget for tooling, and paired with tech leads while they defined SLOs, runbooks, and automation guardrails.
What I Led
- Partnered with Engineering and Product leadership to tie SLOs, SLIs, runbooks, and service-tier classifications to business objectives and OKRs.
- Ran enablement workshops that walked teams through defining customer journeys, calculating error budgets, and building actionable dashboards.
- Embedded with squads to co-author runbooks and automate incident escalation paths through the IDP, so documentation lived where engineers already worked.
- Introduced post-incident coaching sessions plus DORA metrics reporting to keep continuous improvement visible to executives.
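For illustration, the error-budget arithmetic covered in the workshops can be sketched roughly as below. This is a minimal Python sketch, not TIDAL's actual tooling: the function names, the 30-day window, and the 99.9% target are assumptions chosen for the example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent for a request-based SLI.

    A negative result means the budget has been overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures
```

Under these assumptions, a 99.9% target over 30 days allows roughly 43.2 minutes of downtime, and 250 failed requests out of 1,000,000 leaves about 75% of the budget.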
Results
- All 14 product teams adopted the model; incident MTTR dropped by half while platform reliability scores rose 80%.
- Service-tier clarity aligned investment with risk, enabling deliberate trade-offs during roadmap planning.
- Runbooks and automated incident flows are now part of the standard definition of done, giving new teams a proven playbook from day one.
Technologies & Tools
Datadog
PagerDuty
Docker
AWS ECS
Terraform
Python
Go
AWS Secrets Manager
AWS
DORA Metrics
Error Budgets
SLO/SLI