Platform Automation & Observability at PlayStation

Mission

PSNow’s cloud gaming service relied on globally distributed hardware with tight availability targets. Scaling both automation and telemetry was essential to keep up with subscriber demand.

My Role

As the Site Reliability Engineer leading both Platform Automation and Observability, I managed the team backlogs, coordinated with hardware ops, and acted as the glue between data-center technicians and the central SRE program. I set the automation priorities, reviewed designs, and personally paired with engineers on the observability pipelines that supported PSNow.

Contributions

Directed the Platform Automation team to script hardware bring-up, reducing time-to-production for new racks from days to hours.
Created tooling that validated firmware, network configs, and security baselines before automatically enrolling servers into capacity pools.
Ran the Observability team, which ingested 300+ GB of metrics per day, and standardised Prometheus/Grafana deployments and alerting contracts.
Built self-service APIs so service owners could provision dashboards, alerts, and synthetic tests without platform intervention.
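
The validate-then-enroll gate described above can be illustrated with a minimal sketch. This is not the production tooling: the `Server` fields, the baseline constants, and the `validate` and `enroll` functions are all hypothetical names chosen for illustration, and real checks would query firmware inventories, switch configs, and security scanners rather than compare in-memory values.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical baseline values, for illustration only.
REQUIRED_FIRMWARE = "2.4.1"
REQUIRED_VLAN = 120


@dataclass
class Server:
    hostname: str
    firmware: str
    vlan: int
    secure_boot: bool


def validate(server: Server) -> list[str]:
    """Return the names of failed checks; an empty list means the server passes."""
    failures = []
    if server.firmware != REQUIRED_FIRMWARE:
        failures.append("firmware")
    if server.vlan != REQUIRED_VLAN:
        failures.append("network")
    if not server.secure_boot:
        failures.append("security")
    return failures


def enroll(servers: list[Server], pool: list[str]) -> dict[str, list[str]]:
    """Append passing hostnames to pool; return failed checks keyed by rejected hostname."""
    rejected = {}
    for server in servers:
        failures = validate(server)
        if failures:
            rejected[server.hostname] = failures
        else:
            pool.append(server.hostname)
    return rejected
```

The point of the design is that enrollment is only reachable through validation, so a server with any failed check never enters a capacity pool and the reason for rejection is recorded per host.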

Business Value

Operational toil dropped by 40%, freeing engineers to focus on higher-leverage improvements.
Observability coverage expanded to every data center, giving central SRE instant visibility and faster incident response.
Automation paved the way for future hybrid-cloud expansions because provisioning workflows were fully codified.

Technologies & Tools

Ansible, Terraform, Packer, Prometheus, Grafana, Alertmanager, ELK, Docker, Kubernetes, Python, Go, AWS, Hybrid Infrastructure, Vault