Lead Site Reliability Engineer

Dmitry Stepanov

14 years architecting and operating planetary-scale distributed systems — from 10M metrics/sec observability platforms to zero-toil automation.

Roseville, CA 95747

About

Who I am

I am a Technical Lead with 20+ years in IT and 14 years of California-based experience spanning QA Automation, DevOps, and Site Reliability Engineering. My career runs from 5.25″ floppies and Windows 3.1 to planetary-scale Kubernetes clusters, with ownership of larger, more complex systems at each stage. Currently at Apple, I lead the observability platform supporting 250+ internal engineering teams across 8 major business organisations, including Maps and Retail.

I thrive at the intersection of reliability and automation. Whether engineering a 6,000-node federated TSDB cluster or reducing onboarding toil from 120 to 20 minutes, I focus on building systems that scale gracefully and fail safely — and on creating the culture that keeps them that way.

CKA · CKS · CKAD · LFCS · Terraform Associate · 5× AWS Certified · Certified ScrumMaster

Experience

Where I've worked

Lead Site Reliability Engineer
Apple via Infosys
10/2020 — Present · Roseville, CA
  • Planetary Scale Monitoring: Orchestrated one of the world's largest monitoring ecosystems (Epic/Mosaic), managing 10M+ data points/sec for 800,000+ global services.
  • Observability Enablement: Led infrastructure strategy providing real-time visibility for 250+ internal teams across 8 business orgs, maintaining Tier-1 status through network partitions.
  • Synthetic Monitoring: Designed and built a Python/Docker-based suite for 700+ global API/UI endpoints, delivering high-resolution SLI/SLO metrics to executive leadership.
  • Critical Risk Mitigation: After a major K8s outage, engineered an automated x509 Prometheus monitoring solution, replacing 2-hour emergency bridge calls with proactive alerts 30 days ahead of certificate expiry.
  • Toil Elimination: Developed a Bash/Python automation framework for instance lifecycle management, cutting manual onboarding from 120 to 20 minutes and saving hundreds of engineering hours annually.
  • Team Leadership: Directed a distributed team of 6 SREs (onsite/offshore), maintaining 99.9% SLO for high-load telemetry platforms.
DevOps Engineer
Acoustic (IBM Divestiture)
04/2020 — 09/2020
  • Core engineer for the high-stakes migration of Tealeaf SaaS from SoftLayer to AWS EKS following IBM's divestiture of the Watson Customer Engagement unit.
  • Leveraged Terraform to provision secure, scalable EKS environments, ensuring 100% business continuity through the transition.
Advisory Software Engineer — DevOps
IBM
09/2015 — 04/2020
  • Built a global CI/CD pipeline from scratch using UrbanCode Deploy, improving deployment speed by 300% for the Tealeaf SaaS product.
  • Modified Slack plugins using Groovy to provide real-time deployment orchestration and automated feedback loops for engineering teams.
  • Managed IBM Globalization pipeline via Watson API to automate UI localisation for international markets.
Staff Software Engineer — QA / Performance
IBM
02/2014 — 09/2015
  • Directed load and stress testing for Digital Analytics SaaS using JMeter, identifying critical bottlenecks before peak traffic events.
  • Served as Feature Lead for DDX, coordinating cross-functional sprints from kickoff to production delivery.
Software Engineer / QA
Quisk / Zultys
11/2011 — 02/2014
  • Delivered Java-based test automation (JUnit, Selenium), API security testing, and mobile platform (iOS/Android) stability testing in the fintech and VoIP sectors.

Skills

Technical stack

Cloud & Infrastructure
AWS (5×) · GCP · IBM Cloud · Bare-Metal · RHEL · Ubuntu
Orchestration & IaC
Kubernetes · Docker · Helm · Terraform · Ansible · Puppet
Observability
Prometheus · Grafana · Thanos · Splunk · ELK · x509-exporter · Prober
Languages
Python · Bash · Java · Groovy · Ruby
CI/CD & Delivery
Jenkins (CloudBees) · GitHub Actions · UrbanCode Deploy · Spinnaker · Artifactory
SRE Practices
SLI/SLO · Error Budgets · Incident Response · Chaos Engineering · Toil Reduction

Education

Background

BS in Engineering & Economics
Belarusian National Technical University