Dmitry
Stepanov
14 years architecting and operating planetary-scale distributed systems — from 10M metrics/sec observability platforms to zero-toil automation.
Who I am
Technical Lead with 20+ years in IT, including 14 years of California-based experience spanning QA Automation, DevOps, and Site Reliability Engineering. My career runs from 5.25″ floppies and Windows 3.1 to planetary-scale Kubernetes clusters, with ownership of larger, more complex systems at each stage. Currently at Apple, I lead the observability platform supporting 250+ internal engineering teams across 8 major business organisations, including Maps and Retail.
I thrive at the intersection of reliability and automation. Whether engineering a 6,000-node federated TSDB cluster or reducing onboarding toil from 120 to 20 minutes, I focus on building systems that scale gracefully and fail safely — and on creating the culture that keeps them that way.
Where I've worked
- Planetary Scale Monitoring: Orchestrated one of the world's largest monitoring ecosystems (Epic/Mosaic), managing 10M+ data points/sec for 800,000+ global services.
- Observability Enablement: Led infrastructure strategy providing real-time visibility for 250+ internal teams across 8 business orgs, maintaining Tier-1 status through network partitions.
- Synthetic Monitoring: Designed and built a Python/Docker-based suite for 700+ global API/UI endpoints, delivering high-resolution SLI/SLO metrics to executive leadership.
- Critical Risk Mitigation: After a major K8s outage, engineered automated x509 certificate-expiry monitoring in Prometheus, replacing 2-hour emergency bridge calls with proactive alerts 30 days ahead of expiry.
- Toil Elimination: Developed a Bash/Python automation framework for instance lifecycle management, cutting manual onboarding from 120 to 20 minutes and saving hundreds of engineering hours annually.
- Team Leadership: Directed a distributed team of 6 SREs (onsite/offshore), maintaining 99.9% SLO for high-load telemetry platforms.
- Served as core engineer for the high-stakes migration of Tealeaf SaaS from SoftLayer to AWS EKS following IBM's divestiture of the Watson Customer Engagement unit.
- Leveraged Terraform to provision secure, scalable EKS environments, ensuring 100% business continuity through the transition.
- Built a global CI/CD pipeline from scratch using UrbanCode Deploy, improving deployment speed by 300% for the Tealeaf SaaS product.
- Modified Slack plugins using Groovy to provide real-time deployment orchestration and automated feedback loops for engineering teams.
- Managed IBM Globalization pipeline via Watson API to automate UI localisation for international markets.
- Directed load and stress testing for Digital Analytics SaaS using JMeter, identifying critical bottlenecks before peak traffic events.
- Served as Feature Lead for DDX, coordinating cross-functional sprints from kickoff to production delivery.
- Delivered Java-based test automation (JUnit, Selenium), API security testing, and mobile platform (iOS/Android) stability testing in the fintech and VoIP sectors.