Summary
Principal-level platform architect and reliability leader with 17+ years designing, building, and operating enterprise systems at scale. Leads platform and production reliability engineering for a portfolio of 600+ payments and risk applications processing 8K+ transactions per second, and architected an enterprise AI-powered operations platform that unifies application lifecycle management, SRE automation, and intelligent operations. Sets technical vision and engineering strategy, partners across 15+ teams and 5–7 organizations (100+ engineers), drives org-wide GenAI and reliability enablement, and turns ambiguous, high-stakes problems into secure, observable, highly available systems. Deep, hands-on expertise across agentic AI and LLM orchestration, cloud-native platform engineering, policy-as-code governance, security architecture, and Site Reliability Engineering.
Leadership & Strategy
- Set multi-year technical vision and platform strategy across a 600+ application payments & risk portfolio, aligning leadership across 15+ teams and 5–7 organizations.
- Lead and influence 100+ engineers across cross-functional teams from concept to production; mentor senior/staff engineers and set org-wide architecture, delivery, and reliability standards.
- Scaled and matured the SRE practice — incident command, SLOs, and error budgets — shifting culture to data-driven reliability.
- Launched an org-wide GenAI & coding enablement program, building engineering capability while preserving production-first controls and AI governance.
- Trusted incident commander for mission-critical, globally distributed payment systems; partner to executives on cost, risk, and delivery (DORA).
Core Expertise
Key Projects & Initiatives
Architected and led a 10+ service, event-driven platform (control plane, AI/intelligence gateway, context & knowledge service, policy/governance service, execution/runtime, observability/signal, MCP tool server, security guard, and experience layers). Designed an agentic AI layer of 35+ AI agents — 5+ domain agents, 10+ workflow agents, plus orchestration, investigation, and reasoning agents — coordinated by a LangGraph incident-resolution orchestrator and a ReAct reasoning loop, over a 150+ tool MCP integration layer across 25+ domains, a RAG context/knowledge service, and OPA policy-as-code governance with auditable decision trails.
Impact: cut incident MTTR 82% (45→8 min), removed 2,000+ engineer-hours of monthly toil, consolidated 30+ tools, drove $38M+ in annual savings, and sustained six-nines availability — all with zero-trust identity and strict multi-tenancy.
Defined the vision, curriculum, and ways-of-working for an organization-wide enablement program that builds coding competency and internal-tooling ownership across the reliability-engineering org, while preserving production-first separation-of-duties and AI governance. Established GitHub-based collaboration, reusable patterns, and office hours for adoption.
Impact: introduced GenAI as a responsible force multiplier for analysis, documentation, and automation — improving speed, consistency, and quality of operational work and seeding a culture of engineering-led reliability.
Designed and built an autonomous software-delivery pipeline on the Anthropic Agent SDK that coordinates five specialized agents — Architect → two parallel Developers → Bug Hunter → Reviewer — through a multi-phase state machine with parallel fan-out, adversarial review, and human-in-the-loop gates.
Includes full observability and persistence via MySQL and a Redis vector index (RAG over the codebase), a React dashboard for live phase/task tracking, and a Docker-first, multi-language runtime (Python, Ruby, Node). Demonstrates production-grade agentic orchestration patterns end-to-end.
Built a scenario-based model and interactive tool that translates operational metrics into recommended resourcing and staffing across application lifecycle stages (new, growth, mature, legacy, and global footprints). Gives leadership directional guidance and guardrails for capacity and workforce planning across a large application portfolio, with a searchable metrics dictionary and exportable scenarios.
Contributed to security zoning architecture, network segmentation design, and policy review across multi-zone payment networks (perimeter, business, and restricted zones), aligning application connectivity with security-control and compliance requirements. Partnered with security and network teams on segmentation policy and safe connectivity patterns for new and existing services.
Impact: reduced lateral-movement risk and accelerated compliant onboarding of new services into segmented production zones.
Level-3 engineering and platform stewardship across a 600+ application payments & risk portfolio processing 8K+ transactions per second — peak-season capacity planning, annual disaster-recovery and datacenter-migration exercises, release/manifest coordination, vulnerability remediation, and deep observability (APM, distributed tracing, logs, synthetic/network monitoring).
Built internal self-service portals, deployment and token management, and automation adopted across the organization; led containerization and orchestration (Docker, Kubernetes, OpenShift), modernizing legacy middleware onto cloud-native platforms.
Designed and implemented automated failover and self-healing for critical applications driven by advanced monitoring, plus network- and load-balancer-level failover across data centers. Built rapid traffic-steering / "kill-switch" controls and led annual disaster-recovery and database-switch exercises.
Impact: enabled zero-impact maintenance and fast, predictable recovery for globally distributed payment systems, materially reducing downtime risk during peak season and incidents.
Designed fleet-wide certificate discovery and inventory with proactive expiry and weak-cryptography alerting routed directly to application owners. Closed a significant operational and security risk gap with near-complete coverage and timely, tracked remediation.
Professional Experience
- Lead platform and production reliability engineering for 600+ payments & risk applications processing 8K+ transactions/second, partnering across 15+ teams and 5–7 organizations (100+ engineers) — owning performance, release, reliability, and security posture.
- Architected and delivered an enterprise AI-powered operations platform (10+ services; 35+ AI agents, 150+ tool MCP layer, RAG, OPA policy-as-code), cutting MTTR 82%, removing 2,000+ hrs/month of toil, and driving $38M+ in annual savings.
- Launched an org-wide GenAI & coding enablement program and scaled the SRE practice — SLOs, error budgets, incident command — raising reliability and engineering maturity across the org.
- Built internal self-service platforms, deployment/token management, and automation; led containerization and orchestration (Docker, Kubernetes, OpenShift) and legacy middleware modernization.
- Owned peak-season capacity planning, annual disaster-recovery and datacenter-migration exercises, release/manifest coordination, and vulnerability remediation; contributed to security zoning & network segmentation architecture.
- Provided 24/7 incident command and reliability leadership for mission-critical, globally distributed payment systems.
- Automated deployment, monitoring, and backup across large-scale infrastructure; led migration of legacy systems toward cloud- and SRE-ready platforms.
- Implemented performance monitoring and capacity planning for resilient, high-availability operations.
- Automated provisioning, patching, and backup for mission-critical applications; optimized high-availability clusters and disaster-recovery posture.
- Drove early SRE initiatives: incident response and proactive system-health monitoring.
- Built and maintained research and departmental web platforms; supported campus infrastructure across storage, networking, and security.
Certifications
- Management Essentials — Harvard Business School
- Leadership & Management — Harvard Business School
- Disruptive Strategy — Harvard Business School
- VMware Certified Professional (VCP)
- Brainbench Certified Unix Administrator
Education
- M.S., Computer Science
Texas A&M University, Commerce, TX
- B.E., Engineering
JNT University, India