GOES → Hybrid Cloud — A Customer Architect's View

Elastic · Customer Architect Panel Interview · 01 / 16

Hybrid Cloud Transformation · E-Commerce & Streaming

Search, solve, and succeed — at the scale of a global e-commerce and streaming business. The architecture, the migration, and a live Elastic demo running right now on a separate host.

Candidate: Anthony G. Tellez

Client: GOES, global e-commerce + streaming

Panel: App Dev · EA · VP Eng · Business

Role: Customer Architect, Elastic

Agenda · 02 / 16

90-minute narrative. Two parts. Four audiences.

Part 1 is the cloud-transformation story. Part 2 is a live demo of the Elastic Stack against a working Apache + APM pipeline that I have running on a DigitalOcean droplet right now.

Time	Topic	Primarily for
0–3 min	Who I am, and why I've seen this problem from the other side	All panel
3–8 min	GOES today, and what can't stay the way it is	All panel
8–18 min	Target architecture · cloud provider · why AWS	Enterprise Architect
18–28 min	Migration plan · 4 phases · risks & mitigations	VP Engineering
28–35 min	Cost · security · scale · data: the four dimensions	Business · CISO
35–78 min	Live demo on Elastic Cloud: ingest, search, APM, alerts	All panel
78–90 min	Stakeholder outcomes · 90-day plan · Q&A	All panel

App Stack Developer

CI/CD velocity, local dev parity, observability of their own services

Enterprise Architect

Integration patterns, standards, vendor lock-in, data gravity

VP Engineering

Velocity, reliability, cost model, team impact, rollback confidence

Business

Revenue impact, customer experience, regulatory exposure, ROI

About the candidate · 03 / 16

10 years in enterprise pre-sales — 7 of them as Splunk's Global SE Architect.

The through-line: enterprise pre-sales, technical workshops, large-scale transformation advisory, and the patience to sit with a customer's engineers until the product actually works for them.

10+

Years in enterprise pre-sales & solution engineering

7 yrs

Splunk Principal Architect — ML & AI, 2014–2021

$35→$411M

Splunk FY18–FY21 revenue era I was in the room for

1

USPTO patent · ML malware domain classification

Crogl

Principal Forward Deployed Engineer · security automation + AI governance

BNY Mellon

SVP, Data & Intelligence Engineering · LLM for regulatory triage · SuriCon 2024 keynote

Palo Alto Networks

Cloud Security Architect, Prisma Cloud · competitive POCs · cloud transformation advisory

BlockFi

Sr. Security Architect, Head of ML · ML platform for 100+ data scientists · $252M suspicious flows

Splunk

Principal Architect, Global SE Architect ML & AI · co-authored ML Toolkit · 7 years, 2014–2021

Why this maps to the exercise: Part 1 is the cloud-transformation advisory I've been running for a decade. Part 2 is hands-on Elastic — the observability story I'd thread through the real migration.

Anthony G. Tellez

Mesa, AZ · anthonygtellez.com

The situation · 04 / 16

GOES runs two self-managed data centers connected by a single VPN

Today GOES serves customers on two continents from physical data centers in North America and Europe. Bi-directional replication over VPN keeps them roughly in sync. The business is running — and the architecture is the reason every improvement takes a quarter.

Migration objectives · 05 / 16

The five things the briefing asks me to address

Every slide that follows maps back to one or more of these. The six attributes listed below are the lens through which every decision has to pass.

#	Objective	Covered in	Primary audience
1	Cloud-native patterns supporting on-prem → cloud transition	§7 Target arch · §8 Phases	App Dev · EA
2	Seamless functioning in a hybrid environment	§8 Phases · §9 Risks	EA · VP Eng
3	Potential risks & mitigations	§9 Risks · dual-run pattern	VP Engineering
4	Opportunities vs. self-managed on-prem	§6 Constraints · §10 Four dimensions	Business
5	Cost · security · scalability · data management	§10 Four dimensions	Business · CISO

Attributes shaping every decision

Size & scale

Multinational · millions of users · substantial daily data

Data privacy & compliance

GDPR in EU — residency enforced, not assumed

Complexity of services

Shopping · streaming · payments in one stack

Peak periods

Black Friday / Christmas — no disruption tolerated

Business continuity

HA & resiliency — downtime equals revenue loss

Innovation mindset

Rapid changes & deployments — culture of experimentation

Current-state constraints · 06 / 16

Three constraints the architecture imposes today

Reading the current stack against the briefing's attributes (scale, peak, GDPR, BC/DR, innovation), three constraints surface immediately. Every architectural decision in Part 1 traces back to one of them.

Capacity locked to worst case

Hardware is provisioned for the Black Friday / Christmas peak and runs idle the rest of the year. Capex is committed against the annual spike, not the steady-state load.

Ties directly to the peak-period and BC/DR attributes.

Cross-region dependency on the VPN

Bi-directional replication between NA and EU rides a single VPN link. Every cross-region feature — and the regulatory-compliant write path — lives or dies by that one connection.

Ties to scale, GDPR residency, and business continuity.

Change cadence bounded by on-prem

Deploy velocity is capped by the slowest physical-infra change window. The briefing's "innovation mindset" can't be expressed while every release clears a change-advisory board.

Ties to the innovation and complexity-of-services attributes.

"Significant spikes in traffic … downtime or service disruption could result in substantial revenue losses … the architecture needs to support rapid changes and deployments." — from the briefing's "Considerations and attributes"

Target architecture · 07 / 16

Target state: hybrid AWS across two regions with Aurora Global and Direct Connect

One end-state diagram. On-prem is retained as warm DR through the transition. Three years from now, the only thing still running in the DC is a rack of Direct Connect gear.

Why AWS — verified service availability

All required services GA in both regions: Aurora Global, EKS, CloudFront, MediaConvert, MediaLive, DynamoDB Global Tables, ElastiCache, Direct Connect. Elastic Cloud available via AWS Marketplace in both.

Compliance + AZ resilience

PCI DSS Level 1 + GDPR. us-east-1 has 6 AZs (industry max) — multi-AZ Aurora + EKS spread across 3+ AZs per region. eu-west-1 in Ireland with GDPR-scoped data residency.

Cost model

Savings Plans for baseline, Spot for peak burst. Cost scales with revenue, not with the calendar. Data flows are designed for locality first, so cross-region and internet egress charges stay low.

Alternatives considered

Azure if GOES is Microsoft-licensed. GCP if ML workloads dominate. The application code is portable either way — containers and open standards mean switching cloud providers doesn't require rewriting the app.

Migration plan · 08 / 16

36-week migration: a foundations phase (Phase 0) plus four migration phases, with a peak-season freeze

Phase 0 stands up observability before anything moves — you cannot prove "the cloud version is better" if you never measured the on-prem baseline.
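One way to make the baseline argument concrete is to compare a tail-latency percentile measured on-prem against the same percentile on the cloud path. The sketch below is illustrative only: the function names, the choice of p95, and the 5% tolerance are assumptions, not figures from the plan.

```python
from statistics import quantiles


def p95(samples_ms):
    """95th percentile of latency samples in milliseconds (needs >= 2 samples)."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return quantiles(samples_ms, n=20)[-1]


def cloud_is_no_worse(baseline_ms, candidate_ms, tolerance=0.05):
    """True if the candidate p95 is within `tolerance` of the on-prem baseline p95."""
    return p95(candidate_ms) <= p95(baseline_ms) * (1 + tolerance)
```

Without the Phase 0 baseline samples there is nothing to pass as `baseline_ms`, which is the point of measuring before anything moves.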

Phase	Weeks	Focus
Phase 0	0–4	Foundations
Phase 1	4–10	Stateless lift · EKS + CloudFront
Phase 2	10–20	Data path · Aurora Global + DMS + CDC
FREEZE	Nov – Jan	No cutovers
Phase 3	20–28	Media · MediaConvert + S3 origins
Phase 4	28–36	Steady state · FinOps · DR drills
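The Nov – Jan freeze can be enforced mechanically as a phase-gate check in the deployment pipeline. A minimal sketch, assuming the freeze covers whole calendar months (the plan does not state exact day boundaries) and using a hypothetical function name:

```python
from datetime import date

# Assumption: the freeze spans all of November, December, and January.
FREEZE_MONTHS = {11, 12, 1}


def cutover_allowed(day: date) -> bool:
    """Return False for any date that falls inside the peak-season freeze."""
    return day.month not in FREEZE_MONTHS
```

A pipeline step could call `cutover_allowed(date.today())` and refuse to run a cutover job when it returns False.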

Phase 1 · weeks 4–10

Stateless lift

Containerize web + API. EKS in NA. Reads hit on-prem DBs over Direct Connect. Rollback is instant.

Phase 2 · weeks 10–20

Data path

Aurora Global. DMS + reverse CDC so every write is rollback-able for 72h.

Phase 3 · weeks 20–28

Media & streaming

MediaConvert + MediaLive + MediaPackage. S3 origins per region. CloudFront at the edge.

Phase 4 · weeks 28–36

Steady state

FinOps commits. Quarterly DR drills. On-prem becomes warm DR only. SLOs replace raw uptime.

Risks & mitigations · 09 / 16

Top risks and how each phase stays rollback-able

The top 6 from a 10-row register. Every risk has a mitigation and an early-warning signal. The dual-run pattern described below is how we keep the exit door open in every phase.

Risk	Likelihood	Impact	Mitigation
Peak-season migration collision	High	High	Hard Nov–Jan freeze. Phase gates enforce it.
GDPR residency violation	Med	Critical	Policy-as-code SCPs block forbidden flows.
Cost blow-up from lift-and-shift	High	Med	Re-platform. FinOps guardrails from day 0.
Cutover rollback impossibility	Med	High	Dual-run with weighted DNS. Reverse CDC 72h.
PCI scope expansion	Med	High	Dedicated account/VPC. Tokenize at edge.
Streaming latency regression (EU)	Med	High	Keep EU origin through transition. Load-test first.
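For the GDPR row, one building block of "policy-as-code SCPs" is a service control policy that denies API calls targeting any region outside an allow-list, using the `aws:RequestedRegion` condition key. The sketch below only builds the policy document; the function name and the `Sid` are hypothetical, and a production policy would normally exempt global services (IAM, Route 53, CloudFront and similar) via `NotAction`, which is omitted here for brevity.

```python
import json


def residency_scp(allowed_regions):
    """Build an SCP document that denies requests outside `allowed_regions`."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideAllowedRegions",  # hypothetical identifier
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {
                        "aws:RequestedRegion": list(allowed_regions)
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    # Example: an EU-scoped organizational unit pinned to Ireland.
    print(json.dumps(residency_scp(["eu-west-1"]), indent=2))
```

A region restriction of this kind limits where resources can be created; it does not by itself prove that no personal data leaves the EU, so it would sit alongside network and replication controls.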

How we keep the exit door open

Every phase runs the old and new paths in parallel. Traffic shifts gradually via weighted DNS — 10% → 25% → 50% → 90% → 100%. At every gate, SLOs must stay green for 24–48h. If they don't, the weight reverts. The old path never goes away until the new one has proven itself under real production load.
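The gate logic above can be sketched as a small state function. The ladder values come from the text; the decision to step back one rung on a failed gate (rather than dropping straight to 0%) is an assumption, since the plan only says that the weight reverts.

```python
# Percentage of traffic on the new path at each gate; 0 means fully on the old path.
LADDER = [0, 10, 25, 50, 90, 100]


def next_weight(current: int, slo_green: bool) -> int:
    """Advance one rung if SLOs stayed green for the soak period, else step back one."""
    i = LADDER.index(current)  # raises ValueError for a weight not on the ladder
    if slo_green:
        return LADDER[min(i + 1, len(LADDER) - 1)]
    return LADDER[max(i - 1, 0)]
```

For example, `next_weight(25, True)` moves to 50, while `next_weight(50, False)` falls back to 25. The 24–48h soak is the caller's responsibility: `slo_green` should only be evaluated after that window has elapsed.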

PHASE 1 · STATELESS

Web + API to EKS. On-prem stays live. If latency regresses at any weight, Route 53 flips all traffic back to on-prem in <60s. No data migration involved — safest phase to start with.
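A weighted split in Route 53 is expressed as two records with the same name and different `SetIdentifier` and `Weight` values. The sketch below builds the `ChangeBatch` payload in the shape that boto3's `change_resource_record_sets` accepts, without calling AWS. The hostnames, set identifiers, and the 30-second TTL are hypothetical; how quickly a flip actually takes effect also depends on resolver caching.

```python
def weighted_change_batch(name, onprem_target, cloud_target, cloud_pct, ttl=30):
    """Build a Route 53 ChangeBatch that sends `cloud_pct` percent of traffic to the cloud."""
    if not 0 <= cloud_pct <= 100:
        raise ValueError("cloud_pct must be between 0 and 100")

    def record(set_id, target, weight):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "CNAME",
                "SetIdentifier": set_id,  # distinguishes records sharing a name
                "Weight": weight,         # relative weight; Route 53 allows 0-255
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
            },
        }

    return {
        "Changes": [
            record("onprem", onprem_target, 100 - cloud_pct),
            record("cloud", cloud_target, cloud_pct),
        ]
    }
```

Calling it with `cloud_pct=0` produces the rollback state: the cloud record stays defined but receives no traffic, so re-shifting later is another single change.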

PHASE 2 · DATA

DB writes move to Aurora Global. Reverse CDC keeps the on-prem DB in sync for 72h after cutover — if anything breaks, we promote on-prem back to authoritative and zero writes are lost.
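The 72-hour reverse-CDC window implies a simple runbook check: promoting on-prem back to authoritative is only a clean option while reverse replication is still running. A minimal sketch, with a hypothetical function name and the assumption that the window starts at the cutover timestamp:

```python
from datetime import datetime, timedelta, timezone

ROLLBACK_WINDOW = timedelta(hours=72)  # from the plan: reverse CDC runs 72h after cutover


def rollback_available(cutover_at: datetime, now: datetime) -> bool:
    """True while `now` is inside the reverse-CDC window that began at cutover."""
    return cutover_at <= now < cutover_at + ROLLBACK_WINDOW


if __name__ == "__main__":
    cutover = datetime(2025, 3, 1, 9, 0, tzinfo=timezone.utc)  # illustrative timestamp
    print(rollback_available(cutover, cutover + timedelta(hours=48)))
```

Both arguments should be timezone-aware (or both naive) so that the comparison is well defined.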

PHASE 3 · MEDIA

Streaming pipeline to S3 + CloudFront. CDN origin group keeps the legacy origin as a fallback — one config change reverts all viewers to the old stream path.

The takeaway: at no point during this 36-week migration is GOES in a state where failure means downtime. Every cutover is a reversible experiment, not a one-way door.