Elastic · Customer Architect Panel Interview · 01 / 16
Hybrid Cloud Transformation · E-Commerce & Streaming
GOES → Hybrid Cloud
A Customer Architect's
View
Search, solve, and succeed — at the scale of a global e-commerce and streaming business. The architecture, the migration, and a live Elastic demo running right now on a separate host.
Candidate: Anthony G. Tellez
Client: GOES (global e-commerce + streaming)
Panel: App Dev · EA · VP Eng · Business
Role: Customer Architect, Elastic
Agenda · 02 / 16
90-minute narrative. Two parts. Four audiences.
Part 1 is the cloud-transformation story. Part 2 is a live demo of the Elastic Stack against a working Apache + APM pipeline I have running on a DigitalOcean droplet right now.
Time | Topic | Primarily for
0–3 min | Who I am, and why I've seen this problem from the other side | All panel
3–8 min | GOES today, and what can't stay the way it is | All panel
8–18 min | Target architecture · cloud provider · why AWS | Enterprise Architect
18–28 min | Migration plan · 4 phases · risks & mitigations | VP Engineering
28–35 min | Cost · security · scale · data: the four dimensions | Business · CISO
35–78 min | Live demo on Elastic Cloud: ingest, search, APM, alerts | All panel
78–90 min | Stakeholder outcomes · 90-day plan · Q&A | All panel
App Stack Developer
CI/CD velocity, local dev parity, observability of their own services
Enterprise Architect
Integration patterns, standards, vendor lock-in, data gravity
VP Engineering
Velocity, reliability, cost model, team impact, rollback confidence
Business
Revenue impact, customer experience, regulatory exposure, ROI
About the candidate · 03 / 16
10 years in enterprise pre-sales — 7 of them as Splunk's Global SE Architect.
The through-line: enterprise pre-sales, technical workshops, large-scale transformation advisory, and the patience to sit with a customer's engineers until the product actually works for them.
10+
Years in enterprise pre-sales & solution engineering
7 yrs
Splunk Principal Architect — ML & AI, 2014–2021
$35→$411M
Splunk FY18–FY21 revenue era I was in the room for
1
USPTO patent · ML malware domain classification
Crogl
Principal Forward Deployed Engineer · security automation + AI governance
BNY Mellon
SVP, Data & Intelligence Engineering · LLM for regulatory triage · SuriCon 2024 keynote
Palo Alto Networks
Cloud Security Architect, Prisma Cloud · competitive POCs · cloud transformation advisory
BlockFi
Sr. Security Architect, Head of ML · ML platform for 100+ data scientists · $252M suspicious flows
Splunk
Principal Architect, Global SE Architect ML & AI · co-authored ML Toolkit · 7 years, 2014–2021
Why this maps to the exercise: Part 1 is the cloud-transformation advisory I've been running for a decade. Part 2 is hands-on Elastic — the observability story I'd thread through the real migration.
Anthony G. Tellez
Mesa, AZ · anthonygtellez.com
The situation · 04 / 16
GOES runs two self-managed data centers connected by a single VPN
Today GOES serves customers on two continents from physical data centers in North America and Europe. Bi-directional replication over VPN keeps them roughly in sync. The business is running — and the architecture is the reason every improvement takes a quarter.
[Diagram: GOES current state. NA customers and EU customers each reach a self-managed GOES data center (North America, Europe). Each data center runs Firewall → Load Balancer → Web, API and Media tiers, with Cust / Prod / Orders databases (GDPR compliant). The two sites are linked by a bi-directional VPN.]
Migration objectives · 05 / 16
The five things the briefing asks me to address
Every slide that follows maps back to one or more of these. The six attributes on the right are the lens through which every decision has to pass.
# | Objective | Covered in | Primary audience
1 | Cloud-native patterns supporting on-prem → cloud transition | §7 Target arch · §8 Phases | App Dev · EA
2 | Seamless functioning in a hybrid environment | §8 Phases · §9 Risks | EA · VP Eng
3 | Potential risks & mitigations | §9 Risks · dual-run pattern | VP Engineering
4 | Opportunities vs. self-managed on-prem | §6 Constraints · §10 Four dimensions | Business
5 | Cost · security · scalability · data management | §10 Four dimensions | Business · CISO
Attributes shaping every decision
Size & scale
Multinational · millions of users · substantial daily data
Data privacy & compliance
GDPR in EU — residency enforced, not assumed
Complexity of services
Shopping · streaming · payments in one stack
Peak periods
Black Friday / Christmas — no disruption tolerated
Business continuity
HA & resiliency — downtime equals revenue loss
Innovation mindset
Rapid changes & deployments — culture of experimentation
Current-state constraints · 06 / 16
Three constraints the architecture imposes today
Reading the current stack against the briefing's attributes (scale, peak, GDPR, BC/DR, innovation), three constraints surface immediately. Every architectural decision in Part 1 traces back to one of them.
Capacity locked to worst case
Hardware is provisioned for the Black Friday / Christmas peak and runs idle the rest of the year. Capex is committed against the annual spike, not the steady-state load.

Ties directly to the peak-period and BC/DR attributes.
Cross-region dependency on the VPN
Bi-directional replication between NA and EU rides a single VPN link. Every cross-region feature — and the regulatory-compliant write path — lives or dies by that one connection.

Ties to scale, GDPR residency, and business continuity.
Change cadence bounded by on-prem
Deploy velocity is capped by the slowest physical-infra change window. The briefing's "innovation mindset" can't be expressed while every release clears a change-advisory board.

Ties to the innovation and complexity-of-services attributes.
"Significant spikes in traffic … downtime or service disruption could result in substantial revenue losses … the architecture needs to support rapid changes and deployments." — from the briefing's "Considerations and attributes"
Target architecture · 07 / 16
Target state: hybrid AWS across two regions with Aurora Global and Direct Connect
One end-state diagram. On-prem is retained as warm DR through the transition. Three years from now, the only thing still running in the DC is a rack of Direct Connect gear.
[Diagram: GOES target hybrid architecture with AWS service components.
Global edge: Route 53 · CloudFront + WAF · Shield.
AWS us-east-1 (N. Virginia) · PRIMARY · 6 AZs · largest AWS region: ALB · EKS (web · api · search) · Aurora Global (PRIMARY · writes) · DynamoDB (cart) · ElastiCache · S3 (media · logs) · PCI VPC (tokenize).
AWS eu-west-1 (Ireland) · ACTIVE · 3 AZs · GDPR EU region: ALB → EKS · DynamoDB Global Table · Aurora Global (REPLICA · failover) · S3 (EU residency). Aurora replication links the two regions.
Elastic Cloud · observability plane: Observability · Security · Elasticsearch · Kibana · Fleet · APM · logs · metrics · traces · alerts · SIEM. All telemetry → Elastic Cloud.
On-prem NA + EU: retained as warm DR over Direct Connect.]
Why AWS — verified service availability
All required services GA in both regions: Aurora Global, EKS, CloudFront, MediaConvert, MediaLive, DynamoDB Global Tables, ElastiCache, Direct Connect. Elastic Cloud available via AWS Marketplace in both.
Compliance + AZ resilience
PCI DSS Level 1 + GDPR. us-east-1 has 6 AZs (industry max) — multi-AZ Aurora + EKS spread across 3+ AZs per region. eu-west-1 in Ireland with GDPR-scoped data residency.
Cost model
Savings Plans for baseline, Spot for peak burst. Cost scales with revenue, not with the calendar. Design for data locality first to keep egress charges down.
Alternatives considered
Azure if GOES is Microsoft-licensed. GCP if ML workloads dominate. The application code is portable either way — containers and open standards mean switching cloud providers doesn't require rewriting the app.
Migration plan · 08 / 16
36-week migration: a Phase 0 foundation plus four migration phases, with a peak-season freeze
Phase 0 stands up observability before anything moves — you cannot prove "the cloud version is better" if you never measured the on-prem baseline.
Timeline markers: wk 0 · wk 4 · wk 10 · wk 20 · wk 28 · wk 36
Phase 0 · Foundations
Phase 1 · Stateless lift · EKS + CloudFront
Phase 2 · Data path · Aurora Global + DMS + CDC
FREEZE · Nov – Jan · no cutovers
Phase 3 · Media · MediaConvert + S3 origins
Phase 4 · Steady state · FinOps · DR drills
Phase 1 · weeks 4–10
Stateless lift
Containerize web + API. EKS in NA. Reads hit on-prem DBs over Direct Connect. Rollback is a Route 53 weight change back to on-prem, targeted at under 60 s.
Phase 2 · weeks 10–20
Data path
Aurora Global. DMS + reverse CDC so every write is rollback-able for 72h.
Phase 3 · weeks 20–28
Media & streaming
MediaConvert + MediaLive + MediaPackage. S3 origins per region. CloudFront at the edge.
Phase 4 · weeks 28–36
Steady state
FinOps commits. Quarterly DR drills. On-prem becomes warm DR only. SLOs replace raw uptime.
Risks & mitigations · 09 / 16
Top risks and how each phase stays rollback-able
The top 6 from a 10-row register. Every risk has a mitigation and an early-warning signal. The dual-run pattern on the right is how we keep the exit door open in every phase.
Risk | Likelihood | Impact | Mitigation
Peak-season migration collision | High | High | Hard Nov–Jan freeze. Phase gates enforce it.
GDPR residency violation | Med | Critical | Policy-as-code SCPs block forbidden flows.
Cost blow-up from lift-and-shift | High | Med | Re-platform. FinOps guardrails from day 0.
Cutover rollback impossibility | Med | High | Dual-run with weighted DNS. Reverse CDC 72h.
PCI scope expansion | Med | High | Dedicated account/VPC. Tokenize at edge.
Streaming latency regression (EU) | Med | High | Keep EU origin through transition. Load-test first.
How we keep the exit door open
Every phase runs the old and new paths in parallel. Traffic shifts gradually via weighted DNS — 10% → 25% → 50% → 90% → 100%. At every gate, SLOs must stay green for 24–48h. If they don't, the weight reverts. The old path never goes away until the new one has proven itself under real production load.
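A minimal sketch of the gate logic described above, in Python. The ramp values come from the slide; everything else is an assumption for illustration: the function names are hypothetical, "SLOs stayed green for the soak window" is reduced to a single boolean per gate, and a failed gate is modelled as reverting all traffic to the old path (weight 0). It does not call Route 53; it only shows the decision rule.

```python
# Traffic percentages on the new path, as listed on the slide.
RAMP = [10, 25, 50, 90, 100]


def next_weight(current: int, slo_green: bool) -> int:
    """Return the new-path weight after one gate.

    Assumption: a red gate sends all traffic back to the old path (0).
    """
    if not slo_green:
        return 0
    for step in RAMP:
        if step > current:
            return step
    return RAMP[-1]  # already at 100%


def run_ramp(gate_results: list[bool]) -> list[int]:
    """Replay a sequence of gate outcomes and return the weight history."""
    weight = RAMP[0]          # the first shift happens before any gate
    history = [weight]
    for green in gate_results:
        weight = next_weight(weight, green)
        history.append(weight)
        if weight in (0, RAMP[-1]):
            break             # rolled back, or fully cut over
    return history


if __name__ == "__main__":
    print(run_ramp([True, True, True, True]))  # full ramp
    print(run_ramp([True, False]))             # rollback at the 25% gate
```

The point of keeping the rule this small is that every gate has exactly two outcomes, advance or revert, so the rollback path gets exercised by the same code that performs the cutover.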
PHASE 1 · STATELESS
Web + API to EKS. On-prem stays live. If latency regresses at any weight, Route 53 flips all traffic back to on-prem in <60s. No data migration involved — safest phase to start with.
PHASE 2 · DATA
DB writes move to Aurora Global. Reverse CDC keeps the on-prem DB in sync for 72h after cutover — if anything breaks, we promote on-prem back to authoritative and zero writes are lost.
PHASE 3 · MEDIA
Streaming pipeline to S3 + CloudFront. CDN origin group keeps the legacy origin as a fallback — one config change reverts all viewers to the old stream path.
The takeaway: at no point during this 36-week migration is GOES in a state where failure means downtime. Every cutover is a reversible experiment, not a one-way door.
Four dimensions · 10 / 16
Cost · Security · Scalability · Data management
Four axes the briefing calls out by name. One slide each would be an hour-long presentation — here's the ~2-minute version.
Cost
From calendar-bound to revenue-bound
Shift capex → opex. Savings Plans for the 70% baseline, Spot + on-demand for peak burst. FinOps tagging from day 0 so every cost lands in a cost-center. Weekly burn review; no quarterly surprises.

Data egress is the silent killer — design for locality first, replicate second.
Security
Identity-first, not network-first
SSO via IAM Identity Center. No long-lived keys anywhere in production after Phase 0. Zero-trust between services, not just at the edge. Secrets Manager / Parameter Store.

SIEM unifies AWS-native + application logs — one query surface. Compliance is enforced by policy-as-code, not quarterly audit.
Scalability
Horizontal first. Vertical as fallback.
Scale on leading indicators (queue depth, request rate), not lagging (CPU). Event-driven for long-tail workloads. Circuit breakers + bulkheads between services.

Chaos drills quarterly — validate elasticity before peak season, not during.
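To make "scale on leading indicators" concrete, here is a hedged sketch of a queue-depth scaling rule in Python. The function name, the per-replica target, and the replica bounds are hypothetical values chosen for illustration, not GOES numbers; real autoscalers add smoothing and cooldowns on top of a rule like this.

```python
import math


def desired_replicas(queue_depth: int,
                     target_per_replica: int,
                     min_replicas: int = 2,
                     max_replicas: int = 50) -> int:
    """Size the service from backlog (a leading indicator) instead of CPU.

    queue_depth:        messages currently waiting
    target_per_replica: backlog one replica is expected to absorb (assumed)
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_depth / target_per_replica)
    # Clamp so an empty queue never scales to zero and a spike never
    # exceeds the capacity budget.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    for depth in (0, 1200, 100_000):
        print(depth, "->", desired_replicas(depth, target_per_replica=100))
```

Backlog rises before CPU does, so a rule keyed on it adds capacity while requests are still queuing rather than after latency has already degraded.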
Data management
Classify first. Residency by policy, not geography.
Data classification at write time: public / internal / pii / pci. SCPs block cross-region replication for tagged data.

Backup + restore SLAs tested quarterly, not assumed. S3 lifecycle policies tier data down aggressively — your coldest data should cost you pennies.
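The "residency by policy" rule can be sketched as a pure function. This is an illustration in Python, not an SCP: the four classification labels are from the slide, but treating pii and pci as the restricted classes, denying unclassified data by default, and the function name are all assumptions.

```python
# Labels from the slide; which ones are residency-restricted is assumed here.
KNOWN_CLASSES = frozenset({"public", "internal", "pii", "pci"})
RESTRICTED_CLASSES = frozenset({"pii", "pci"})


def replication_allowed(classification: str,
                        source_region: str,
                        dest_region: str) -> bool:
    """Decide whether an object may be replicated to dest_region."""
    if classification not in KNOWN_CLASSES:
        # Unclassified data is denied by default (assumption).
        return False
    if classification in RESTRICTED_CLASSES:
        # Restricted data may only be copied within its home region.
        return source_region == dest_region
    return True


if __name__ == "__main__":
    print(replication_allowed("public", "eu-west-1", "us-east-1"))
    print(replication_allowed("pii", "eu-west-1", "us-east-1"))
    print(replication_allowed("pii", "eu-west-1", "eu-west-1"))
```

Because the decision depends only on the tag and the two regions, the same rule can be unit-tested in CI and then mirrored in whatever policy engine enforces it.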
Part 2 · 11 / 16
Data ingestion and analysis using the Elastic Stack
Live demonstration
~40 minutes walking through metrics collection, log ingestion, APM tracing, dashboards, and alerting — against a rig running on a separate Linux host.
Implementation architecture · 12 / 16
Data collection topology for Exercise 1
Three data pipelines running on a DigitalOcean droplet ship to Elastic Cloud: system metrics via Elastic Agent, Apache logs via Logstash, and APM traces from Spring Pet Clinic.
[Diagram: Part 2 data-collection architecture, DigitalOcean droplet → Elastic Cloud.
Control plane (Mac): SSH to the droplet · browser (HTTPS) to Kibana.
Droplet · NYC3 · Ubuntu 24.04 · elastic-panel-demo · 104.236.43.21:
elastic-agent.service · Fleet-enrolled as root · System integration · logs OFF · cpu · memory · process · disk · network · filesystem · load
logstash-panel.service · file input → grok COMBINEDAPACHELOG → geoip → ES output · 10,000 docs in apache-access-* · geo-enriched
petclinic.service · Spring Boot + JDK 17 + Elastic APM Java agent 1.52.0 · :8080 · service.name=petclinic · 1,500+ transactions · traffic.sh → curl loop against :8080
Elastic Cloud · GCP us-central1 · Elasticsearch 9.3.3 · 3 nodes · green:
Elasticsearch indices: metrics-system.* · apache-access-* · traces-apm-* · panel-demo-alerts
Fleet Server panel-demo-local · APM Server elastic-cloud-apm · Kibana (Discover · Dashboards · APM · Alerts)
Alert rules (metric threshold): CPU > 50% · memory > 85% → Server Log + Index actions · Mustache template → panel-demo-alerts
Flows: metrics check-in · bulk API · APM traces.]
Live demonstration · 13 / 16
What you're about to see
~40 min in Kibana
1
Fleet · Hosts · the baseline
Elastic Agent running unprivileged on a Linux host, reporting CPU / memory / disk / network / filesystem / process metrics via the System integration. Log collection disabled per briefing.
Observability → Infrastructure → Hosts → elastic-panel-demo
~6 min
2
Logstash → Discover · apache with geoip
Logstash file input, grok COMBINEDAPACHELOG, useragent, geoip filter. ECS-compliant output. 10,000 docs in apache-access-* across four May 2015 days.
Top countries: US 3,908 · France 865 · Germany 564 · Sweden 425 · India 423
~10 min
3
Dashboard · 5 panels on one canvas
Top 10 client countries · status-code distribution · requests over time broken down by status · Maps-integration client location layer · 95th-percentile response bytes over time.
Every field backed by real enriched data — not a mock
~6 min
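The 95th-percentile panel in step 3 corresponds to a percentiles aggregation nested inside a date histogram. Below is a sketch that builds such a request body as a Python dict. The field name http.response.body.bytes is an assumption based on the ECS-compliant output mentioned in step 2, the one-hour bucket is arbitrary, and the function name is hypothetical; the dashboard itself may be built differently (for example with Lens).

```python
import json


def p95_bytes_over_time(bytes_field: str = "http.response.body.bytes",
                        interval: str = "1h") -> dict:
    """Build a search body: p95 of response bytes per time bucket."""
    return {
        "size": 0,  # aggregations only, no hits
        "aggs": {
            "over_time": {
                "date_histogram": {
                    "field": "@timestamp",
                    "fixed_interval": interval,
                },
                "aggs": {
                    "p95_bytes": {
                        "percentiles": {
                            "field": bytes_field,
                            "percents": [95],
                        }
                    }
                },
            }
        },
    }


if __name__ == "__main__":
    # The body would be sent to the _search endpoint of apache-access-*.
    print(json.dumps(p95_bytes_over_time(), indent=2))
```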
4
APM · Spring Pet Clinic with errors
Elastic APM Java agent attached via -javaagent:. 1,500+ transactions across 7 Spring controllers. The /oups CrashController captures RuntimeException stack traces — drill-down to span-level.
Observability → APM → Services → petclinic → Errors
~10 min
5
Alerts firing · with Mustache-templated actions
Two metric threshold rules (CPU > 50%, memory > 85%) each with 2 actions. Server Log + Index connectors with Mustache template bodies — the Index action writes structured docs to panel-demo-alerts.
Alerts → Rules · Discover → panel-demo-alerts
~8 min
The final question, answered · 14 / 16
"How are the vitals of your local host while ingesting data? Any alerts triggered?"
Real values from the live dry run. Both rules fired on the first evaluation: the Pet Clinic JVM plus Logstash plus the agent itself were enough to cross the thresholds without any synthetic pressure.
CPU right now
system.cpu.total.norm.pct
Memory right now
system.memory.actual.used.pct
2 × 2
Actions per rule
Server Log + Index
Implementation notes
Rules use system.cpu.total.norm.pct and system.memory.actual.used.pct — the normalized/actual fields, not the legacy absolute-across-cores values the briefing cites. The normalized versions match what an operator means by "50%" and "85%".
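A worked example of why the normalized field matters, as a short Python sketch. The core count and readings are invented; the relationship shown (normalized value = summed-across-cores value divided by core count) is how the two fields relate, and the function names are hypothetical.

```python
def normalize_cpu(total_pct: float, cores: int) -> float:
    """Convert a summed-across-cores CPU ratio into a 0..1 host-wide ratio."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return total_pct / cores


def breaches(value: float, threshold: float) -> bool:
    """True when the value is strictly above the alert threshold."""
    return value > threshold


if __name__ == "__main__":
    cores = 4          # hypothetical host
    total_pct = 1.6    # 160% summed across 4 cores
    norm = normalize_cpu(total_pct, cores)
    print(f"normalized: {norm:.2f}")
    # The raw value would trip a 0.5 (50%) rule; the normalized one does not.
    print("raw breaches 50%:", breaches(total_pct, 0.5))
    print("normalized breaches 50%:", breaches(norm, 0.5))
```

On a multi-core host the summed value can exceed 100%, so a "CPU > 50%" rule written against it fires far earlier than an operator would expect.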
Roadmap — what I'd harden for production
Wire a real email connector — Gmail OAuth or SMTP, 2-minute swap in Stack Management
Replace hard-coded thresholds with rolling 7-day baselines per host
ML anomaly detection on a deployment tier that provisions ML nodes
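One way the rolling-baseline item could look, sketched in Python. The rule used here (mean plus k standard deviations over the trailing window) is my assumption, not something the roadmap specifies; the sample values, the default k, and the function name are illustrative.

```python
from statistics import mean, pstdev


def baseline_threshold(samples: list[float],
                       k: float = 3.0,
                       ceiling: float = 1.0) -> float:
    """Derive a per-host alert threshold from a trailing window of readings.

    samples: normalized CPU ratios (0..1) from the last 7 days for one host
    k:       how many standard deviations above the mean counts as abnormal
    """
    if not samples:
        raise ValueError("need at least one sample")
    threshold = mean(samples) + k * pstdev(samples)
    # Never ask for more than 100% utilisation.
    return min(ceiling, threshold)


if __name__ == "__main__":
    quiet_host = [0.10, 0.12, 0.11, 0.09, 0.10, 0.13, 0.10]
    busy_host = [0.55, 0.60, 0.58, 0.62, 0.57, 0.61, 0.59]
    print("quiet host threshold:", round(baseline_threshold(quiet_host), 3))
    print("busy host threshold:", round(baseline_threshold(busy_host), 3))
```

A fixed 50% rule would page constantly on the busy host and never on the quiet one; a per-host baseline alerts on deviation from that host's own normal.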
Top processes by CPU — live from system.process
What I'd explore next · 15 / 16
Three directions I couldn't fit in 40 minutes
The briefing explicitly asks for "depth of curiosity about Elastic's products." Here's where I'd take this exact rig next — each builds directly on what you just saw.
[Map: SSH brute-force trajectories targeting the NYC3 droplet · 33,415 events · 12 countries · 3 days of exposure]
Attempted — hit a licensing boundary
Cross-cluster search
I stood up a second ES node on elk.anthonygtellez.com with a Let’s Encrypt cert on the transport layer. The TLS handshake validates, but Elastic Cloud 9.x requires the cross-cluster API key for transport auth — which is an Enterprise-licensed feature on the self-managed side. The trial doesn’t include it.
Architecture is in place: on-prem node running, certs valid, transport port open. The licensing gate is the last step. On a production engagement with an Enterprise subscription, this connects in one API call.
Already collecting — live right now
SSH brute-force on this droplet
This host has a public IP with SSH open. I added Filebeat shipping /var/log/auth.log to a dedicated ssh-auth-* index.
Live counters: invalid-user probes · accepted (legitimate) · latest probe
Next step: Elastic Security detection rules for SSH brute-force, enrichment with threat intel, botnet classification.
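The two counters on this slide (invalid-user probes versus accepted logins) can be reproduced offline from the same auth log. The Python sketch below assumes OpenSSH's usual "Invalid user ... from ..." and "Accepted ... for ... from ..." message shapes; the sample lines, users, and IPs are invented, and the function name is hypothetical.

```python
import re
from collections import Counter

INVALID_USER = re.compile(
    r"sshd\[\d+\]: Invalid user (?P<user>\S+) from (?P<ip>\S+)")
ACCEPTED = re.compile(
    r"sshd\[\d+\]: Accepted (?P<method>\S+) for (?P<user>\S+) from (?P<ip>\S+)")


def summarize(lines):
    """Count probe vs. accepted events and tally probes per source IP."""
    totals = Counter()
    probes_by_ip = Counter()
    for line in lines:
        probe = INVALID_USER.search(line)
        if probe:
            totals["invalid_user"] += 1
            probes_by_ip[probe.group("ip")] += 1
            continue
        if ACCEPTED.search(line):
            totals["accepted"] += 1
    return totals, probes_by_ip


# Hypothetical log lines (documentation IP ranges).
SAMPLE = [
    "2024-05-17T10:05:03+00:00 demo sshd[1201]: Invalid user admin from 203.0.113.7 port 51234",
    "2024-05-17T10:05:09+00:00 demo sshd[1202]: Invalid user oracle from 203.0.113.7 port 51240",
    "2024-05-17T10:06:00+00:00 demo sshd[1203]: Accepted publickey for deploy from 198.51.100.4 port 40022 ssh2",
    "2024-05-17T10:06:30+00:00 demo CRON[1300]: pam_unix(cron:session): session opened",
]

if __name__ == "__main__":
    totals, by_ip = summarize(SAMPLE)
    print(dict(totals))
    print(by_ip.most_common(3))
```

The per-IP tally is the raw material for the detection rules mentioned above: a source with many invalid-user probes in a short window is the brute-force signal.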
Where I'm leaning in
Vector search on log messages
"Show me unusual patterns in the 404s" as a natural-language query, not a grep. Embed log messages with a dense vector model, index into the same cluster, expose via the _search API. This is where my work at BNY Mellon and SuriCon 2024 (Supercharging Security with RAG) lives — and it's the shape I'd bet on for the next 3 years.
Closing · 16 / 16
Thank you
Questions are welcome.
Appendix material lives in the repo at github.com/anthonygtellez/elastic — 12 rendered architecture diagrams, the full risk register, per-phase deep dives, the live droplet topology, and a side-by-side of the briefing requirements against what I built.

Jump to anything the panel wants to explore.
LinkedIn: /in/anthonygtellez
GitHub: @anthonygtellez
Website: anthonygtellez.com
Anthony G. Tellez