Our investigation uncovered a race condition in the CalEHot microservice’s caching layer, aggravated by recent configuration drift in the Kubernetes deployment. The remediation plan (a three-step patch, a rolling restart, and post-mortem automation) restored full functionality and introduced safeguards to prevent recurrence.

| Item | Detail |
|------|--------|
| Ticket ID | CALEHOT-98 |
| Opened by | Jane Liu (Support, Tier 2) |
| Date/Time Opened | 2026-03-12 09:17 UTC |
| Affected Service | CalEHot, real-time pricing engine (Java 17, Spring Boot) |
| Production Scope | 4 AWS regions (us-east-1, us-west-2, eu-central-1, ap-southeast-2) |
| SLA | 10 business days for “Critical – High Impact” tickets |
| Stakeholders | Product Owner (Mike Alvarez); Platform Engineering (Team “Nimbus”); Customer Success (Sarah Patel); End User (Retail Partner “FastMart”) |


(Screenshots attached in the full PDF version.)


| Time (UTC) | Action / Observation |
|------------|----------------------|
| 09:17 | Ticket logged: “Pricing API returns 500 for SKU 12345 in EU region.” |
| 09:30 | Automated alert (Prometheus) shows CPU spikes on pods `calehot-v3-*` in `eu-central-1`. |
| 10:05 | Support reproduces the error on staging; the stack trace points to `CacheProvider.get()` throwing `NullPointerException`. |
| 12:00 | Engineering triage identifies a recent Helm chart change (deployment `v3.2.1-rc2`). |
| 14:15 | Debug session reveals two concurrent threads writing to the same `ConcurrentHashMap` without proper synchronization, which is a race condition. |
| 16:00 | Temporary mitigation: disable cache refresh for affected pods; error rate drops from 27 % to < 1 %. |
| Next day (09:00) | Root cause analysis completed (see Section 4). |
| Day 3 | Patch `v3.2.1-fix-racing` built, unit-tested, and staged to `dev`. |
| Day 5 | Rolling restart across all regions; monitoring confirms steady state. |
| Day 7 | Ticket closed: “Resolved – Fixed underlying race condition, added regression test.” |
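
The 14:15 finding and the Day 7 regression test can be illustrated with a small sketch. The actual `CacheProvider` code is not shown in this report, so everything below is an assumption: `PriceCache`, `getRacy`, `getAtomic`, `PriceCacheRaceCheck`, and the price value are hypothetical names and data chosen for illustration. The sketch shows one common way a `ConcurrentHashMap` ends up in a race even though its individual operations are thread-safe: a compound check-then-act sequence (`get` followed by `put`). It also shows one possible remedy using `computeIfAbsent`, which the JDK documents as atomic for `ConcurrentHashMap`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical stand-in for the cache; NOT the actual CalEHot CacheProvider. */
final class PriceCache {
    private final Map<String, Long> prices = new ConcurrentHashMap<>();
    private final Function<String, Long> loader;

    PriceCache(Function<String, Long> loader) {
        this.loader = loader;
    }

    /** Racy: two threads can both see a miss, both load, and both write. */
    Long getRacy(String sku) {
        Long cached = prices.get(sku);
        if (cached == null) {              // check ...
            cached = loader.apply(sku);
            prices.put(sku, cached);       // ... then act: not atomic as a pair
        }
        return cached;
    }

    /** Atomic: the loader runs at most once per absent key. */
    Long getAtomic(String sku) {
        return prices.computeIfAbsent(sku, loader);
    }
}

/** Hypothetical regression check: hit one key from many threads at once. */
final class PriceCacheRaceCheck {
    /** Returns how many times the loader ran while `threads` workers raced. */
    static int loaderCallsUnderContention(int threads) {
        AtomicInteger loads = new AtomicInteger();
        PriceCache cache = new PriceCache(sku -> {
            loads.incrementAndGet();
            return 1999L;                  // made-up price in cents
        });
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                try {
                    start.await();         // hold all workers at the gate
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                cache.getAtomic("12345");
            });
        }
        start.countDown();                 // release all workers together
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return loads.get();
    }
}
```

Calling `PriceCacheRaceCheck.loaderCallsUnderContention(8)` should report a single loader invocation when `getAtomic` is used. Swapping in `getRacy` may report more than one, although a race of this kind does not reproduce on every run, so a passing run with the racy variant proves little.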