Remote Hypercare

Cyber Week is historically the most important week for TeamSPS, and during this time we give special attention to our systems and customers while executing on a hypercare playbook. Working remotely this year brought some extra challenges in execution. The following are a few ways we adapted our hypercare playbook to accommodate our team being fully remote.

  1. Hypercare lounge is open. Using Zoom, we recreated the in-office feel by keeping a channel open at all times for the team to gather. We were able to have in-depth discussions about current system performance and get the right people involved quickly when needed, just as if we were all together in the same physical location. When there was downtime, this also provided a platform for team bonding and casual conversation about food, kids, sports, and some jokes.
  2. Defined roles and responsibilities. Similar to our incident management procedures, we defined roles within our hypercare playbook so team members knew exactly what was expected of them. The hypercare commander was the current lead and main decision maker. They drove awareness and proactively engaged the on-call subject matter expert in conversation. They also led our standup calls, orchestrated our clear checks (see #3), and managed all communication channels. The on-call subject matter expert was the person on call for a service delivery team. They spent their time observing, and reacting to, system performance. They also participated in standup calls and performed our clear checks. There were other supporting roles too, such as hypercare marshal and customer captain.
  3. Clear check procedures. In an effort to keep our on-call subject matter experts (SMEs) engaged in observing our production systems, we performed a “clear check” on all services multiple times throughout the business day. The clear check is a procedure to review key performance indicators for services and confirm they are operating as expected. The hypercare commander started a Slack thread asking each on-call SME to perform a clear check on their service and report back its current status. This allowed any concerns to be raised proactively with the larger team and investigated quickly.
  4. PagerDuty as the source of truth. In previous years during hypercare, we had required teams to cover an in-office 12-hour shift, split between multiple team members, during the busiest hours of the day. We did this by publishing a calendar where folks would sign up for their on-site and on-call shifts. This year, instead of a published calendar, we used schedule overrides in PagerDuty to fill out our hypercare shifts. In any given 24-hour timeframe, we required at least three different SMEs rotating the on-call shift to ensure fresh eyes and minds at all times of the day.
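
The three-SME rotation rule from #4 lends itself to a quick automated check. The Python below is only an illustrative sketch, not the tooling TeamSPS used: the `Shift` record, the function names, and the sample SME names are all hypothetical, and it works on plain shift data rather than calling the PagerDuty API.

```python
from datetime import datetime, timedelta
from typing import Iterable, NamedTuple, Set


class Shift(NamedTuple):
    """Hypothetical record of one on-call shift (not a PagerDuty object)."""
    sme: str
    start: datetime
    end: datetime


def distinct_smes(shifts: Iterable[Shift], window_start: datetime,
                  window: timedelta = timedelta(hours=24)) -> Set[str]:
    """Return the SMEs whose shifts overlap the window starting at window_start."""
    window_end = window_start + window
    return {s.sme for s in shifts
            if s.start < window_end and s.end > window_start}


def meets_rotation_rule(shifts: Iterable[Shift], window_start: datetime,
                        minimum: int = 3) -> bool:
    """True when at least `minimum` different SMEs cover the 24-hour window."""
    return len(distinct_smes(shifts, window_start)) >= minimum


# Example data: three made-up SMEs covering eight hours each.
day = datetime(2020, 11, 30)
shifts = [
    Shift("alice", day, day + timedelta(hours=8)),
    Shift("bob", day + timedelta(hours=8), day + timedelta(hours=16)),
    Shift("carol", day + timedelta(hours=16), day + timedelta(hours=24)),
]
print(meets_rotation_rule(shifts, day))
```

A check like this only counts distinct people in the window. It does not verify that the shifts leave no gaps in coverage, which would need a separate pass over the shift boundaries.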

Another successful Cyber Week in the books for TeamSPS! Cheers!

Megan Tischler, Continuous Improvement Manager

Megan Tischler @mtischler