A couple of years ago, I learned a lesson from my CIO that has stuck with me. Marching toward an aggressive deadline to deliver a client capability, the CIO gathered his leadership team and asked the VP in charge of the project how confident he was in the launch. At that point, we had demonstrated 99.1% accuracy across our test suite, and the team's consensus was that we could get up to 99.5% with an additional month of testing — at a cost of several million dollars. We launched on time, handled the inevitable issues, and had a satisfied client. Knowing the right definition of success and how to accurately measure progress allowed the team to effectively manage a multitude of risks.
In my Security Culture Manifesto, I wrote of three factors that define an organization's culture of security: pace, tone at the top, and feel in the trenches. With attacker advantages consistently growing, pace — the ability to quickly deploy strong controls with comprehensive coverage — can be a game-changer for organizations constantly needing to play defense.
I recently had the opportunity to collaborate with one of our IT infrastructure teams on a deployment of an important countermeasure. The proof-of-concept was successful, we were done with our pilot, and we were ready to launch to the enterprise. The plan presented to me was to roll the new tool out to the company in three waves of 20%, each separated by a week "to let the dust settle," culminating in a final 40% push (the "slow roll" approach in the graph below).
I challenged the team to think outside the box about the deployment schedule.
Did all that waiting in the plan have a purpose, and did it increase our likelihood of success? Could we go faster while preserving reliability and end-user experience?
I suggested a 5/20/75 schedule with a 3-day waiting period between waves (the "fast roll"). My logic:
- Negatively affecting 20% of our users was still undesirable, so the first round could have been smaller - especially if there was legitimate concern the pilot couldn't cover enough use cases before hitting the point of diminishing returns on continued testing;
- The proposed waiting period was not one of active measurement (of CPU, RAM, bandwidth, response time, etc.) to determine whether the next wave was safe to launch. It was a passive period meant to give users a chance to report issues, which usually happens in the first 24 to 48 hours. Waiting longer did not significantly increase the likelihood of success, and it created a risk of its own: other changes in the regular ebb and flow of IT systems could land during the gap, making root-cause analysis harder if the new tool was indeed contributing to a problem;
- Our historical data from past software pushes did not indicate that the 2nd, 3rd, or 4th round had a higher likelihood of failure than the first, assuming the first succeeded.
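The difference between the two schedules can be made concrete with a small sketch. The wave sizes and waiting periods below come from the article; the day-by-day expansion logic is my own simplification (it assumes each wave launches instantly at the start of its window).

```python
# Compare cumulative user coverage of the "slow roll" (20/20/20/40, weekly)
# against the "fast roll" (5/20/75, 3-day gaps). Illustrative sketch only.

def cumulative_coverage(waves, gap_days, horizon_days):
    """Return the % of users covered on each day, given wave sizes (in %)
    and the waiting period (in days) between wave launches."""
    coverage = []
    for day in range(horizon_days):
        waves_launched = min(len(waves), day // gap_days + 1)
        coverage.append(sum(waves[:waves_launched]))
    return coverage

slow_roll = cumulative_coverage([20, 20, 20, 40], gap_days=7, horizon_days=28)
fast_roll = cumulative_coverage([5, 20, 75], gap_days=3, horizon_days=28)

# By day 7 of the rollout, the slow roll covers 20% of users while the
# fast roll has already reached 100%.
print(slow_roll[6], fast_roll[6])
```

Under these assumptions the fast roll reaches full coverage in under a week, while the slow roll takes three weeks to get there — two extra weeks during which part of the enterprise is still unprotected by the new control.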
In this case, "pace" was about giving our teams permission to succeed (or fail) quickly, and rewarding them for that speed... and for having a good plan that anticipated the most likely points of impact.
In addition to pace, the other hallmark I look for in any information security project is risk reduction. Specifically, I ask my teams:
- which sections of our threat model does the project address?
- what is the current state, and how are we measuring it (median time to detect event type X, # of events getting past layer A and impacting layer B, % of users exhibiting risky behavior Z, etc.)?
- what do we predict the reduction in these key indicators to be, if the project is successful, and are we prepared to measure the before/after impact of the new control?
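For one of the indicators above — median time to detect — the before/after comparison is straightforward to compute. The sample values here are purely illustrative, not real data:

```python
# Hypothetical before/after measurement of one key indicator:
# median time to detect an event of type X, in hours.
from statistics import median

detect_hours_before = [30, 52, 41, 8, 66, 24, 19]  # events before the new control
detect_hours_after = [6, 11, 4, 9, 14, 7]          # events after the new control

before = median(detect_hours_before)
after = median(detect_hours_after)
reduction_pct = (before - after) / before * 100

print(f"median time to detect: {before}h -> {after}h "
      f"({reduction_pct:.0f}% reduction)")
```

The point is less the arithmetic than the discipline: the "before" baseline has to be captured before the rollout, or the project's predicted risk reduction can never be verified.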
Ultimately, the strategy I prefer is the "extended pilot" — this optimizes both pace and end-user experience by methodically measuring the impact of a new control over a large population of users; once proven effective, it allows for low-risk, aggressive deployment. It is only when this approach is not available (e.g., because certain tools are designed to mitigate attacks that are seen rarely or at unpredictable intervals) that I have learned to push for the fast roll. It may seem counterintuitive at first, but when done right, moving with pace reduces the risk of both compromise and disruption.