Radical Incident Reduction – The 30-30-30 Program

By Randy Steinberg

Want to pump up your IT delivery with a program that excites executives, provides great business value and greatly lowers unplanned labor costs, incident volumes and service outages? The 30-30-30 Program is a fast-paced, attention-getting program that targets a 30% reduction in incident counts, a 30% reduction in resolution times and a 30% increase in support staff skills for getting to root causes and eliminating the incidents they cause. This article explains how the program works and presents some examples used in a large IT organization. The approach described can be used in any IT organization.

One of the biggest headaches in any IT organization is unplanned labor. This is time spent by IT executives, managers and support staff dealing with incidents and outages, rework, and recovery activities. In today’s DevOps world, this is often described as a form of technical debt, and it carries a price tag: typically $62.00 per hour on average for most IT organizations today. That figure doesn’t include other costs such as project delays, poor customer satisfaction, penalties and fines that are triggered when everyone has to drop what they’re working on to deal with these unplanned issues.

The main objective of the 30-30-30 Program is to target this headache. The 3 key objectives include:

Reduce Incidents - Proactively trend incidents and initiate actions to remove their underlying root causes, thereby stopping incidents before users see them and reducing unplanned labor costs.

Reduce Incident Impact - For those incidents that can’t be readily removed, reduce the time taken to initiate workarounds and resolutions, and take actions to minimize incident impact.

Increase Skillsets - Upgrade Service Desk and support team analysis and problem-solving skills to reduce the amount of labor spent on analyzing incidents and getting to root cause.

Here are the steps you can take to get your program going:

| Step | Action |
| --- | --- |
| 1 | Ensure reporting is in place for incidents |
| 2 | Determine levels of unplanned labor related to incidents |
| 3 | Agree the program with senior executive leadership |
| 4 | Train on problem management core concepts |
| 5 | Arm staff with techniques that get to root cause |
| 6 | Proactively raise problem tickets and monitor progress |
| 7 | Implement actions to resolve problems or reduce impact |

Step 1 - Ensure Reporting Is In Place for Incidents

Make sure you have adequate reporting in place that can provide basic metrics – these include:

  • Incident counts (ideally by service, application/device or support team)
  • Incident durations – typically from ticket creation to ticket resolution
  • At least 6 months to a year of ticket history

In addition, you will also need:

  • Problem Ticket logging and reporting capabilities
  • Continual Service Improvement (CSI) Register for logging and prioritizing resolution activities and their progress

While the above can be found in many IT Service Management systems, you can also handle these with spreadsheets if no such system is in place.
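As a rough illustration of the basic metrics listed in this step, the sketch below computes incident counts and average durations per service from a ticket export. The ticket records, the field layout and the timestamp format are all made-up assumptions; a real ITSM tool or spreadsheet export will look different.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical ticket export: (service, created, resolved). Values are made up.
tickets = [
    ("Email", "2024-01-03 09:00", "2024-01-03 13:00"),
    ("Email", "2024-01-10 08:30", "2024-01-10 10:30"),
    ("Telephone", "2024-02-01 14:00", "2024-02-02 14:00"),
]

FMT = "%Y-%m-%d %H:%M"  # assumed timestamp format of the export


def duration_hours(created, resolved):
    """Hours from ticket creation to ticket resolution."""
    delta = datetime.strptime(resolved, FMT) - datetime.strptime(created, FMT)
    return delta.total_seconds() / 3600


# Incident counts by service
counts = Counter(service for service, _, _ in tickets)

# Total incident duration (hours) by service
hours = defaultdict(float)
for service, created, resolved in tickets:
    hours[service] += duration_hours(created, resolved)

for service in sorted(counts):
    average = hours[service] / counts[service]
    print(f"{service}: {counts[service]} incidents, {average:.1f} h average duration")
```

The same grouping could be done by application, device or support team by swapping the first field of each record.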

Step 2 – Determine Levels Of Unplanned Labor Related To Incidents

In this step you will baseline your current unplanned labor costs. Here is an approach (taken from a real IT organization) to do that:

  1. Get the overall IT annual labor budget (A: $125M)
  2. Get the number of IT employees (B: 1,100)
  3. Determine the average IT salary cost (C = A / B, or $113,636)
  4. Estimate average hours worked in a year for IT employees (D: 1,832, an industry average)
  5. Calculate the IT hourly labor cost (E = C / D, or $62)
  6. Get the incident ticket duration in hours for each incident ticket (F, from your reports)
  7. Apply a Labor Factor (G: the percentage of ticket duration that incurred labor versus time the ticket was just sitting in a queue waiting; it’s okay to estimate, and our example used 5% for all tickets)
  8. Calculate the unplanned labor cost for each ticket (E * F * G) and sum these across all incident tickets (H)
  9. Divide the sum of unplanned labor costs (H) by the overall labor budget (A) to get the unplanned labor rate
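The nine steps above can be sketched in a few lines of Python. The budget, headcount, hours per year and 5% labor factor are the example figures from this step; the ticket durations are hypothetical placeholders that you would replace with your own report data.

```python
# Example figures from the steps above
annual_labor_budget = 125_000_000   # overall IT annual labor budget
employee_count = 1_100              # number of IT employees
hours_per_year = 1_832              # average hours worked per year
labor_factor = 0.05                 # share of ticket duration that incurred labor

average_salary = annual_labor_budget / employee_count   # about $113,636
hourly_rate = average_salary / hours_per_year           # about $62

# Hypothetical ticket durations in hours; replace with your incident report data
ticket_durations = [4.0, 2.0, 24.0, 72.0]


def ticket_cost(duration_hours):
    """Unplanned labor cost of one ticket: hourly rate * duration * labor factor."""
    return hourly_rate * duration_hours * labor_factor


unplanned_labor_cost = sum(ticket_cost(d) for d in ticket_durations)
unplanned_labor_rate = unplanned_labor_cost / annual_labor_budget

print(f"Average salary: ${average_salary:,.0f}")
print(f"Hourly labor cost: ${hourly_rate:,.2f}")
print(f"Unplanned labor cost: ${unplanned_labor_cost:,.2f}")
print(f"Unplanned labor rate: {unplanned_labor_rate:.6%}")
```

With only four sample tickets the rate is tiny; run against six months to a year of ticket history, the same calculation gives the baseline the Program is measured against.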

Step 3 - Agree the Program with Senior Executive Leadership

The results from Step 2 can be put in a spreadsheet or reporting tool and presented to executive leadership for their buy-in. Here is an example of what was shown to senior executives in our example IT organization:

Every dollar shown is wasted labor. Our example leadership did not like knowing that telephone incidents were costing them $2.3M a year, email issues $2.4M a year, etc. The target of the Program was to cut these costs by at least 30% across the board. This was what got buy-in from leadership to support the program and the areas highlighted above were to receive the most focus.

You can also highlight the unplanned labor portion of the IT labor budget. This is simply the total of all annual unplanned labor costs divided by the IT annual labor budget. For example, if your IT organization spends $10M on labor and the unplanned labor adds up to $2M, then 20% of the IT labor budget is being wasted on unplanned labor activities, not a percentage many executives like to see!
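The arithmetic behind these executive figures is simple enough to sketch. The snippet below applies the 30% target to the two service costs quoted above and reproduces the $10M / $2M example; the function names are illustrative only and not part of any tool.

```python
REDUCTION_TARGET = 0.30  # the Program's "at least 30%" cost reduction target

# Annual unplanned labor cost by service (the two figures quoted in the text)
unplanned_cost_by_service = {
    "Telephone": 2_300_000,
    "Email": 2_400_000,
}


def target_savings(annual_cost, target=REDUCTION_TARGET):
    """Savings if the annual cost is cut by the target percentage."""
    return annual_cost * target


def unplanned_share(unplanned_total, labor_budget):
    """Fraction of the labor budget consumed by unplanned labor."""
    return unplanned_total / labor_budget


# Highest-cost services first, since these receive the most focus
ranked = sorted(unplanned_cost_by_service.items(), key=lambda item: item[1], reverse=True)
for service, cost in ranked:
    print(f"{service}: ${cost:,.0f} per year, 30% target saves ${target_savings(cost):,.0f}")

# The $10M labor budget / $2M unplanned labor example
print(f"Unplanned share of labor budget: {unplanned_share(2_000_000, 10_000_000):.0%}")
```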

It is important that executive leadership be visible and seen as leading this program. This will set the proper tone for staff and management support of the initiative.

Step 4 - Train On Problem Management Core Concepts

For this step, establish a series of communication events to introduce and train staff on the program and problem management best practices. These events should include:

  • An overview of the 30-30-30 Program
  • Training on Problem Management processes and techniques
  • Staff and management roles and responsibilities for the program

Guidance and resources for Problem Management processes and techniques can be found in many places on the internet. The ITIL 2011 Service Operation book (Problem Management chapter) is one source, and third parties such as Kepner-Tregoe also offer material.

Step 5 - Arm Staff With Techniques That Get To Root Cause

As part of training, it also helps to establish a Problem Management Toolkit as a support aid for IT staff and management. This is just an inventory of different techniques that can be used to quickly control activities and drive down to root causes. The toolkit might include techniques such as:

  • 5 Whys
  • Fault Isolation
  • Hypothesis Testing
  • Observation Post
  • Pareto Analysis
  • Kepner-Tregoe Analysis
  • Affinity Mapping
  • Chronological Analysis
  • Pain Value Analysis
  • Fault Tree Analysis
  • Trend Analysis
  • Causal Loop Analysis
  • Service Outage Analysis
  • Problem Brainstorming

While there is not enough space here to delve into each in detail, much information on any of these can be found on the internet (type any of the above into a search engine). The ITIL book chapter mentioned earlier also describes many of these.

As a starting point, focus on the 5-Whys. This technique is simple, easy to use and effective. Its steps are:

  1. Describe what took place
  2. Ask “Why?”
  3. Listen to answer given
  4. Ask “Why?” again
  5. Repeat steps 2-4 until root cause is identified
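To make the loop in these steps concrete, here is a minimal sketch that records a chain of "Why?" answers. It is only an illustration: `five_whys` and its `ask_why` callback are hypothetical names, the answers are invented, and the cap of five questions is a convention rather than a rule (the root cause may surface sooner or later).

```python
def five_whys(problem, ask_why, max_whys=5):
    """Build the chain of answers from a problem statement toward a root cause.

    ask_why(statement) returns the answer to "Why?" for that statement,
    or None when nobody can go deeper (treated here as the root cause).
    """
    chain = [problem]
    for _ in range(max_whys):
        answer = ask_why(chain[-1])
        if answer is None:
            break
        chain.append(answer)
    return chain


# Hypothetical answers gathered in a review session (illustrative only)
answers = {
    "Users were locked out of Application XYZ": "Their logins matched two directory entries",
    "Their logins matched two directory entries": "Accounts were re-created during a migration",
    "Accounts were re-created during a migration": "The migration script had no duplicate check",
}

chain = five_whys("Users were locked out of Application XYZ", answers.get)
for depth, statement in enumerate(chain):
    print("  " * depth + statement)

root_cause = chain[-1]
```

In practice the answers come from people rather than a lookup table; the value of writing the chain down is that each link can be challenged before the last one is accepted as the root cause.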

As a guide, the table below suggests which analysis techniques may best suit which problem situations:

| Problem Situation | Suggested Analysis Techniques |
| --- | --- |
| Complex problems where a sequence of events needs to be assembled to determine exactly what happened | Chronological analysis; Technical observation post |
| Uncertainty over which problems should be addressed first | Pain value analysis; Brainstorming |
| Uncertain whether a presented root cause is truly the root cause | 5-Whys; Hypothesis testing |
| Intermittent problems that appear to come and go and cannot be recreated or repeated in a test environment | Technical observation post; Kepner–Tregoe; Hypothesis testing; Brainstorming |
| Uncertainty over where to start for problems that appear to have multiple causes | Pareto analysis; Kepner–Tregoe; Ishikawa diagrams; Brainstorming |
| Struggling to identify the exact point of failure for a problem | Fault isolation; Ishikawa diagrams; Kepner–Tregoe; Affinity mapping; Brainstorming |
| Uncertain where to start when trying to find root cause | 5-Whys; Kepner–Tregoe; Brainstorming; Affinity mapping |

Step 6 - Proactively Raise Problem Tickets and Monitor Progress

At this point, have IT staff, management and support teams proactively identify root causes and actions that reduce incidents and their duration. Some key considerations when undertaking this effort include:

  • Go for low-hanging fruit: don’t spend lots of time identifying actions that yield little reduction. However, tiny efforts and no-brainers should still be noted, as many of these in aggregate might yield significant savings
  • Focus on outcomes: every reduction opportunity proposed must provide an estimate of how many tickets will be avoided and an estimate of the unplanned labor cost savings (using the approach described in Step 2 earlier)
  • Document the cost saving opportunities estimated with the Step 2 approach: these will be used to confirm and prioritize everyone’s actions (see Step 7; findings may show that many teams need to be involved to implement the improvements)
  • Recognize that root causes may not always be technical (e.g. lack of skills, training, communication or ownership, or poor vendor support, may also be root causes)
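One way to produce the savings estimate asked for in the "focus on outcomes" point is to reuse the Step 2 formula on the tickets an action is expected to avoid. The sketch below assumes the $62 hourly rate and 5% labor factor from the Step 2 example; the ticket volume and average duration are hypothetical.

```python
def estimated_savings(tickets_avoided, avg_duration_hours,
                      hourly_rate=62.0, labor_factor=0.05):
    """Annual unplanned labor cost avoided if the tickets no longer occur.

    Defaults reuse the Step 2 example values; substitute your own baseline.
    """
    return tickets_avoided * avg_duration_hours * hourly_rate * labor_factor


# Hypothetical opportunity: 1,200 tickets a year avoided, 8 hours average duration
savings = estimated_savings(tickets_avoided=1_200, avg_duration_hours=8.0)
print(f"Estimated annual savings: ${savings:,.0f}")
```

Because the labor factor is itself an estimate, the result is best treated as an order-of-magnitude figure for ranking opportunities, not as a forecast.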

Key roles and their activities in the Program can include:

| Program Role | Key Program Activities |
| --- | --- |
| Problem Manager | Owns the program; Trains support staff; Assists in identifying problems, root causes and action items; Coordinates cross-service issues; Coordinates how activities will be prioritized; Monitors progress on actions being implemented; Monitors support team compliance with the Program |
| Service Owners and Support Team Managers (including the Service Desk) | Reviews data for their service or support team; Identifies problems and known errors; Identifies options and actions to undertake to remove those errors |
| Service Manager, Continual Service Improvement Manager or Project Office | Assists with business cases; Puts actions on CSI Register; Assists with prioritizing actions to be taken; Monitors the register to make sure actions are being implemented |

Step 7 – Implement Actions to Resolve Problems or Reduce Impact

A register should be maintained as each team identifies improvement actions. Actions get logged into the register along with their estimated cost saving opportunities. The Problem Manager then coordinates agreement across IT to prioritize which actions will take place. A monthly meeting is suggested to do this along with a review of progress of any actions agreed in previous meetings.

At a minimum, the CSI Register should contain the following:

  • ID Number to quickly reference the action (e.g. X001)
  • A brief title or description of the action (e.g. Fix Application XYZ User Lockouts)
  • A detailed description (e.g. Remove duplicate active directory entries to avoid access errors…)
  • Effort estimate (e.g. 1 month, 3 months, 9 months, etc…)
  • Savings Estimate (e.g. $40-60K)

The above is reviewed by the Problem Manager and key stakeholders to prioritize and identify action items to be undertaken. Those that are approved are then assigned to support teams and service owners for implementation. 
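A register with the minimum fields listed above can be modeled as a small record type. In this sketch the first entry combines the example values from the list into one illustrative record, the second entry is a pure placeholder, and ranking by estimated savings per month of effort is an assumption of this example; the Program leaves the prioritization method to the Problem Manager and stakeholders.

```python
from dataclasses import dataclass


@dataclass
class RegisterEntry:
    action_id: str
    title: str
    description: str
    effort_months: float   # effort estimate
    savings_low: float     # low end of savings estimate, in dollars
    savings_high: float    # high end of savings estimate, in dollars

    @property
    def savings_midpoint(self):
        return (self.savings_low + self.savings_high) / 2

    @property
    def savings_per_month_of_effort(self):
        # Assumed ranking metric for this sketch, not prescribed by the Program
        return self.savings_midpoint / self.effort_months


register = [
    RegisterEntry("X001", "Fix Application XYZ User Lockouts",
                  "Remove duplicate active directory entries to avoid access errors…",
                  effort_months=1, savings_low=40_000, savings_high=60_000),
    RegisterEntry("X002", "Hypothetical second action",
                  "Placeholder entry for illustration",
                  effort_months=3, savings_low=90_000, savings_high=120_000),
]

prioritized = sorted(register, key=lambda e: e.savings_per_month_of_effort, reverse=True)
for entry in prioritized:
    print(f"{entry.action_id} {entry.title}: "
          f"${entry.savings_midpoint:,.0f} over {entry.effort_months} month(s)")
```

A spreadsheet with the same columns works just as well; the point is that every action carries an effort and savings estimate so the monthly review can compare them.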

To communicate Program activities to key stakeholders and executives, the chart below presents one way of summarizing what the Program is undertaking:

Figure 2: Cost Savings Presentation Example

At a minimum, the Program should provide communications on a monthly basis showing business-based outcomes such as lower costs and reduced incident counts.

On an ongoing basis, the Problem Manager also monitors compliance with the Program. This can be done by looking at the incident counts for each support team or service and comparing them against measures such as the number of problem tickets raised, total cost savings identified and implementation assignments. The bottom line is to make sure each team or service is proactively working to improve things and not just being reactive.
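The compliance comparison described above could be automated along these lines. Everything in this sketch is an assumption for illustration: the team figures are made up, and the threshold of one problem ticket per 100 incidents is an arbitrary starting point, not a Program rule.

```python
# Hypothetical monthly figures per support team (made-up values)
team_stats = {
    "Network": {"incidents": 420, "problem_tickets": 6, "savings_identified": 85_000},
    "Email": {"incidents": 610, "problem_tickets": 0, "savings_identified": 0},
    "Desktop": {"incidents": 150, "problem_tickets": 2, "savings_identified": 12_000},
}

MIN_PROBLEMS_PER_100_INCIDENTS = 1.0  # assumed threshold; tune to your organization


def problems_per_100_incidents(stats):
    """Problem tickets raised per 100 incidents handled."""
    if stats["incidents"] == 0:
        return float("inf")  # no incidents, so nothing to be reactive about
    return 100 * stats["problem_tickets"] / stats["incidents"]


def reactive_teams(all_stats, threshold=MIN_PROBLEMS_PER_100_INCIDENTS):
    """Teams raising fewer problem tickets than the threshold for their incident volume."""
    return sorted(
        team for team, stats in all_stats.items()
        if problems_per_100_incidents(stats) < threshold
    )


for team in reactive_teams(team_stats):
    stats = team_stats[team]
    print(f"{team}: {stats['incidents']} incidents, "
          f"{stats['problem_tickets']} problem tickets raised")
```

A flagged team is a prompt for a conversation with the Problem Manager, not a verdict; cost savings identified and implementation assignments can be compared in the same way.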

Challenges You May Run Into

As a last note, the table below identifies challenges you may run into with this Program and what you might do about them:

| Challenge or Risk | Suggested Mitigation |
| --- | --- |
| Staff struggle to find root causes | Consider stronger use of Problem Management Toolkit techniques |
| Not enough time to proactively find problems | Time-box efforts, e.g. commit to 2 hours per week to focus solely on problems, or assign a resource part time |
| Staff are unsure how Problem Management processes might work | Contact the Problem Management team for guidance |
| Service supported is problematic and has many issues | Prioritize the improvements that provide the greatest bang for the effort and whittle away at the rest in small chunks |
| Too many escalations from the Service Desk for similar issues and incidents | Publish Known Errors so the Service Desk can better resolve issues independently |
| Can’t always determine the business impact of outages | Check the CMDB, ticket documentation or prior instances of the issue |
| IT doesn’t want to fund implementation activities | Rely on Known Errors and workarounds, but keep improvement actions logged in the CSI Register with their cost opportunities |

Author

Randy Steinberg

ITSM Process Architect