
Metrics That Matter – A Model For IT Decision Making

By Randy Steinberg

Many IT organizations mistakenly start their IT metrics efforts by looking at their monitoring and reporting tools and then backing into the metrics that those tools happen to provide. The improvement focus then becomes limited to the available tool sets rather than what is really critical to success.

We have also seen the monthly IT service reports or web portals that few people look at. These are often thrown together in reaction to management demands that IT report on metrics, and they are frequently ignored because the metrics in them have little meaning or significance for the audience viewing them. There is an opportunity to provide value to IT leadership proactively by giving them the information and capabilities to make timely, accurate decisions and initiate improvement actions when necessary.

Question: What do the following metrics have in common?

  • Number of changes performed this month
  • Number of Severity 1/Severity 2 incidents that occurred this month
  • Number of high priority requests that were fulfilled this week
  • Server availability percentages
  • List of routers and their availability this month

Answer: They are all commonly reported by many IT organizations, yet they merely describe interesting historical events and carry almost no significance for IT managers and executives. They do not lead to the decisions and actions needed to justify, direct, intervene in or validate key IT business decisions.

Here are some examples of metrics that lead to decision making:

  • Metric: Percentage of all IT changes (and deployed applications) processed on time with no errors
    Key management concern addressed: Is IT effectively handling business change without disrupting day-to-day activities?

  • Metric: Percentage of IT planned labor hours versus unplanned labor hours
    Key management concern addressed: How much labor and people cost is being spent on providing value for the business versus just keeping the wheels running?

  • Metric: Percentage of incidents that disrupted end-users
    Key management concern addressed: What return on value are we receiving for our investment in monitoring and automation tools? (For example, if IT resolved 1,000 incidents this month but only 10% were ever seen by the business, the monitoring is providing value, versus a situation where the business sees incidents before IT does.)

Each of the above examples has great significance for management. When viewed against acceptable and unacceptable tolerance ranges, they lead to decisions and actions to investigate and improve the IT organization.

When struggling to identify the “metrics that matter” for your IT organization, ask the following three questions about each metric you propose to collect and report on:

  • What decisions or actions are we expecting management to make when they see this metric?
  • Does this metric indicate whether key business outcomes are being achieved?
  • Will management really care about what the metric is indicating?

If the metric you are proposing yields a NO answer to all three of the above questions, then most likely it carries little significance for management or leadership. To illustrate:

Consider the metric: number of changes performed this month. If we run this metric through the three questions above, it becomes apparent how little significance it carries for IT management. If we report that 1,000 changes happened this month, is this a good or bad thing? What actions or decisions should management make about this? Other than IT, does anyone else care?

Now look at this metric: Percentage of all IT changes processed on time with no errors. What decisions or actions might occur if that percentage came in at 40%? Wouldn’t this trigger a call for management action?

A suggested approach is to use a simple system consisting of Operational Metrics, Key Performance Indicators (KPIs – these are the “metrics that matter”), Tolerance Thresholds and Critical Success Factors (CSFs – statements that indicate “what must be done for success”).

These can be put into a model that might look like the following:

[Figure: metrics model. Operational Metrics feed Key Performance Indicators, which are judged against Tolerance Thresholds and roll up to Critical Success Factors; results are presented through reports and dashboards.]

Operational Metrics

These are basic observations of operational events. They represent data used to support all other metrics. They will be used to calculate the Key Performance Indicators. Inputs for these can come from a variety of places such as a Change Management System, Incident Management System, Service Desk Reports, Surveys and other means. Examples might include:

  • Total Number Of Changes Implemented
  • Number Of Incidents Reopened
  • Number Of Problems In Pipeline
  • Number Of Calls Handled
  • Customer Satisfaction Rating
  • Total Expended IT Costs

Key Performance Indicators (or KPIs)

These are the metrics that matter. They lead to actions and decisions; in this regard they represent information about your IT organization. KPIs can be individual metrics or can be calculated from two or more operational metrics. Consider the following example, the Change Success Rate, which combines two operational metrics:

Change Success Rate (%) = (Number of Changes Implemented Successfully ÷ Total Number of Changes Implemented) × 100

This KPI would indicate whether the Change Management function is working and providing value.
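As a hedged illustration of that calculation (the function and parameter names below are assumptions; the article does not prescribe an implementation):

```python
def change_success_rate(successful_changes: int, total_changes: int) -> float:
    # Change Success Rate KPI: percentage of implemented changes that
    # completed on time with no errors.
    if total_changes == 0:
        return 100.0  # assumption: a period with no changes counts as success
    return 100.0 * successful_changes / total_changes

# Example: 950 of 1,000 changes succeeded -> 95.0 percent
print(change_success_rate(950, 1000))
```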

Tolerance Thresholds

These are high and low indicators that, based on agreed assumptions, signal a call to action. Here are some examples using the Change Success Rate KPI shown previously:

  • Tolerance threshold: Value is 95% or above
    Likely action: No action needed

  • Tolerance threshold: Value is between 85% and 95%
    Likely action: Closely monitor how changes are being processed and take action if this occurs for several months in a row

  • Tolerance threshold: Value is below 85%
    Likely action: Take immediate action to identify root cause and fix change management issues
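A minimal sketch of how those bands could be encoded, using the 85% and 95% cut-offs from the table above (the function name and action strings are illustrative):

```python
def change_success_action(rate_percent: float) -> str:
    # Map the Change Success Rate KPI onto the likely management action
    # defined by the tolerance thresholds above.
    if rate_percent >= 95.0:
        return "No action needed"
    if rate_percent >= 85.0:
        return "Closely monitor; act if this persists for several months"
    return "Immediate action: identify root cause and fix change management"

print(change_success_action(40.0))  # -> immediate action
```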

 

Critical Success Factors (CSFs)

These represent the things that must be done to ensure business outcome success. Through goals, they provide guidance, or guard rails, about the true effectiveness of your IT organization. The outcome of a CSF is either TRUE or FALSE: either the organization is meeting its goals or it is not.

Each CSF has one or more KPIs that relate to it. If all KPIs linked to a CSF are within acceptable tolerance ranges, then the CSF is being met; otherwise it is at risk. CSFs can be tied to key business or IT processes and outcomes. A set of these (e.g. for the Change Management process) might include:

  • Protect Services When Making Changes
  • Make Changes Quickly And Accurately In Line With Business Needs
  • Utilize A Repeatable Process For Handling Changes

The Change Success Rate KPI example would most likely apply to the first two of the above CSFs. In other words, a low success rate means that we are not protecting our services when changes are made and probably not making them quickly and accurately enough to satisfy business needs. In a business undergoing rapid change, such as a merger or a push into new markets, failure to achieve those two CSFs could be deadly.
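A sketch of that TRUE/FALSE roll-up for those two Change Management CSFs, assuming a single linked KPI with a 95% acceptable minimum (the values here are invented for illustration):

```python
# Each CSF is met only if every KPI linked to it is within tolerance.
# Format: CSF statement -> list of (kpi_name, current_value, acceptable_minimum)
csf_links = {
    "Protect Services When Making Changes": [
        ("Change Success Rate", 40.0, 95.0),
    ],
    "Make Changes Quickly And Accurately In Line With Business Needs": [
        ("Change Success Rate", 40.0, 95.0),
    ],
}

for csf, kpis in csf_links.items():
    met = all(value >= minimum for _, value, minimum in kpis)
    print(f"{csf}: {'MET' if met else 'AT RISK'}")
```

With a 40% success rate, both CSFs report AT RISK, which is exactly the call for management action described above.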

Reports and Dashboards

These present metric results to the different audiences who will then apply their experience and wisdom to identify needed improvement actions and decisions. Items such as CSFs and their attainment are candidates for an executive dashboard, while KPIs and their supporting operational metrics may appear as drill-down detail for IT management.

What About Things That Cannot be Measured With Tools?

What should you do about metrics that cannot be easily measured with the tools the organization has in place? The wrong answer is to ignore the metric and not report on it. Not measuring at all can be worse than using temporary measurements that provide some level of visibility into what is going on.

For these, some approaches can include:

  • Taking analogous measures, wherein one metric is assumed to represent another – for example: “all calls to the service desk will be considered incidents”
  • Using random audits to represent the state of the metric – for example: “availability of servers A, B and C from 12pm to 3pm on Wednesdays will be used to state availability for all servers in our infrastructure”
  • Using one metric that can be measured to represent one that cannot – for example, if the number of staff working on web portal incidents throughout the day exceeds 2, it will be assumed that the incident rate for the web portal is unacceptably high
  • Programming scripts that simulate transactions to represent the response time or availability of those transactions (see the sketch after this list)
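For the scripted-transaction approach, here is a minimal sketch using only the Python standard library (the URL, timeout and status handling are placeholders, not values from the article):

```python
import time
import urllib.request
from typing import Tuple

def probe(url: str, timeout: float = 5.0) -> Tuple[bool, float]:
    # Simulate one user transaction; report (available, response_seconds).
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            ok = 200 <= response.status < 400
    except Exception:
        ok = False
    return ok, time.monotonic() - start

available, seconds = probe("https://example.com/")  # placeholder URL
print(f"available={available}, response_time={seconds:.2f}s")
```

Run on a schedule, results like these can stand in for transaction availability and response-time metrics until proper tooling is in place.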

While some might challenge the accuracy of the above approaches, they at least provide some measure of visibility. To gain more accuracy, the organization can invest in new tools or improvement projects that provide better measurement.

Author

Randy Steinberg

ITSM Process Architect
