One of the biggest time drains for IT support teams dealing with business-critical issues at the point of failure is jumping to ‘false causes’ at the first point of contact with a problem. It is human instinct to jump in quickly and attempt to ‘fix’ a problem, especially when the team is made up of experienced and highly skilled professionals. Companies, quite rightly, hire the best people they can find to support their IT: people with knowledge, experience and skill. Interestingly, this in itself can lead to problems.
The likelihood is that, 90% of the time, IT professionals have seen a failure before and will know exactly what steps to take to resolve it. At the same time, systems are increasingly complex, and often what we see is a symptom of the problem rather than the actual cause. For example, it may be that the last five times we saw a drop in transaction rates on a critical payment system, it was triggered by a security upgrade. This time the symptom looks exactly the same as before, so we roll back the upgrade, and that doesn’t help. It later transpires that on this occasion our network provider had made some routine changes to the routing protocol, which had affected the data verification.
Time wasted on a false cause is a frequent occurrence. This kind of time drain can only be eliminated if we start to place as much reliance on data as we do on knowledge and past experience, because experience without adequate information can cloud our judgement.
The assumptions we make are usually exacerbated by an over-reliance on the knowledge and experience we have. At the same time, customer or user pressure to get things fixed instantly is rarely helpful, as it encourages speed at the expense of effectiveness. A good analogy: if your loved one was in emergency surgery, you wouldn’t shout at the surgeon to hurry up with the operation, nor would you make suggestions about where he or she should make an incision.
The problem with IT support teams arises when we hire people only for their technical expertise, without asking whether they are confident dealing with stressed-out colleagues or customers while under pressure. Thankfully, many firms are starting to realise that being a ‘techy geek’ isn’t the only, or even the most important, qualification to look for when recruiting. It is far more useful to have an IT team that possesses a mix of skills and talents. Many firms deliberately staff the first line of contact with non-technical (and sympathetic) personnel, so that the customer has a chance to air their frustrations before the serious business of diagnosis and correction begins.
Count to ten and take three deep breaths
Managers also need educating on dealing with business-critical scenarios. Today it’s becoming unacceptable to call yourself a leader and then simply pass the pressure on to your team when something goes drastically wrong. Managers are under pressure from the lines of reporting above them, and they must realise that it’s part of the job to absorb that pressure and present a calm and focused attitude to their teams. Many would argue that the old-fashioned advice actually works: count to ten, take three deep breaths and then speak. Technical or not, managers also need to understand the most effective flows of problem solving, so they can quickly recognise when something’s not working and give assistance.
Holding large-scale meetings, especially conference calls, in an attempt to solve issues can also complicate matters, particularly when people go off on tangents. Different types of issues need different approaches. Where things are confusing, the conversation must be confined to clarification. There is no room for solutions if you don’t understand what you are dealing with. Where there’s a clear departure from normal function, people need to be in investigative mode, looking for evidence and using it to figure out the causes.
There are frequently multiple options for getting services back on track, and here the conversation has to be about what those options are and how to decide quickly which is best. Keep an open mind too: things can always get worse, and understanding and managing those risks involves another kind of thinking. To help meetings progress usefully, you can draw boxes around these four different types of work: Clarification, Investigation, Solutions and Risk. This helps to retain focus and avoids confusion and going off track.
Gathering the right information at the first point of contact with a problem is vital because incomplete or inaccurate information can lead to an incorrect diagnosis resulting in the wrong actions being taken and more customer frustration. Too often when reporting issues, we tend to mix up what’s really there with what we think is going on and then present our whole story as if it were fact.
A simple example from the helpdesk:
IT: “Good morning, how can I help?”
User: “The server’s down again!”
IT: “I’m so sorry. What exactly do you see on your screen right now?”
User: “It’s my login screen.”
IT: “And what is that doing that it shouldn’t?”
User: “It won’t accept my password.”
Although the IT support team are often technically qualified people, it’s short-sighted to assume that all other users are clueless about tech. But as users, almost all of us embellish the real data with other things we think are interesting or helpful, without necessarily sorting out what’s relevant before we give the story to the helpdesk. Luckily, IT professionals are often able to use tools to see exactly what is affecting the service and get accurate data that way. If not, they have to prompt their users for a detailed description of what they saw, and ideally the exact time it started, as well as the last time they were able to use that same service successfully.
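For teams that log calls in their own tooling, the separation between what the user saw and what the user believes can be built into the record itself. The following is only a minimal sketch in Python: the class name FirstContactRecord and all of its fields are hypothetical, chosen for illustration, and do not come from any particular helpdesk product.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class FirstContactRecord:
    """Hypothetical helpdesk record that keeps fact and opinion apart."""
    observed: str                               # what the user actually saw
    user_interpretation: str = ""               # what the user thinks is going on
    started_at: Optional[datetime] = None       # exact time the issue began
    last_worked_at: Optional[datetime] = None   # last successful use of the service

    def missing_details(self) -> list:
        """Return the prompts the helpdesk still needs to ask."""
        missing = []
        if not self.observed.strip():
            missing.append("what exactly did you see?")
        if self.started_at is None:
            missing.append("when exactly did it start?")
        if self.last_worked_at is None:
            missing.append("when did it last work?")
        return missing


# The helpdesk call from the example, recorded as observation plus interpretation.
record = FirstContactRecord(
    observed="Login screen won't accept my password",
    user_interpretation="The server's down again",
)
print(record.missing_details())
```

Here the user’s theory is kept, but it is never mistaken for the observation, and the record itself shows which questions are still unanswered.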
To avoid jumping to false causes, you have to focus on using all of the available data. The easiest way to avoid the trap is to ask how the assumed cause fits the data. Suppose, for instance, that we have an outage in Data Centre 1 and the Incident Manager immediately points to the change that was implemented overnight. If we know that Data Centre 2 is still working, we can ask: “How come Data Centre 2 isn’t affected? The same change was run on all Data Centres.” For any suggestion to be valid, it must be able to show how it fits the facts on the ground. If it doesn’t, it’s simply not correct.
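The Data Centre question can be written down as a simple consistency check. This is a sketch under stated assumptions, not a definitive method: the Site class and the cause_fits_data function are hypothetical names, and the rule applied is the one described above, namely that a suspected cause must be present wherever the failure is and must account for every place that is still working.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """One place where the service runs (hypothetical model)."""
    name: str
    is_failing: bool
    changes_applied: frozenset


def cause_fits_data(cause: str, sites: list) -> tuple:
    """Return (fits, contradictions) for a suspected cause.

    The cause fits only if every failing site received it and
    no working site received it.
    """
    contradictions = []
    for site in sites:
        has_cause = cause in site.changes_applied
        if site.is_failing and not has_cause:
            contradictions.append(f"{site.name} is failing but never received {cause}")
        elif not site.is_failing and has_cause:
            contradictions.append(f"{site.name} received {cause} but is still working")
    return (not contradictions, contradictions)


# The example from the text: the same change ran everywhere, but only one site failed.
sites = [
    Site("Data Centre 1", True, frozenset({"overnight change"})),
    Site("Data Centre 2", False, frozenset({"overnight change"})),
]
fits, contradictions = cause_fits_data("overnight change", sites)
print(fits)
for line in contradictions:
    print(line)
```

A contradiction does not always mean the change is irrelevant, but it does mean the cause as stated is incomplete: something else must differ between the two sites, and that difference is where the investigation should go next.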
Focus on root cause
Probably the most important aspect of solving critical issues in the right way is to ensure that the IT support environment supports detecting and solving problems for the long term. There are some really big challenges here. Engineers, on the whole, rather like solving problems. Even so, time pressure often means that once the symptoms have been made to go away, there’s not much incentive to check that the fix really did correct the root cause. That’s why ITIL distinguishes between two essential processes: Incident Management, to effect a speedy restoration of service, and Problem Management, to find and correct the cause.
In summary, the most important change an organisation can make is to focus on the quality of work: the completeness and accuracy of the details recorded by support staff in all roles. How well have they captured the data? How completely have they explained how their assumed cause fits the data? Some organisations make regular quality checks of case records and provide feedback, but most don’t, and without that all-important feedback, things are unlikely to improve.