Moving digital infrastructure to the cloud is steadily becoming best practice for businesses, thanks to numerous associated benefits – including flexibility, cost efficiency and overall performance. Digital transformation has accelerated the shift, and the International Data Corporation (IDC) has predicted that global spending on digital transformation will rise to $2 trillion in 2022. But this rush to online and cloud-based processes has left some companies without a clear strategy for IT monitoring, resulting in a smorgasbord of unconnected and unfit monitoring tools. For those companies, the threat of disruption through outages is ever-present.

The real impact of IT cloud outages is distorted because they often go unreported. Companies proudly disclose figures highlighting their low number of outages, but just because they haven’t experienced a total shutdown doesn’t mean an outage hasn’t occurred – only that they have managed to keep services running at reduced capacity. Outages are therefore far more prevalent than the headline data suggests, and robust, centralised monitoring software is essential for system-wide visibility – spotting performance issues across both physical infrastructure and the cloud before it’s too late.

No company is safe. In 2018 alone, Google Cloud, Amazon Web Services and Azure all experienced highly disruptive cloud outages, with far-reaching financial and reputational consequences. The financial sector had a particularly tumultuous year: a Financial Conduct Authority report showed that in Q2 2018, Britain’s five largest banks – HSBC, Santander, RBS, Barclays and Lloyds Banking Group – suffered 64 payment outages. In response to this plethora of disruption, the FCA has now decreed that financial services may have their services interrupted for no more than two days. Firms that want to remain competitive, however, should really be aiming for a zero-downtime model, as customers will no longer stand for poor-quality service.

The severity of major outages has been shown by an illuminating report from insurer Lloyd’s and risk-modeller AIR Worldwide, which calculated that a three-to-six-day incident at one of the top US cloud providers (such as Google or AWS) would result in losses to industry of $15bn. It’s abundantly clear that organisations cannot afford outages of any kind, and that appropriate measures must be taken to mitigate the risk of them happening.

The consequences of IT outages

The current digital climate leaves no room for negative customer interactions; modern technology has led people to expect a constantly high level of service. In an ‘always on’ culture, any disruption to day-to-day services has a debilitating effect on customer trust. With so much choice, flexibility and at times even incentives to switch providers, disruptions can drive customers to competitors – so firms can no longer risk a band-aid-over-the-bullet-hole approach.

In April 2018, TSB suffered a catastrophic incident when an error made during an IT systems upgrade locked 1.9 million people out of their accounts, some for up to two weeks; all told, the bank lost £330m in revenue. In the same month, an outage at Eurocontrol, which manages air traffic across much of Europe, left 500,000 passengers stranded across the continent. British Airways also experienced an outage in its third-party flight booking software: with 75,000 travellers affected over a three-day period, it lost an estimated £80m in revenue and a further £170m off its market value. With a Gartner report asserting that such outages can cost up to $300,000 per hour, a unified solution is key to effective IT monitoring.

The financial ramifications of an outage are plain to see, but regardless of the sector they operate in, organisations also need to exercise effective crisis management and be upfront with their customers. When outages do occur, organisations must relay reliable, up-to-date information to their stakeholders to limit the damage to their reputation. TSB showed exactly how not to do this: 12 days into its outage, it insisted on social media that things were ‘running smoothly’, even though some customers had been locked out of their bank accounts for nearly a fortnight. As a result, TSB lost 12,500 customers.

Why a unified approach is key to success

Gaining insight into an IT system’s performance is always a challenge, especially with the growing problem of ‘tool sprawl’ – something companies either opt for in desperation or are stuck with because decentralised systems don’t communicate with each other. Organisations are often reluctant to update: the disruption of implementing a wholly new IT monitoring system can seem daunting, or too great a risk when weighed against a merely theoretical future outage. The result is that many companies run sprawling IT systems that are continuously patched.

The key to countering the problem of cloud outages is a single-pane-of-glass solution that provides visibility across all of a business’s IT systems. Yet Enterprise Management Associates has found that many companies use up to ten monitoring tools at once, creating data islands, diluting their data and taking between three and six hours on average to find performance issues. Simply put, companies commonly have unfit solutions in place, built for static on-site systems rather than today’s cloud and virtualised environments. By housing analytics and system data in a single unified tool, organisations gain a clearer picture of system health, availability and capacity at all times.
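The idea of consolidating siloed metrics into one view can be sketched in a few lines of code. The sketch below is purely illustrative – the system names, metric fields and thresholds are all hypothetical, and a real tool would poll agents or cloud-provider APIs rather than hard-coded values – but it shows how health, latency and capacity data from disparate systems can be reduced to a single consolidated status report.

```python
from dataclasses import dataclass

# Hypothetical snapshot of one system's health; in practice this would be
# populated by polling monitoring agents or cloud-provider APIs.
@dataclass
class Metric:
    system: str
    availability_pct: float   # rolling availability over the last period
    latency_ms: float         # average response latency
    capacity_used_pct: float  # storage/compute utilisation

def classify(m: Metric) -> str:
    """Reduce raw metrics to a single traffic-light status.

    Thresholds here are illustrative assumptions, not recommendations.
    """
    if m.availability_pct < 99.0 or m.latency_ms > 1000:
        return "CRITICAL"
    if m.capacity_used_pct > 85 or m.latency_ms > 300:
        return "WARNING"
    return "OK"

def single_pane(metrics: list[Metric]) -> dict[str, str]:
    """One consolidated view across on-premise and cloud systems."""
    return {m.system: classify(m) for m in metrics}

# Example snapshots for three hypothetical systems.
metrics = [
    Metric("on-prem-db", 99.99, 45.0, 62.0),
    Metric("cloud-api",  99.95, 320.0, 40.0),
    Metric("payments",   98.70, 210.0, 55.0),
]
view = single_pane(metrics)
```

The point of the pattern is that every system – physical or cloud – feeds the same classification logic, so there is one place to look when deciding whether an outage is brewing, rather than ten disconnected dashboards.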

Outages are a fact of life, but companies should do their utmost to mitigate against them and, when they do occur, have the correct tools in place to find the issue and rectify it. Restoring service in a timely manner reduces downtime and prevents loss of revenue. All of this should be done while keeping customers informed of progress – unlike TSB’s self-destructive ‘information vacuum’ approach.

As digital transformation continues to accelerate across businesses, IT systems will grow ever more complex – and effective monitoring tools will have to be deployed to meet the challenge. Regular threat and vulnerability assessments, along with configuration reviews and operational-process validation checkpoints, can reduce the odds of suffering a critical failure. For this reason, the importance of a single-pane-of-glass monitoring tool – consolidating siloed teams and removing the blind spots caused by overlapping systems that isolate data and fail to communicate with each other – cannot be overstated.