Saturday, 15 June 2013

Dark clouds on the horizon? A fatal flaw in the cloud services ideology

We have all heard of the cloud, right? That fantastic place where you have no server worries, everything remains up, backups are reliable and networks never fail. The cloud is instantly scalable and is cheaper than having your own servers. You get 24 hour, non stop support and monitoring with automatic fault correction.

The magic of the cloud is made possible by economies of scale: In the "old" world, when everyone has their own server(s), companies need to plan capacity for the maximum possible usage, the peaks. This means that on average, the servers are underutilised. The same is true for the network connections and the rest of the service infrastructure including the personnel required to keep it all working.

As different systems have different peak times (performance profiles), the greater the number of systems, the more the peak loads are spread. Let's imagine a system (A) that peaks at 4 "CPU requirements", and another system (B) that also peaks at 4 "CPU requirements". In the "old" world, these two systems would each have their own capacity, i.e. 8 CPUs of "capacity" in total. However, if these two systems peak at different times, and are otherwise idle, then a system of 4 CPUs is adequate to run both systems, provided that the "CPU capacity" can be supplied on demand.

By sharing the resources on demand, we have therefore saved "4 CPUs" of capacity. If the costs are shared equally between the systems, then each system pays for just 2 CPUs (4 CPUs / 2 systems) while in reality having 4 CPUs of capacity available at its peak. I.e. each system gets its infrastructure at 50% of the cost.
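
The arithmetic above can be sketched in a few lines of Python. The hourly load profiles and the peak hours (09:00 and 21:00) are invented for illustration; the only point is that the peak of the combined load can be much lower than the sum of the individual peaks.

```python
# Hypothetical hourly CPU demand for two systems over one day.
# System A peaks in the morning, system B in the evening; both are idle otherwise.
load_a = [0] * 24
load_b = [0] * 24
load_a[9] = 4   # A needs 4 CPUs at 09:00 (assumed hour)
load_b[21] = 4  # B needs 4 CPUs at 21:00 (assumed hour)


def dedicated_capacity(*profiles):
    """Capacity needed when every system owns servers sized for its own peak."""
    return sum(max(profile) for profile in profiles)


def shared_capacity(*profiles):
    """Capacity needed when all systems draw from one pool on demand."""
    return max(sum(hour) for hour in zip(*profiles))


dedicated = dedicated_capacity(load_a, load_b)  # 4 + 4
shared = shared_capacity(load_a, load_b)        # peak of the combined load
cost_share = shared / 2                         # pool cost split between two systems

print(dedicated, shared, cost_share)
```

If the two peaks fell in the same hour, shared_capacity would return 8 and the saving would disappear, which is why the argument depends on the peaks being spread out.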

Different systems peak at different times, especially when you consider a global capacity that spans the world's time zones. Scale this up to tens of thousands of systems (or more) and you have very real savings with an even load (that typically follows the Sun).
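
To get a feel for the scale, here is a deliberately idealised sketch. It assumes every system needs its peak for exactly one hour a day, is idle otherwise, and that peak hours are spread perfectly evenly around the clock. Real workloads overlap far more than this, so the real saving would be smaller; the function name and the numbers are made up for the example.

```python
def pooled_vs_dedicated(num_systems, peak=4, hours=24):
    """Compare capacity for systems whose single peak hour is spread evenly over the day.

    Assumes each system needs `peak` CPUs for one hour and is idle otherwise.
    """
    total_by_hour = [0] * hours
    for i in range(num_systems):
        total_by_hour[i % hours] += peak  # system i peaks at hour (i mod hours)
    dedicated = num_systems * peak        # everyone sized for their own peak
    pooled = max(total_by_hour)           # busiest hour of the shared pool
    return dedicated, pooled


dedicated, pooled = pooled_vs_dedicated(24_000)
print(dedicated, pooled, pooled / dedicated)
```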

This is the fundamental business model of the cloud.

Cloud applications
The logical extension of providing infrastructure is providing the reason that infrastructure is needed in the first place: software. What's true for infrastructure applies even more so to software. Software costs money. It is expensive to build, it requires expensive IT infrastructure to run and it is expensive to maintain.

Software maintenance is approximately 80% of the cost of software across its lifetime. From bug fixing to feature enhancements, software maintenance is an ongoing, costly business.

Just as with cloud IT infrastructure, spreading the costs of software across a large user base creates economy of scale: The same software can be utilised by many independent users as opposed to each user developing their own copy.

It is therefore not surprising that with the very real cost benefits of software as a service (cloud applications), this is a fast growing sector in the IT landscape. Users simply get to use the software, exactly as you would a utility such as electricity. It automatically scales to requirements and issues such as backups and uptime are taken care of.

Dark lining
There is, however, a fatal flaw in the cloud paradigm. Let's imagine a system that provides sales information. This system makes data available to other systems, such as accounting, invoicing and management reporting (MI).

Changes to the Sales system therefore need to be carefully planned. It is unlikely that an upgrade to the Sales system would be permitted, for example, during year end preparation as an error in the feed to the accounting system could have a severe business impact on the Year End activity.

IT departments typically need to plan changes to integrated systems carefully, taking into consideration not only the changes themselves but also their impact on the wider ecosystem. Scheduling upgrades is as much a business decision as it is a technology decision.

Grey applications
Most IT services are integrated. Typically this is via a published API or by the provision of feeds to other systems. This integration is usually multi-level, with systems integrated with the source system themselves integrating into a wider ecosystem and so on. In our example, the invoicing system might itself provide feeds to an invoice printing service and also a VAT accounting system. The VAT system could, in turn, interface to the public service provided by the UK government's HMRC department, a third party, independent system that is itself part of a very large IT ecosystem and integral to the entire British economy.

These direct integrations are typically well defined and documented within IT departments.

However, a data warehouse that pulls data from the Sales system, the invoicing system and the MI system described in our example could expose data internally to the company via an API. For example, an accounts team may have spreadsheets that directly access the data warehouse and produce cash flow analysis. These spreadsheets could be freely propagated amongst the accounting department and, provided that the users pass authentication and security, they will have access to the data. Moreover, individual users may create further spreadsheets accessing the data warehouse information. These additional spreadsheets are ad hoc in nature. They are not a part of the core IT service but are valuable tools to other parts of the business.

These spreadsheets are an example of grey applications: Applications that are legitimate and authorised usage of the IT infrastructure's services, but are 3rd party to that infrastructure.

Grey applications are a significant class of software and IT usage in most organisations.

Now consider the implications of the Sales system being updated such that sales totals are no longer inclusive of VAT. Whilst this change would enable the MI reporting system (which is a part of the core infrastructure and hence known to the IT department) to be upgraded so that it continues to correctly produce its reports, the impact on grey applications is unknown.

Grey applications can themselves be data sources for other systems, which in turn can act as sources for other ecosystems. This is a tree of dependencies, i.e. the number of dependent systems can grow exponentially with each level.
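
A rough sketch of why a tree of dependencies gets out of hand so quickly. It assumes, purely for illustration, that every system feeds the same number of downstream systems (the fan-out) and that no two branches share a consumer. Real ecosystems are messier, but the growth pattern is the same.

```python
def downstream_count(fan_out, depth):
    """Total systems below the source in a tree where every system feeds `fan_out` others."""
    return sum(fan_out ** level for level in range(1, depth + 1))


# With a modest fan-out of 3, count everything within 1 to 5 levels of the source.
for depth in range(1, 6):
    print(depth, downstream_count(3, depth))
```

With a fan-out of 3, five levels already cover 363 systems, most of which the owner of the source system has never heard of.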

This situation is caused by making services and data available by public APIs, the very strength of the cloud. In fact, cloud services are designed to be "plugged" into other services and software to enable users to build complex IT solutions at commodity, off-the-shelf prices. Many successful cloud applications today are specifically designed to be the foundations upon which an entire IT infrastructure can be built.

I.e. they are built specifically to be upstream systems. They both connect directly to other systems and enable grey applications. In fact, it can be argued that in the cloud, ALL downstream services are grey applications.

Now consider upgrades to cloud applications.

Whereas previously the IT department worked with the business to schedule changes to key systems at times of reduced risk to the business (our year end example) and planned and scoped changes with the entire ecosystem in mind (our connected systems example), with cloud services and a vast, shared online audience, this is simply not possible.

By their very nature, cloud systems cannot behave as internal IT departments do. They cannot liaise with and work in concert with their end users and the associated business users to plan, schedule and participate in upgrades or changes.

Risk mitigation strategies are not just unavailable, they are simply not a part of the solution.

This means that the impact of any faulty upgrade, or of any unintended consequence of a change (e.g. sales figures no longer including VAT), propagates downstream, and the number of affected systems increases exponentially within a few short levels of the connection tree. I.e. unintended consequences and faults are amplified by the downstream connected ecosystem.

Moreover, these changes can occur at business critical times, maximising the impact on businesses should things go wrong. In fact, given the very nature of the global reach of cloud solutions, it is almost guaranteed that the changes will be during business critical times (that would not have been scheduled by internal IT departments) for a proportion of clients.

I call this mechanism amplification because it multiplies up in scale the impact of changes.
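
Amplification can be made concrete by walking the feeds outward from the system that changed. The sketch below uses the systems from the example in this post, but the exact wiring in FEEDS is my own guess at a plausible layout, not a description of any real installation. It uses a breadth-first search to record how many levels downstream each affected system sits.

```python
from collections import deque

# Hypothetical "who feeds whom" map based on the example systems in this post.
FEEDS = {
    "sales": ["accounting", "invoicing", "mi_reporting", "data_warehouse"],
    "invoicing": ["invoice_printing", "vat_accounting", "data_warehouse"],
    "mi_reporting": ["data_warehouse"],
    "vat_accounting": ["hmrc"],
    "data_warehouse": ["cash_flow_spreadsheet"],
    "cash_flow_spreadsheet": ["adhoc_spreadsheet"],
}


def affected_by(source, feeds):
    """Return {system: levels downstream of `source`} using breadth-first search."""
    depth = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for consumer in feeds.get(current, []):
            if consumer not in depth:  # first visit is the shortest path
                depth[consumer] = depth[current] + 1
                queue.append(consumer)
    del depth[source]  # the source itself is not "downstream"
    return depth


impact = affected_by("sales", FEEDS)
print(len(impact), impact)
```

The IT department would probably list only the direct feeds. The spreadsheets at the bottom of the map only show up if somebody thinks to record them.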

A previous global example
The global financial crisis that began in 2007 with a credit crunch was the result of debtors in a few states in America having repayment difficulties. The connected nature of the world's financial system meant that over a two year period, what began as isolated incidents in the USA compounded into a global crisis that threatened the existence of the Eurozone and affected every nation on the planet.

In the cloud, time is measured in microseconds, not months or years.

Feedback into the loop
Because the cloud is highly interconnected, problems that propagate through it can return to the system they came from. The connections online do not form a straight line from a source system, through a series of intermediate systems, to terminating end systems. There are many interconnections along the way. This means that a source system whose fault propagates to downstream systems can, at some point, itself receive input from one of those downstream systems. I call this feedback.
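
In graph terms, feedback means that the feed map contains a cycle that passes through the source. A minimal check, assuming the feeds are known and written down as a dictionary (which, for grey applications, is exactly what we usually do not have), could look like this. The system names are placeholders.

```python
def feeds_back(source, feeds):
    """Return True if data leaving `source` can reach `source` again."""
    seen = set()
    stack = list(feeds.get(source, []))  # start from the direct consumers
    while stack:
        current = stack.pop()
        if current == source:
            return True  # we arrived back where we started: a feedback loop
        if current in seen:
            continue
        seen.add(current)
        stack.extend(feeds.get(current, []))
    return False


linear = {"a": ["b"], "b": ["c"]}              # a -> b -> c, no way back
looped = {"a": ["b"], "b": ["c"], "c": ["a"]}  # c feeds a again

print(feeds_back("a", linear), feeds_back("a", looped))
```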

As cloud software adoption continues, the interconnections increase exponentially and feedback becomes more likely. This means that minor faults, through feedback, can evolve into major faults.

The combination of feedback and amplification means that faults not only spread more widely, but also become more severe.

Development slowdown
As cloud services gain users, upgrades become more risky, both in terms of the direct risk of failure and in terms of unintended consequences. Increasingly, online software vendors will need to be aware of, and plan for, grey applications, amplification, feedback and unintended consequences.

If providers are not going to risk disaster for their clients, they will need to take steps to protect against faults and unintended consequences. That means, for example, that instead of changing an API (e.g. version 1 -> version 2), the provider will run both versions in parallel, maintaining and supporting both. The same principle will have to apply to key services, data schema changes and other normal changes to software over its lifetime.
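
As a sketch of what running both versions in parallel might look like, take the earlier example of sales totals that stop including VAT. Everything here is hypothetical: the function names, the version labels and the 20% rate are assumptions for illustration, not any provider's actual API.

```python
from decimal import Decimal

VAT_RATE = Decimal("0.20")  # assumed rate, for illustration only


def sales_total_v1(net_amounts):
    """Original contract: the total is inclusive of VAT."""
    net = sum(net_amounts, Decimal("0"))
    return {"total": net * (1 + VAT_RATE)}


def sales_total_v2(net_amounts):
    """New contract: the total excludes VAT, which is reported separately."""
    net = sum(net_amounts, Decimal("0"))
    return {"total": net, "vat": net * VAT_RATE}


# Both versions stay live; callers choose, and nobody is switched silently.
HANDLERS = {"v1": sales_total_v1, "v2": sales_total_v2}


def sales_total(version, net_amounts):
    """Dispatch on the version the caller asked for."""
    return HANDLERS[version](net_amounts)


orders = [Decimal("100.00"), Decimal("50.00")]
print(sales_total("v1", orders))
print(sales_total("v2", orders))
```

A grey spreadsheet that was written against v1 keeps getting VAT-inclusive totals until its owner chooses to move. The price is that the provider now has two code paths to maintain and test.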

This means that software maintenance becomes more complicated and requires more resources. Changes become bigger projects to implement and testing requirements increase.

It may appear from this posting that cloud services are a bad idea or a national or global threat. However, cloud services are enablers, commoditising IT services and enabling users who previously did not have the infrastructure, or could not afford it, to benefit from advanced, strategically important, heavyweight IT solutions.

The cloud is also highly competitive, forcing solutions providers to constantly innovate. This represents a serious IT spend benefit for clients without having to actually spend the money.

The principles of amplification and feedback don't only apply to faults and the negative effects of changes. These same mechanisms serve to multiply the benefits of improvements and progress, delivering great return on investment and providing ever increasing leverage.

Cloud services are here to stay and will continue to grow both in capability and in strategic importance. They are changing, and will continue to change, every aspect of our lives, both private and corporate.

We therefore need to add the concepts of grey applications, feedback and amplification to our vocabulary, be aware of them, propagate these ideas and plan for the events and implications that they warn of.

With suitable planning, awareness, inclusion in education and possible regulation, we will be able to use these principles to leverage cloud advantages while defending against the potentially catastrophic consequences of blindly stumbling down the cloud path unaware of this new IT reality that we have created.

The cloud has changed the computing landscape. We need to ensure that we have a suitable understanding of this new land.
