When the Cloud Fails ...

I’m excited about the cloud. Recently, I’ve done some research on Amazon Web Services, especially Elastic MapReduce, which allows you to set up your Hadoop cluster in minutes as opposed to weeks or even months. This essentially means that in just a few minutes you can provision an infrastructure of 10-20 or more machines that crunch terabytes of data per hour. And when you’re done with it, you just click a button, the cluster goes away and you stop getting billed for it. And this is just the beginning. With a few more clicks you can have your own relational database stack, with redundancy, automatic snapshots and a high-speed cache of its own. Or you can have endless storage on S3 (or archived storage using Glacier). You can get your own content delivery network (CDN) serving your website data from all over the world, lowering response times and taking the load off application servers. You can build fault-tolerant architectures by balancing requests across availability zones (you basically keep one set of machines on the East Coast and another set of stand-by replicas on the West Coast or in Europe).

All in all, the possibilities are endless. Your mind hurts just thinking about what you can accomplish using these tools. You basically have access to what multinational companies only dreamed of in terms of IT infrastructure. The best part is you get charged by the hour, so if you want to scale down or close shop the exit cost is almost zero.

But it’s not all fun and games.

The over-hyped dream cloud has the potential to become one giant storm when you’re not looking. Here are a few things you need to consider:

Downtime and Service Level Agreement

The Amazon instances (EC2 – elastic compute) or the Volume Storage (EBS – elastic block storage) you’re renting aren’t bulletproof. They can fail. They can be wiped out of existence with little or no warning. It’s not the “cute” sort of failure either: you cannot just reboot the machine and hope everything is OK. No, it’s the “whatever data you haven’t saved into S3 storage disappears” sort. There’s no backup by default. There’s no fail-over by default. And as far as Amazon’s Service Level Agreement goes, they’re pretty much covered: if you have unplanned downtime of over 0.05%, you can ask for a 10% refund at the end of the year.
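To put that percentage into perspective, here is a minimal sketch of the arithmetic. The 0.05% figure is the one quoted above and has not been checked against the current SLA text; the function name is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(downtime_fraction, period_hours=HOURS_PER_YEAR):
    """Hours of downtime that correspond to a given fraction of the period."""
    return downtime_fraction * period_hours


# 0.05% is the threshold quoted in this post, expressed as a fraction.
threshold = allowed_downtime_hours(0.05 / 100)
print(f"{threshold:.2f} hours of downtime per year")
```

In other words, a few hours of outage per year can pass without any refund being due, and the refund itself is only a fraction of the bill, not compensation for lost business.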

This means that you have to employ several strategies to make sure (a) your data is safe and (b) your service doesn’t suffer from downtime. Any and all of which will cost extra.

I’m writing down a few strategies, also marking the level of paranoia you should reach to employ them. You should also know that the more of these strategies you employ, the lower your operational risk, but the greater the effort and the higher the AWS cost.

(cautiously optimistic) Periodically back up EBS Volumes into S3. Set it up to run automatically.
(cautiously optimistic) Set a backup policy for any Relational Database Service you may be running.
(slightly paranoid) Make sure your production EC2 instances can run in a load-balanced environment. Set up load balancing between enough EC2 instances so that fail-over can occur automatically and the remaining EC2 instances can handle peak traffic.
(paranoid) Set up load balancing across availability zones. This would protect you from downtime of an entire Amazon data center (which has been known to happen, again and again).
(paranoid) Set up EBS and RDS backups across availability zones. This is the equivalent of disaster recovery.
(paranoid with flying colors) Use a disaster simulation tool, such as Chaos Monkey (open sourced by Netflix), to make sure your architecture can tolerate random failures in the infrastructure.
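The "remaining instances can handle peak traffic" condition from the load balancing items can be checked with simple arithmetic. This is a rough sketch, assuming every instance handles the same load; the function names, zone names and numbers are all hypothetical.

```python
def survives_instance_loss(n_instances, capacity_each, peak_load, failures=1):
    """True if the instances left after `failures` losses still cover peak load."""
    remaining = max(n_instances - failures, 0)
    return remaining * capacity_each >= peak_load


def survives_zone_loss(instances_per_zone, capacity_each, peak_load):
    """True if peak load is still covered after losing the largest zone."""
    total = sum(instances_per_zone.values())
    worst_case_loss = max(instances_per_zone.values(), default=0)
    return (total - worst_case_loss) * capacity_each >= peak_load


# Hypothetical numbers: each instance serves 500 requests/s, peak is 1000.
print(survives_instance_loss(3, 500, 1000))                      # one spare instance
print(survives_zone_loss({"zone-a": 2, "zone-b": 2}, 500, 1000))  # even split
print(survives_zone_loss({"zone-a": 3, "zone-b": 1}, 500, 1000))  # lopsided split
```

The lopsided split is the interesting case: four instances look like plenty, but losing the zone that holds three of them leaves a single instance facing peak traffic.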

Estimating the cost for any of these levels of architecture and infrastructure robustness is no easy feat. It depends on what the use case is. Picking a level of fault tolerance depends on the type and size of your business or application and basically comes down to estimating loss/hour or loss/day in case of a failure.
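As a back-of-the-envelope illustration of the loss/hour reasoning, the sketch below compares the loss a protection measure avoids with what it costs. Every number is invented for the example, and the model deliberately ignores things like reputation damage or data that can never be recovered.

```python
def expected_annual_loss(downtime_hours_per_year, loss_per_hour):
    """Expected yearly loss from downtime, in the same currency as loss_per_hour."""
    return downtime_hours_per_year * loss_per_hour


def worth_paying_for(extra_annual_cost, downtime_without, downtime_with, loss_per_hour):
    """True if the downtime avoided is worth more than the protection costs."""
    avoided = expected_annual_loss(downtime_without - downtime_with, loss_per_hour)
    return avoided > extra_annual_cost


# Hypothetical shop: loses 500 per hour of downtime, expects 20 hours of
# downtime a year unprotected and 2 hours with a fail-over setup that
# costs 6000 a year.
print(worth_paying_for(6000, downtime_without=20, downtime_with=2, loss_per_hour=500))
```

With these made-up figures the fail-over setup avoids 9000 in losses for 6000 in cost, so it pays off. Drop the loss per hour to 100 and it no longer does.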

There are no recipes for picking the right balance between cost efficiency and fault tolerance. This post most definitely isn’t aiming at providing the correct answer. The purpose is to get you to ask yourself the right questions.

Performance

The cloud infrastructure you’re renting from Amazon isn’t made up of physical machines. This means that whatever you’re running on your EC2 instance (operating system, services, application) will be subject to the performance penalty/overhead of the Xen hypervisor. You must understand that EC2 instances are actually virtual private servers and they are inferior in performance to a dedicated server.

Measuring the exact impact of the Xen hypervisor would of course depend on the application you’re running (and on the mix of CPU, RAM and I/O operations required), but there are benchmarks which report EC2 instances having CPU performance ten times slower than dedicated servers at equivalent prices (the article also reports that I/O operations are 5 times slower on EC2). I would take any such extreme results with a pinch of salt, but the point is that Amazon Web Services adds several types of overhead to the applications you’re running:

Virtualization overhead from the Xen hypervisor, which commonly impacts CPU performance
Network overhead of communication between the EC2 instance and the EBS volume, which are not necessarily in the same physical rack. There are ways around it and improvements are being made, based on RAID-ing several EBS volumes together and on provisioned IOPS (which means paying extra for better I/O performance)
Varying performance of EC2 instances (probably depending on the load of other instances provisioned on the same machine)
Varying performance of network communication between various AWS services (EC2 with S3, RDS, EC2)
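Rather than relying on someone else’s benchmark, you can get a first impression of the variability on your own instance by timing the same work repeatedly. This is a minimal sketch: the CPU-bound task is an arbitrary stand-in for your real workload, and a serious benchmark would also cover disk and network I/O.

```python
import statistics
import time


def cpu_task(n=200_000):
    """Arbitrary CPU-bound work; replace with something closer to your workload."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def benchmark(task, runs=10):
    """Time `task` several times and summarise how much the timings vary."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        timings.append(time.perf_counter() - start)
    mean = statistics.mean(timings)
    stdev = statistics.stdev(timings)  # needs at least two runs
    # Coefficient of variation: spread relative to the mean.
    return {"mean_s": mean, "stdev_s": stdev, "cv": stdev / mean}


if __name__ == "__main__":
    print(benchmark(cpu_task))
```

Running it at different times of day, and on a dedicated machine for comparison, gives a rough idea of both the mean overhead and how predictable the instance is. A high coefficient of variation is the "varying performance" from the list above showing up in your own numbers.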

To sum it up, there is a price to pay for scalability: that price can be paid either in terms of assuming risk (less predictable performance), in terms of assuming lower performance per instance (and consequently, less value for money) or by simply paying more. While AWS is a great service and runs a great business model, you should be aware that having virtually endless infrastructure at your fingertips is no free lunch and no silver bullet.

Cost of Scale

This is a piece of universal truth in economics: you have to pay extra for the opportunity to change your mind or to worry about something later. The banking and insurance industries are built around that. And the Amazon Web Services business model is built around that. AWS is to IT infrastructure what banking and loans are to money – you get whatever you want now, but you pay interest for the fact you didn’t plan ahead.

AWS solves several problems for you, thus taking away overhead associated with initial investment (i.e. buying your own hardware before you need it and financing it), planning (i.e. figuring out how much you need before you need it), IT infrastructure operations (i.e. someone has to plug the server into the rack, install the OS, patch it, update it), risk, asset management and exit costs (i.e. decommissioning and selling/auctioning off the hardware when you scale down). Moreover, the type of service AWS offers is indispensable for businesses which are subject to seasonal patterns (e.g. traffic peaks on Black Friday or the Winter/Spring Holidays, reaching 5-50 times the average day-to-day traffic).

However, this whole care-free attitude which AWS enables comes at significant cost, direct or indirect:

You pay more for one hour of CPU than if you rent equivalent or better hardware on a long-term commitment.
You get poorer performance out of an equivalent configuration because of virtualization overhead (less performance per dollar). You can find a cross-provider cloud performance benchmark here.
There may be hidden costs (or at least, not so obvious costs) such as getting charged for writing from instance to storage volume or getting charged for traffic across availability zones (data centers). Although these charges make sense from a business standpoint, they are easier to miss since most dedicated server providers don’t charge you for writing to your hard drive.
There may be higher hidden risks (or at least, not so obvious ones). If you buy a physical server and you don’t have a periodic backup procedure and preferably another stand-by or active fall-back server, on a long enough timeline you have a strong probability of having downtime and potentially losing data. This is universal, no matter if you use EC2 instances, dedicated servers or on-premise hardware. What is different is that the rate of failure for EC2 instances and EBS volumes is higher than for dedicated servers (Amazon reports an annual failure rate of 0.1-0.5%, but I found no conclusive studies to date). There are fall-back options (like this one), but you need to plan for them and you need to pay for them (paying for additional storage for backup images, paying for load balancing and for a secondary active/passive EC2 instance). So you either pay extra for eliminating this risk or you end up like this guy.
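Two of the points above lend themselves to a quick calculation: when hourly billing stops being cheaper than a commitment, and how a small annual failure rate adds up across many volumes. The sketch below uses invented prices, and the failure model assumes volumes fail independently of each other, which a data center outage would violate.

```python
def break_even_hours(hourly_rate, committed_monthly_price):
    """Hours per month above which the committed price beats hourly billing."""
    return committed_monthly_price / hourly_rate


def prob_at_least_one_failure(annual_failure_rate, n_volumes):
    """Chance that at least one of n volumes fails within a year.

    Assumes independent failures, which is a simplification.
    """
    return 1 - (1 - annual_failure_rate) ** n_volumes


# Hypothetical prices: 0.50 per hour on demand vs. 200 per month committed.
hours = break_even_hours(0.50, 200)
print(f"Commitment pays off above {hours:.0f} hours per month")

# Upper end of the 0.1-0.5% range quoted above, across 100 volumes.
risk = prob_at_least_one_failure(0.005, 100)
print(f"Chance of at least one volume failing in a year: {risk:.0%}")
```

With these numbers, a machine that runs more than 400 hours a month (a bit over half the time) is cheaper on a commitment. And while 0.5% sounds negligible for one volume, across a hundred of them the chance of seeing at least one failure in a year is roughly 39%.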

The point is not that “the cloud is evil”. The point is that it has different limitations, costs and risks than one would initially imagine. It is most definitely more suitable for some businesses than for others. Before jumping into the cloud, one should account for the actual business needs in terms of availability and performance, consider reducing cost either by making a long-term commitment or by taking acceptable risk and, most importantly, architect infrastructure and operations so as to achieve fault tolerance.

Why Should I Care?

You might think you don’t care. You’re not building cloud apps. You’re not designing cloud architectures. What you do for fun and profit might have nothing to do with IT. Right?

Wrong.

It has more and more to do with it.

The more you store your pictures online on Instagram or Deviant Art instead of burning them on DVDs. The more you keep your contacts and your connections on Facebook and Google. The more you use Gmail or Yahoo. The more you have your finances and your documents online on Google Docs or Office 365. The more you move stuff out of your house and out of your computer and out of your business into the cloud, the more you do care.

The fact that I did research on Amazon Web Services lately is purely a coincidence. I came up with the idea for this post after one very simple thing happened: some of the documents from my Google Drive disappeared. For most of them, I had no backup. Fortunately, they got restored a few months after I reported the issue. But it was enough to get me thinking.
