Rightsizing Your AWS Cloud Infrastructure: A Rumination

I am currently engaged in a mid-to-long-term project to rightsize the cloud infrastructure at work. During our rapid-change phase, moving from co-located infrastructure to the public cloud, the priority was to get things done quickly and minimise disruption while shifting services into the cloud. Consider the things that cloud infrastructure should be:

  • Cheap
  • Fast
  • Scalable
  • Resilient
  • Reliable*

During our migration phase, we weren’t too bothered about cheap. We wanted everything else, but the price ticket was… flexible, within limits. Now, some months later, we consider a number of our core services stable and mature, which makes them prime candidates for aggressive cost optimisation. Cost optimisation in this case can mean a few different things.

Instance sizing

First, getting the instance types right for efficiency. There’s no point running on a pair of c4.4xlarge instances if you only average 1% CPU usage over a given week. Depending on the app’s workload, you may be better off running on three or four m3.mediums, or two c4.larges, or even a pair of burstable t2s. You want to be at minimum size during minimum usage, and scale up to meet demand without interrupting service. It’s all very much workload-dependent and there’s a hint of the dark art to it: finding out an app’s current usage profile and predicting how it will go in future can be a bit like reading tea leaves. Perhaps your limiting factor is memory, in which case you’ll want to select a memory-optimised instance type and tune downwards until you’ve reached the best cost-to-performance ratio. Perhaps CPU is the problem, in which case choose compute-optimised c4 instances. Maybe I/O is the problem, in which case you may want to tune your EBS volumes, or even use a high-memory instance and employ RAM disks. All of this is service-specific and requires a lot of delving into performance metrics to get right, but the information is there if you need it (a sketch of that sort of digging follows below).
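To make that less abstract, here’s a minimal sketch of the kind of digging I mean, in Python with boto3: pull a week of CPU figures for an Auto Scaling group out of CloudWatch and summarise them. The ASG name and region are placeholders, not anything from our actual setup.

    # Minimal sketch: summarise a week of CPU utilisation for an Auto Scaling
    # group from CloudWatch. The ASG name and region are placeholder values.
    from datetime import datetime, timedelta

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="ap-southeast-2")

    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": "my-service-asg"}],
        StartTime=datetime.utcnow() - timedelta(days=7),
        EndTime=datetime.utcnow(),
        Period=3600,  # one datapoint per hour
        Statistics=["Average", "Maximum"],
    )

    datapoints = response["Datapoints"]
    weekly_avg = sum(d["Average"] for d in datapoints) / max(len(datapoints), 1)
    weekly_max = max((d["Maximum"] for d in datapoints), default=0.0)

    print(f"7-day average CPU: {weekly_avg:.1f}%, peak hourly max: {weekly_max:.1f}%")

If the weekly average is hovering around 1% and even the hourly peaks barely register, that pair of c4.4xlarges is doing very little to earn its keep.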

Tenancy

So getting the instance size right is one thing. There’s another way to get more bang for your buck, and that’s to co-tenant services. Early in the migration, each Domain microservice had its own isolated cluster, which was great. During the migration. Nowadays, we can move services into shared clusters, divided along functional or organisational lines, or according to usage profile. The way you’ll want to split them up is, again, a dark art. Splitting them along organisational lines means that in the event of a cluster outage you’ll only be waking up a single team, since only their functional area is impacted. But it may also mean that the resource profiles of the apps you’re co-tenanting aren’t a perfect fit.

Conversely, you may want to group together apps that serve REST responses, and apps that serve websites. This may work if you don’t have a common healthcheck endpoint for your ELBs to hit; with a classic ELB there’s one health check per load balancer, so everything co-tenanted behind it needs to answer the same target (there’s a sketch of that below). You may want to co-tenant offline processing tasks on instances used by customer-facing web apps, so that slack overnight CPU time gets put to use. Maybe the potential impact is low enough to do that. Maybe it’s not. Only the metrics will tell you either way.
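For what it’s worth, here’s a hedged sketch of that healthcheck side of things, again in Python with boto3. The load balancer name, port and path are placeholders, and it assumes a classic ELB rather than an ALB.

    # Sketch: point a classic ELB health check at a shared path that every
    # co-tenanted app on the cluster is expected to serve.
    # Load balancer name, port and path are placeholder assumptions.
    import boto3

    elb = boto3.client("elb", region_name="ap-southeast-2")

    elb.configure_health_check(
        LoadBalancerName="shared-rest-cluster",
        HealthCheck={
            "Target": "HTTP:80/healthcheck",  # the common endpoint all apps answer
            "Interval": 30,
            "Timeout": 5,
            "HealthyThreshold": 2,
            "UnhealthyThreshold": 3,
        },
    )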

Thresholds

Another way to optimise is to tune your scaling thresholds appropriately. This, I find, is a particularly dark dark art. Our apps sometimes start to struggle above a given memory or CPU threshold, and that threshold can differ from app to app, or even from deployment to deployment, and keeping on top of it is tricky. You may want to scale up a cluster at 80% average CPU usage, but in the few minutes it takes the new instances to provision, you may hit 100% and suffer a performance impact. You may want to scale at 50% CPU, but at that point you’re ‘wasting’ half your CPU cycles. You may want to scale up when a queue of work reaches a critical level, in which case you may need to write a monitor program to keep an eye on the queue (a sketch of one follows at the end of this section). There are lots of implications here. You may even have great results with one set of thresholds one week, then a team will deploy changes and suddenly the app is no longer happy. This one is, I find, tricky.

So: thresholds, tenancy and instances. Three areas where there may be fat to be trimmed. How do we figure this out?
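As promised, here’s roughly what that queue monitor might look like: a minimal sketch that assumes the queue happens to be SQS, and publishes the backlog depth as a custom CloudWatch metric that a scaling alarm can act on. The queue URL, namespace and metric name are all placeholders.

    # Sketch of a queue monitor: read the backlog from an SQS queue and publish
    # it to CloudWatch as a custom metric for a scaling alarm to act on.
    # Queue URL, namespace and metric name are placeholder assumptions.
    import boto3

    REGION = "ap-southeast-2"
    QUEUE_URL = "https://sqs.ap-southeast-2.amazonaws.com/123456789012/work-queue"

    sqs = boto3.client("sqs", region_name=REGION)
    cloudwatch = boto3.client("cloudwatch", region_name=REGION)

    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

    cloudwatch.put_metric_data(
        Namespace="MyTeam/WorkQueue",
        MetricData=[{
            "MetricName": "BacklogDepth",
            "Value": backlog,
            "Unit": "Count",
        }],
    )

    print(f"Published backlog depth: {backlog}")

Run that on a schedule, put an alarm on the metric and wire the alarm to a scaling policy, and the ‘critical level’ becomes a number you can tune rather than a gut feeling.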

Monitoring

A new AWS feature that has proved worth its weight in gold during this project is CloudWatch Dashboards. By creating dashboards for functional areas, we’ve been able to profile the resource usage of our microservices and websites, and easily spot underutilised clusters. We can then, using the Robot Army, consolidate the Octopus Deploy roles, change the instance sizes and tweak our thresholds to achieve more efficient resource usage, all in a matter of minutes. All that’s needed is to quickly test that the new cluster is responding correctly and flip a Route53 DNS record across. We can then tear down the old, underutilised cluster and enjoy the savings.
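Those dashboards can be click-built in the console, but they can also be defined through the CloudWatch API, which makes it easy to stamp out one per functional area. A minimal sketch using boto3’s put_dashboard; the dashboard name, ASG name and region are placeholders.

    # Sketch: define a CloudWatch dashboard for a functional area via the API,
    # with a single widget graphing CPU for one Auto Scaling group.
    # Dashboard name, ASG name and region are placeholder assumptions.
    import json

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="ap-southeast-2")

    dashboard_body = {
        "widgets": [
            {
                "type": "metric",
                "x": 0, "y": 0, "width": 12, "height": 6,
                "properties": {
                    "title": "my-service CPU",
                    "region": "ap-southeast-2",
                    "stat": "Average",
                    "period": 300,
                    "metrics": [
                        ["AWS/EC2", "CPUUtilization",
                         "AutoScalingGroupName", "my-service-asg"],
                    ],
                },
            },
        ],
    }

    cloudwatch.put_dashboard(
        DashboardName="my-functional-area",
        DashboardBody=json.dumps(dashboard_body),
    )

Add a widget per service to the one dashboard and the underutilised clusters tend to jump out at a glance.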

The Dark Side of the Dark Arts

Of course, there is a dark side to all this fat trimming, and that’s this: Fat can be insulation. Fat, in a way, can be a kind of armour. What do I mean by that? Well, it’s as it sounds. If you’re over-scaled, sudden spikes in traffic are less likely to catch you by surprise. You’ll pay a cost, but it may save your bacon one day. As I put it in an internal email recently:

…rightsized clusters have less fat to absorb occasional bullets.

Notwithstanding that fat doesn’t actually make you bulletproof, an extra layer of padding really can insulate you from the dangers of the world, with the obvious downsides that it’s going to cost you more and that people will talk about you behind your back when you mention it at DevOps conferences.

This last element of the process really does lie in recognising where bullets may come from, and that, of all four dark arts, is the darkest.

To solve this, you really do need to be doing the whole cultural side of DevOps, not just being a DevOpsy tech team within a larger organisation. Your marketing team, for example, need to understand that new advertising splashes may result in traffic spikes, and they need to know who to contact when they schedule commercials in so that your customer-facing websites don’t crumble into dust as a result. Your developers need to know what a given Product Manager or Sales Director (even one outside their area) has planned in terms of product launches and what their dependencies are. Your HR people need to know about Engineer and Developer workload and resource accordingly, your Upper Management needs to communicate strategic direction to everyone and, generally speaking, information needs to flow both fast and far.

Metrics need to flow back into the organisation from the delivery teams, and strategic information needs to flow in the other direction. And everyone needs to understand that you’re all in the same boat, and that all rowing in the same direction will get you there faster.

And that’s the real trick to DevOps.

If you’ve solved that, well… you probably deserve a beer.        

 

* “Resilient” and “Reliable” may look like synonyms, but in this case, they mean two different things:

  • “Reliable” means that in normal operation, the infrastructure is predictable and stable.
  • “Resilient” means that when a failure happens, the infrastructure can tolerate it and recover.
