My firm recently wrapped up a project for an Austin-based sports media company that runs its infrastructure exclusively on Amazon Web Services (AWS). Like many other small businesses and start-ups, they found that using the cloud for their software deployments was a no-brainer. The cloud makes capital expenditure a non-issue, reduces risk, and shortens time-to-market. Big wins all around, and at a cost that compares very well with the build-your-own-server-farm alternative.

But that cost can creep up (or explode) over time, partly as a result of just how easy it is to create and expand in the cloud. When we got in touch with this client to discuss their challenges, the first thing that jumped out was the size of their monthly AWS invoice, which was far larger than their company size and traffic levels warranted. After some deliberation, we promised to work with them to cut at least 15% from their monthly bill.

That promise turned out to be far too conservative. In 10 days, we slashed 35% off their AWS invoice – and there’s still some fat to be trimmed.

How is it possible that a small company can have that much excess spending in the cloud? Isn’t the cloud supposed to help us be lean, on-demand, just-in-time, and efficient?

Yes, in theory. But…

“Small” expenses add up quickly without oversight

The cloud makes it very easy to add a server here, a server there, and another couple of servers in the corner over there – and before you know it, you’re staring at hundreds or thousands of dollars per month in expenses that weren’t part of the plan.

It used to be the case that getting some hardware required talking to the IT department, going through the budgeting process, and finally getting approval from at least one manager holding the purse strings. But when you have independent decision-makers (frequently, front-line engineers) responsible for their own server requisitioning – and nobody watching closely – it’s nearly guaranteed that waste will happen. What seems like “no big deal” to one engineer becomes a really big deal when multiplied by a dozen such engineers, a handful of contractors, and the four people in marketing who each need their own WordPress site.

Let’s put some hard numbers on this point to drive it home. As of this writing, an m3.large instance from AWS (for the uninitiated: a modest server) costs a little over $95/month. If you have just 5 engineers who each feel that they need 2 of these servers to do their jobs, you’ve just spent almost $1,000/month.

Or, more frightening, if just one engineer feels that it’s no big deal to run an r3.2xlarge instance (read: a beefy server, but not a supercomputer) for a testing environment, he’s just signed you up for a $500/month bill. Thanks to the ease of the cloud, he made this decision and executed on it in less than 5 minutes with no oversight.
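Here’s the back-of-the-envelope arithmetic behind those figures, as a quick Python sketch. The hourly rates are rough approximations inferred from the numbers above (about $0.133/hour for an m3.large, about $0.70/hour for an r3.2xlarge) and will vary by region and over time:

    # Rough monthly run-rate for always-on instances.
    # Hourly rates are approximate on-demand prices, not quotes.
    HOURS_PER_MONTH = 730  # ~365 days * 24 hours / 12 months

    def monthly_cost(hourly_rate, instance_count=1):
        """Monthly cost of running `instance_count` instances around the clock."""
        return hourly_rate * HOURS_PER_MONTH * instance_count

    # 5 engineers x 2 m3.large instances each (~$0.133/hour apiece)
    print(monthly_cost(0.133, instance_count=10))  # ~$971/month

    # One "no big deal" r3.2xlarge test box (~$0.70/hour)
    print(monthly_cost(0.70))                      # ~$511/month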

Your employees aren’t malevolent. They see a cost like “$0.70 per hour” and think to themselves, “70 cents is easily justified for this important project.” There are some who even multiply the numbers out and still say, “500 dollars per month is a no-brainer for this critical test environment.”

And they might be right sometimes. But at the point that your small team has collectively dropped $10,000/month on “critical” hardware, are you sure they’ve made the right call more often than not?

Developer incentives don’t consider efficiency

Put yourself in a developer’s shoes. Your bosses and peers measure your abilities on two dimensions: “How quickly do you solve problems?” and “Do your solutions work?” So when you’re faced with a choice like the following, which option will you take?

  1. Spend $1/hour on a server that is 100% certain to be big enough for my project. Requisition it in 5 minutes and never think about it again, spending all of my time on beating the milestone.
  2. Spend $0.25/hour on a server that is 80% certain to be big enough for my project. On the 20% chance that the server isn’t big enough, do performance profiling to understand why, rewrite portions of the code, and/or go to my boss to discuss the tradeoff of spending more on bigger hardware versus spending more time on development.

The question is rhetorical, of course – you’d be a fool to not choose option #1 if there are no incentives to push you toward #2. Why risk your professional reputation for money that isn’t yours?
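To see just how lopsided that choice looks from the developer’s seat, here’s an illustrative expected-cost sketch in Python. Every figure in it is an assumption invented for the example (the rework time, the fully-loaded engineer rate), not data from a real project:

    HOURS_PER_MONTH = 730

    # Option 1: the big server -- guaranteed to fit, no follow-up work.
    option_1_hardware = 1.00 * HOURS_PER_MONTH              # $730/month

    # Option 2: the small server -- 80% chance it fits.
    p_too_small   = 0.20
    rework_hours  = 16    # assumed profiling/rewrite time if it doesn't fit
    engineer_rate = 75    # assumed fully-loaded cost per engineer-hour
    option_2_hardware = 0.25 * HOURS_PER_MONTH               # ~$182/month
    option_2_expected = option_2_hardware + p_too_small * rework_hours * engineer_rate

    print(option_1_hardware)   # 730.0
    print(option_2_expected)   # 422.5 -- cheaper in expectation for the company

    # The catch: the hardware bill lands on the company, while the rework and
    # the schedule risk land on the developer. With no counter-incentive, #1 wins.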

Making matters worse, once spending has started, there’s no incentive to backtrack and eliminate the waste. Post-development incentives are tied to uptime, quality, and customer satisfaction, all of which trade off, at least in part, against hardware spending. You want your servers to handle spikes in traffic. You want the servers to perform well for all of your customers. Nobody is talking to developers about cost efficiency, so nobody is going to downsize once the solution is in production. Even if the risk of downsizing is small, why take it?

It all works well if somebody has a budget in advance, understands the technical limitations and cost/performance trade-offs, and is asking hard questions along the way. Without that person, the incentives you’ve given your team will in turn deliver ever-escalating costs.

A lack of cloud expertise causes suboptimal decision-making

While pricing structures for cloud providers are straightforward in the main, they’re full of intricacies and options that require expertise and a spreadsheet to optimize.

One example: AWS Reserved Instances (RIs). These provide a discount over on-demand pricing in exchange for an up-front commitment to pay for server time. At a glance, they’re simple enough – commit up-front, get a big discount. But many companies make a critical error here. They assume they’re purchasing blocks of instance-hours (e.g., 2 servers’ worth of time spread across 1 month), when in reality they’re purchasing discounts on concurrently-running instances (e.g., a discount on up to 2 concurrently-running instances at any given moment). So if you run 4 instances at a time on about half the days in a typical month, and buy 2 RIs to cover that need, then congratulations – you’ve wasted a lot of money. Zing!
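A simplified model of that mistake makes the waste concrete. The hour counts below are illustrative (real RI terms, discounts, and billing granularity vary), but the mechanics match the scenario above:

    # RIs discount up to N *concurrently running* instances, hour by hour.
    # They are not a bank of instance-hours to be spent whenever you like.
    HOURS_PER_MONTH = 730
    reserved           = 2                      # RIs purchased
    concurrent_when_up = 4                      # instances running on busy days
    running_hours      = HOURS_PER_MONTH // 2   # up roughly half the month

    # In any given hour, only min(running, reserved) instances get the discount.
    discounted_hours = min(concurrent_when_up, reserved) * running_hours
    full_price_hours = (concurrent_when_up - reserved) * running_hours
    idle_ri_hours    = reserved * (HOURS_PER_MONTH - running_hours)

    print(discounted_hours)  # 730 instance-hours actually discounted
    print(full_price_hours)  # 730 instance-hours still billed at on-demand rates
    print(idle_ri_hours)     # 730 RI-hours paid for but never used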

Another example: Elastic Load Balancers (ELBs), AWS’s quick-fix solution for load balancing traffic across your web/application servers. Many developers assume that the pricing scales linearly with the amount of traffic served, and at high traffic volumes that’s roughly true – but the cost graph doesn’t start at zero. The fine print is that each ELB costs about $18 per month even if it serves no traffic at all. That misunderstanding leads to dozens upon dozens of unused ELBs sitting in development, testing, and production accounts, racking up hundreds of dollars per month in charges. Ouch!
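The cheap fix is to hunt down the idle load balancers and delete them. Here’s a minimal sketch using boto3 (assuming it’s installed and your AWS credentials are configured) that flags any classic ELB with zero requests over the past two weeks:

    # Flag classic ELBs that have served zero requests in the last 14 days.
    from datetime import datetime, timedelta, timezone
    import boto3

    elb = boto3.client("elb")              # classic ELB API
    cloudwatch = boto3.client("cloudwatch")
    now = datetime.now(timezone.utc)

    for lb in elb.describe_load_balancers()["LoadBalancerDescriptions"]:
        name = lb["LoadBalancerName"]
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/ELB",
            MetricName="RequestCount",
            Dimensions=[{"Name": "LoadBalancerName", "Value": name}],
            StartTime=now - timedelta(days=14),
            EndTime=now,
            Period=86400,          # one datapoint per day
            Statistics=["Sum"],
        )
        requests = sum(point["Sum"] for point in stats["Datapoints"])
        if requests == 0:
            print(name, "- no requests in 14 days; candidate for deletion")

Every one you delete is roughly $18/month back, with zero impact on anything that matters.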

These are just two examples. There are dozens of other common traps: Elastic Block Store (EBS) volumes and their (expensive) provisioned IOPS options, instance sizes with widely differing cost-per-performance, improper auto-scaling policies that can cause significant over-spend in the blink of an eye, and so on.

All of this is to point out that, if you don’t have a cloud infrastructure expert, you’ve probably been making suboptimal decisions for a while.

You’re not alone

As I’ve mentioned, these are common problems for cloud infrastructure buyers, and I’ve seen massive over-spend in start-ups, small companies, and post-IPO tech organizations. So if you’re suspicious that your bill has grown abnormally large, your hunch is probably correct.

The good news is that, like the client I mentioned above, you can cut these costs very quickly with the right expertise and some tenacity. I’m happy to provide advice – free! – if you drop me a line, or contact me through my firm’s site.