Talk Slides: Distributed Computing, the CAP Theorem, and How to Improve System Architectures

Lots of companies - especially in the non-startup world - are starting to look closely at upgrading their legacy systems to the "next generation" - services, scalability, NoSQL, etc. Most of these systems have existed, in some form or fashion, for decades and are beginning to impede the business' ability to handle new customer demands - especially around time-to-market and ultra-slow workloads that are experiencing poor performance.

Whether you are creating a new, distributed, architecture or simply improving an existing slow process, there are complexity concerns that you will have to deal with. It's better to understand these issues up-front and make accommodations for them before you get blindsided in the middle of a long-term project.

In the talk below, Nathan and I discussed some of the basics around distributed computing, architecture, and storage and introduce some of the issues and constraints around creating the next-generation architecture for your organization that will sustain you through the next decade.

CAP Theorem: Revisited

Note: Close to two months ago, I wrote a blog post explaining the CAP Theorem. Since publishing, I've come to realize that my thinking on the subject was quite outdated and is no longer applicable to the real world. I've attempted to make up for that in this post.

In today's technical landscape, we are witnessing a strong and increasing desire to scale systems out when additional resources (compute, storage, etc.) are needed to successfully complete workloads in a reasonable time frame. This is accomplished through adding additional commodity hardware to a system to handle the increased load. As a result of this scaling strategy, an additional penalty of complexity is incurred in the system. This is where the CAP theorem comes into play.

The CAP Theorem states that, in a distributed system (a collection of interconnected nodes that share data.), you can only have two out of the following three guarantees across a write/read pair: Consistency, Availability, and Partition Tolerance - one of them must be sacrificed. However, as you will see below, you don't have as many options here as you might think.

CAP Theorem Overview

  • Consistency - A read is guaranteed to return the most recent write for a given client.
  • Availability - A non-failing node will return a reasonable response within a reasonable amount of time (no error or timeout).
  • Partition Tolerance - The system will continue to function when network partitions occur.

Before moving further, we need to set one thing straight. Object Oriented Programming != Network Programming! There are assumptions that we take for granted when building applications that share memory, which break down as soon as nodes are split across space and time.

One such fallacy of distributed computing is that networks are reliable. They aren't. Networks and parts of networks go down frequently and unexpectedly. Network failures happen to your system and you don't get to choose when they occur.

Given that networks aren't completely reliable, you must tolerate partitions in a distributed system, period. Fortunately, though, you get to choose what to do when a partition does occur. According to the CAP theorem, this means we are left with two options: Consistency and Availability.

  • CP - Consistency/Partition Tolerance - Wait for a response from the partitioned node which could result in a timeout error. The system can also choose to return an error, depending on the scenario you desire. Choose Consistency over Availability when your business requirements dictate atomic reads and writes.

CAP Theorem Trade-offs CP

  • AP - Availability/Partition Tolerance - Return the most recent version of the data you have, which could be stale. This system state will also accept writes that can be processed later when the partition is resolved. Choose Availability over Consistency when your business requirements allow for some flexibility around when the data in the system synchronizes. Availability is also a compelling option when the system needs to continue to function in spite of external errors (shopping carts, etc.)

CAP Theorem Trade-offs AP

The decision between Consistency and Availability is a software trade off. You can choose what to do in the face of a network partition - the control is in your hands. Network outages, both temporary and permanent, are a fact of life and occur whether you want them to or not - this exists outside of your software.

Building distributed systems provide many advantages, but also adds complexity. Understanding the trade-offs available to you in the face of network errors, and choosing the right path is vital to the success of your application. Failing to get this right from the beginning could doom your application to failure before your first deployment.

2014 Cloud Scale Challenge

How do you get around 20 people to invest crazy amounts of personal time and energy into learning technologies that are important to know, but do not directly impact their day-to-day work lives? Turn it into a competition!

Over the past several months, eight teams from the Pariveda Solutions Dallas office have worked tirelessly to design and implement a cloud-based architecture that a perspective client could be interested in, today. Each team had to meet a minimum set of requirements (more on that below), as well as implement their cloud architecture in both AWS and Microsoft Azure. We called this event the 2014 Cloud Scale Challenge - the first of its kind.

Last week, we hosted Demo Day where each team had an opportunity to show the group what they built and try to beat the competition in a measure of cost vs. performance. We had some interesting solutions, approaches, and lessons learned that are outlined below.

2014 Cloud Scale Challenge Demo Day

The Rules

The goal of this competition was to have each team design a simple E-Commerce application that can process large transaction volumes. This involved creating a suite of services to power the application, as well as a simple front-end to demonstrate functionality. Finally, each team had to come up with their own method of proving performance.

Each team was required to successfully meet these requirements on both AWS and Azure, but they get to choose which solution to use during the competition. To make things even more interesting, each cloud solution had to consist of at least three services from each provider (EC2, S3, etc.)

User Features - Create a website that performs the following actions

  • List products (100+ unique SKUs)
  • Search for a product by name or SKU
  • Create an order
  • Add an item to an existing order
  • Submit order

Service SLAs- The system must handle the following request volumes

  • 10,000 concurrent product search requests (10,000/second)
  • 30,000 add item requests per minute (500+/second)
  • 3,000 submit order requests per minute (50+/second)

Performance SLAs

  • All orders and line items must persist to the backing store within 10 seconds of the request
  • OK response must come within 500 milliseconds of the request
  • Search results must return to the user in less than 1 second

The Scoring

The team that wins must have the best score in two of the three categories above. In the event of a tie, the leading teams must compete with their other solution.

Demo Day Rules

Technology Choices

I was surprised to find that the teams were split down the middle between competing with their AWS and Azure solutions. We were right at 50%. Here are the unique technology stack decisions made across the competition.

  • Azure - PaaS - C# - Windows
  • Azure - PaaS - Node.js - Linux
  • AWS - Java - Linux
  • AWS - Node.js - Linux
  • AWS - Node.js - Windows

Three of the teams ended up starting with one technology stack, and then completely pivoting to a different language and OS midway through the competition because they weren't getting the results they expected.

Architecture Approaches

As you might expect, the architecture implementation a team ended up with was the key driver of success or failure in this competition. Any software developer on the planet can create a good enough solution and throw tons of computing power at the problem to meet their performance demands - a strategy that will bankrupt the infrastructure budget of even the largest companies. However, it takes a skilled cloud architect to analyze the various trade-offs each decision incurs and strive to make lots of really good small decisions to succeed.

Here's the AWS architecture that one of the better-performing teams put together - in both cost and performance. 2014 Cloud Scale Challenge Demo Day Architecture

Lessons Learned

This was my favorite part of the day. Each team had a wealth of knowledge and experience to speak about that didn't exist just a few months prior to the competition. Everyone unanimously agreed that they are much more comfortable with cloud technologies, and feel confident they could join any cloud-based project and hit the ground running. Mission accomplished.

In addition, there were some more nuanced lessons learned that are outlined below:

  • The teams had a unique opportunity to create a solution from scratch: File->New Project. This is something that some of them haven't done in a long time due to working on larger, more established projects.
  • The knowledge gained with one cloud provider (i.e., Azure) does not necessarily help you in the other.
  • Teams that started focusing on Azure‚Äôs Platform as a Service (PaaS) tended to use their Azure solution for the competition.
  • Creating a solution that works across both cloud providers with minimal code changes, restricted teams from using services that are unique across a single provider (PaaS on Azure).
  • AWS charges you for the whole hour, even if you spin an EC2 instance up for just a few seconds.
  • The team that had the best raw performance focused intently on the low-level aspects of their solution: frameworks, web servers, operating systems, and programming languages. They didn't pick what was familiar to them, they picked what would perform best.
  • Scaling out (adding more machines) is more effective than scaling up (adding more compute resources).


When we set out to create this competition, our goal was simple. Incentiveize people to get up-to-speed on technology platforms that are desperately needed in the industry. We didn't know what the response would be, how applicable the experience would be, or even how many teams would be able to finish.

At the end of demo day, we felt like the results far exceeded our expectations. The teams impressed us with their tenacity and intuitive problem solving abilities, and there was a high level of excitement and passion around the subject area.

We consider the 2014 Cloud Scale Challenge an overwhelming success and I can't wait to see what the teams come up with next year.