Although our user-facing site is the most visible part of what we do, it's only half the story. The other half takes place in asynchronous queues, which work overtime behind the scenes processing hundreds of millions of tasks each year. In this post I'll explain a bit about why we use queueing at 99designs, and how it all works.
A Bit of Background
If you've never heard of asynchronous task queues before, the idea behind them is pretty simple. Say you had a task you needed to do, such as buying some milk, but you didn't have the time to take care of it yourself, so instead you left a note for a friend/spouse/roommate asking them to do it for you when they had a chance. Congratulations, you've just implemented an asynchronous task queue.
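In code, the note-on-the-fridge analogy is just a list of task descriptions that a producer appends to and a consumer drains later. Here's a toy Python sketch (the `buy_milk` task is, of course, hypothetical):

```python
from collections import deque

# A minimal in-memory task queue: the "note on the fridge".
queue = deque()

def enqueue(task_name, **kwargs):
    """Producer: leave a note describing the work, then move on."""
    queue.append({"task": task_name, "args": kwargs})

def work_one():
    """Consumer: pick up the oldest note and actually do the work."""
    job = queue.popleft()
    if job["task"] == "buy_milk":
        return f"bought {job['args']['litres']}L of milk"

enqueue("buy_milk", litres=2)   # the producer never waits for this to happen
print(work_one())               # "bought 2L of milk"
```

Real queues add persistence, multiple consumers, and failure handling, but the core idea is no more complicated than this.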
Why We Use Queues
Now obviously our web apps aren't busy ordering milk; more common uses for a queue are things like talking to third-party APIs, sending emails, or performing computationally expensive tasks like image resizing. But why do we need a queue at all? Wouldn't it be easier to just do the work a user requires immediately? Well, there are a few reasons:
The first reason is speed: when we're talking to a third-party API we have to face reality; unless that third party is physically located next to our infrastructure, there's going to be latency involved. All it would take is the addition of a few API calls and we could easily end up doubling or tripling our response time, leading to a sluggish site and unhappy users. However, if we push these API calls onto our queue instead, we can return a response to our users immediately while our queues take as long as they like to talk to the API.
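To make the speed argument concrete, here's a toy sketch (Python, with simulated latency; none of these names come from our codebase) contrasting an inline API call with pushing the work onto a queue:

```python
import time
from collections import deque

task_queue = deque()  # stand-in for the real queue

def slow_third_party_call(payload):
    """A hypothetical remote API with noticeable network latency."""
    time.sleep(0.2)  # simulated round trip to a distant service
    return "ok"

def handle_request_inline(payload):
    # Inline: the user waits out the full API latency.
    return slow_third_party_call(payload)

def handle_request_queued(payload):
    # Queued: record the work for later and respond immediately.
    task_queue.append(("third_party_call", payload))
    return "accepted"

start = time.monotonic()
handle_request_queued({"design_id": 42})
elapsed = time.monotonic() - start
print(f"queued response in {elapsed * 1000:.1f}ms")  # effectively instant
```

The queued handler's response time no longer depends on the third party at all; the latency is paid later, by a worker, where nobody is waiting on it.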
The second reason is reliability: we don't live in a world of 100% uptime; services do go down, and when they do it's important that our users aren't the ones who suffer. If we were to make our API calls directly in the user's request we wouldn't have any good options in the event of a failure. We could retry the call right away in the hope that it was just a momentary glitch, but more than likely we'd either have to show the user an error or silently discard whatever we were trying to do. Queues neatly get around this problem, since they can happily keep retrying in the background, and all the while our users never need to know anything is wrong.
The final reason to use a queue is scalability. If we had a surge in requests involving something CPU-intensive like resizing images, we'd have a problem if our app servers did that work themselves. Not only would the increased CPU load slow down other image-resize requests, it could very well slow down requests across the entire site. What we need is to isolate this workload from the user's experience, so that it doesn't matter whether it happens quickly or slowly. This is where queues shine: even if our queues become overloaded, the rest of the site remains responsive.
How We Implement It At 99designs
Understanding why we use queues is one thing; actually implementing them is quite another. At 99designs we chose beanstalk as our queue, its key features being performance, reliability and simplicity. Rather than having just one centralized queue that all our servers push tasks to, we instead run a separate queue on every app server. Queuing a task locally is always going to be very quick, and having multiple queues gives us a healthy dose of redundancy in the event of a beanstalk failure (something I've yet to see).
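Part of beanstalk's simplicity is its line-based text protocol: a producer submits a job with a `put` command carrying a priority, a delay, and a time-to-run (TTR). As an illustration, here's a pure-Python sketch that builds the wire frame for a `put` (the job payload and parameter values are made up, not our production settings):

```python
def put_frame(body: bytes, priority=1024, delay=0, ttr=60) -> bytes:
    """Build a beanstalk `put` command frame.

    priority: lower numbers are more urgent
    delay:    seconds before the job becomes ready for workers
    ttr:      seconds a worker has to finish before the job is re-queued
    """
    header = f"put {priority} {delay} {ttr} {len(body)}\r\n".encode()
    return header + body + b"\r\n"

frame = put_frame(b'{"task":"resize_image","design":42}')
print(frame)
```

In production a client library handles this framing for you; the point is that the whole protocol is simple enough to read over telnet, which makes debugging a queue problem refreshingly low-tech.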
Of course, on its own a queue doesn't do anything; simply pushing tasks onto it doesn't magically make them happen. For that you need a worker. Since a lot of what 99designs does is PHP-based, we had to develop our own custom worker daemon, as there weren't any open-source solutions for beanstalk. Each worker node maintains a pool of worker processes that connect to each queue and listen for new tasks. If our queues start to back up, we can easily launch a new worker instance thanks to AWS and have it processing tasks within minutes.
Our one-queue-per-app model
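Our daemon is written in PHP, but the reserve-and-dispatch loop at its heart is easy to sketch. Below is a toy Python version using an in-memory deque as a stand-in for a beanstalk tube; the `send_email` handler and all the names are hypothetical:

```python
from collections import deque

queue = deque()  # stand-in for one beanstalk tube

# Map task names to the code that performs them.
HANDLERS = {
    "send_email": lambda args: f"emailed {args['to']}",
}

def worker_loop(max_jobs):
    """One worker process: reserve a job, dispatch it, repeat."""
    results = []
    for _ in range(max_jobs):
        if not queue:
            break  # a real worker would block waiting, like beanstalk's reserve
        task, args = queue.popleft()
        results.append(HANDLERS[task](args))
    return results

queue.append(("send_email", {"to": "designer@example.com"}))
print(worker_loop(max_jobs=10))
```

A real worker pool runs many of these loops as separate processes, one pool per worker node, each connected to every app server's queue.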
The final piece that ties it all together is how we deal with failures. When a task fails it doesn't simply disappear into the ether. Instead we release the task with a delay, essentially putting it back onto the queue after a certain amount of time has passed. If a task continues to fail we use an exponential backoff strategy to prevent failing tasks from clogging up our queues. If after multiple releases the task still hasn't finished successfully, we "bury" it. Buried tasks are manually inspected at a later date, and a decision is made to either fix the problem (if possible) or delete the task if it's deemed unrecoverable.
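The release-or-bury decision can be sketched as a small function. The constants below are illustrative, not the values we actually run with:

```python
MAX_ATTEMPTS = 6   # hypothetical cap before burying
BASE_DELAY = 30    # hypothetical base delay in seconds

def next_action(attempts):
    """Decide what to do with a job that just failed.

    attempts: how many times the job has failed so far (>= 1).
    Returns ("release", delay_seconds) to retry later with
    exponential backoff, or ("bury", None) to park the job
    for manual inspection.
    """
    if attempts >= MAX_ATTEMPTS:
        return ("bury", None)
    return ("release", BASE_DELAY * 2 ** (attempts - 1))

print(next_action(1))  # ('release', 30)
print(next_action(3))  # ('release', 120)
print(next_action(6))  # ('bury', None)
```

Doubling the delay on each failure means a flapping external service sees quickly thinning retry traffic, while a genuinely broken task ends up buried rather than spinning forever.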
Work in progress
There's still more we'd like to do with our queueing systems. If too many tasks fail and are buried, we get alerts that wake us in the night. Sometimes this happens because an external service we use goes down, which we can't do much about anyway. Unfortunately, our tasks don't yet coordinate their backoff in any way; each assumes it's the only one failing. Ideally we'd track failures against external APIs and simply delay groups of tasks for longer, giving third-party services time to recover.
Asynchronous tasks are a crucial, and often overlooked, part of how any large site operates. We hope this post gives some insight into why queues are important, and how they can improve your site's performance and reliability. While there are many possible queueing architectures, we also hope our recipe serves as a useful pattern for other sites.