Launching a High Performance Django Site

Are the brakes on your Django app?

When building an application with a framework like Django, the priority is often to get the application working first and optimize it later. The trade-off is between getting it done and getting it done for 1 million users. Here's a checklist of things you can do to make sure your application can be optimized quickly when you put on your optimization hat. Note that most applications don't need all of this, since most never get enough traffic to justify the effort. But if you're lucky enough to need to optimize your Django app, I hope this post can help you.

Note, my background is in building very large high-traffic sites for companies such as Fanball.com, AOL Fantasy Sports, eBay.ca, PGATour and NASCAR. All of those sites were built using ColdFusion with Microsoft SQL Server, MySQL or Oracle, and I only recently jumped into Django. If you're familiar with fantasy sports, you know that you usually rush to a site to set your lineup just before a sporting event starts, and then you check your score while the event is live. Traffic is extremely high during those peak times, so much so that Fanball.com used to be a top 1000 site according to Alexa during football season. Being new to Django, I wish I could have found a post like this when trying to launch a high-traffic site.

Caching and Managers
You probably built your app using managers, right? Even if you're not building the application for a large number of users, you should be using managers anyway to help reuse code. If you are using managers, going back to retrofit your code to use memcached or some other type of caching is straightforward, because every lookup already goes through one place. Be sure to use cMemcache in your production environment. And be sure NOT to mix cMemcache and the regular memcache Python library. The keys they generate are not 100% compatible. You'll be able to read keys from either, but cMemcache won't write keys that regular memcache can always read. I'm not sure why that's true, but you've been warned.
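To illustrate why managers make the retrofit easy, here is a minimal sketch of the read-through pattern. It deliberately avoids Django so it runs on its own: SimpleCache is an in-memory stand-in for a memcached client, and TeamManager, get_team and fake_query are hypothetical names. In a real project the method would live on a models.Manager subclass and talk to the configured cache backend.

```python
import time


class SimpleCache:
    """In-memory stand-in for a memcached client (get/set with a timeout)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired entries behave like a miss
            return None
        return value

    def set(self, key, value, timeout):
        self._store[key] = (value, time.monotonic() + timeout)


class TeamManager:
    """Hypothetical manager: every team lookup goes through one method."""

    def __init__(self, cache, load_from_db):
        self.cache = cache
        self.load_from_db = load_from_db  # callable that runs the real query

    def get_team(self, team_id, timeout=300):
        key = "team:%d" % team_id
        team = self.cache.get(key)
        if team is None:
            team = self.load_from_db(team_id)  # cache miss: query once
            self.cache.set(key, team, timeout)
        return team


db_calls = []


def fake_query(team_id):
    """Pretend database query that records how often it is called."""
    db_calls.append(team_id)
    return {"id": team_id, "name": "Team %d" % team_id}


manager = TeamManager(SimpleCache(), fake_query)
first = manager.get_team(7)
second = manager.get_team(7)  # served from the cache, no second query
```

Because callers only ever use get_team, adding or removing the cache changes one method rather than every view.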

Dog Piling and Caching
This is a big deal. No matter how well you're caching objects, you need to make sure only ONE process is refreshing the cache when an entry expires. Otherwise every request that misses will hit the database at the same moment. There are several ways to handle this. Mint Cache is a good solution. Another solution is to use managers that ONLY read the cached objects, and have a separate process refresh the cache. You can use signals to flag that an object needs to be refreshed. Or you could refresh it on a timed interval.
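The idea behind that style of cache can be sketched in a few lines. This is not the Mint Cache code itself, just an illustration under simplifying assumptions: a dict stands in for memcached, MintStyleCache and the grace parameter are made-up names, and the clock is injectable so the behaviour can be shown without sleeping. The first caller to find a stale entry is told to refresh it, while everyone else keeps getting the old value for a short grace period.

```python
import time


class MintStyleCache:
    """Dict-backed sketch of serving stale data while one caller refreshes."""

    def __init__(self, grace=30, clock=time.monotonic):
        self._store = {}    # key -> (value, soft_expiry)
        self.grace = grace  # seconds the stale value keeps being served
        self.clock = clock

    def set(self, key, value, timeout):
        self._store[key] = (value, self.clock() + timeout)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, soft_expiry = entry
        if self.clock() < soft_expiry:
            return value
        # Stale: push the soft expiry forward so later callers keep getting
        # the old value, and report a miss to this caller only.
        self._store[key] = (value, self.clock() + self.grace)
        return None


now = [0.0]  # fake clock so the example does not need to sleep
cache = MintStyleCache(grace=30, clock=lambda: now[0])
cache.set("scores", "v1", timeout=60)
fresh = cache.get("scores")                # still within the timeout

now[0] = 61.0
first_after_expiry = cache.get("scores")   # miss: this caller rebuilds
second_after_expiry = cache.get("scores")  # others still see the old value

cache.set("scores", "v2", timeout=60)      # the rebuilding caller stores it
refreshed = cache.get("scores")
```

With a real memcached backend the check-and-extend step is not atomic, so a couple of processes may still refresh at once, but that is far better than all of them doing it.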

Health Check for Load Balancers
You have two options to survive a server going down. One is to have a hot spare waiting to be put online. In this scenario the load balancer should do this automatically. The other is to have that hot-swappable server already online and make sure that your load can always be handled with at least one server down. I prefer the latter solution, as it guarantees that the "idle" server is actually functional under load. Opinions vary. Avoid having the load balancer automatically shut down a server. You could cause a death spiral very easily this way, because each server that drops out pushes more load onto the ones that remain.

Access Servers Directly
It's important to be able to access the servers directly, even if you are behind a load balancer. This will often be the only way to reliably test unusual problems that might be happening in production. To do this, make sure each server is configured to answer at its own hostname, for example web1.mysite.com. You should have web1.mysite.com in your hosts file or in your local DNS server. If you can't get to the server directly because of a firewall, try using a free VPN like Hamachi to get through.

Connection Caching
Be sure to have some sort of connection pooling or connection caching. I was able to get SQLAlchemy installed within 30 minutes. It's easy to set up. Make sure that the database timeout value (base.py) in SQLAlchemy matches the timeout value for the connection at the database. Otherwise the pool can hand out a connection that the database has already closed.
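To show what the timeout is protecting against, here is a rough standard-library sketch of a pool that throws away connections older than a recycle age. RecyclingPool and its parameters are invented for illustration and sqlite3 stands in for the real database. SQLAlchemy's own pool handles this for you (its pool_recycle setting plays the role of recycle here), so treat this as an explanation rather than something to deploy.

```python
import queue
import sqlite3
import time


class RecyclingPool:
    """Tiny connection pool that discards connections older than `recycle`."""

    def __init__(self, connect, size=5, recycle=3600, clock=time.monotonic):
        self.connect = connect  # callable that opens a new connection
        self.recycle = recycle  # max connection age in seconds
        self.clock = clock
        self._idle = queue.Queue(maxsize=size)

    def acquire(self):
        """Return (connection, created_at), reusing an idle one if possible."""
        try:
            conn, created_at = self._idle.get_nowait()
        except queue.Empty:
            return self.connect(), self.clock()
        if self.clock() - created_at >= self.recycle:
            conn.close()  # too old: the server may already have dropped it
            return self.connect(), self.clock()
        return conn, created_at

    def release(self, conn, created_at):
        """Hand a connection back; close it if the pool is already full."""
        try:
            self._idle.put_nowait((conn, created_at))
        except queue.Full:
            conn.close()


now = [0.0]  # fake clock so the example does not need to wait an hour
opened = []


def connect():
    conn = sqlite3.connect(":memory:")
    opened.append(conn)
    return conn


pool = RecyclingPool(connect, size=2, recycle=3600, clock=lambda: now[0])
conn, created = pool.acquire()
pool.release(conn, created)

reused, created2 = pool.acquire()    # young enough: same connection again
pool.release(reused, created2)

now[0] = 4000.0                      # older than the recycle age
replaced, created3 = pool.acquire()  # old one is closed, a new one opened
```

The recycle age should be a bit shorter than the idle timeout configured on the database server, so the pool always gives up on a connection before the server does.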

Avoid Thread Thrashing
Keep those threads alive! Check your Apache settings and make sure that you don't have thread thrashing. You don't want Apache killing and starting threads on you. Every time it does that, Django needs to initialize, which is an expensive process. Any objects you might have built in memory need to be rebuilt. The database connection needs to be established again. These are all things you only want to do once. Would you boot up your machine every single time you want to send an email? You probably leave it on during working hours. The same thing applies here. Leave those threads on. Here's an example of settings I've used with Apache on a 4 CPU server.

StartServers 20
ServerLimit 20
MinSpareServers 20
MaxClients 20
MaxRequestsPerChild 100000

Note that MaxRequestsPerChild COULD be set to 0 and have the thread never reset, but just in case there's a memory leak somewhere (I have yet to see one) I have it reset every 100k requests. Don't just set the ServerLimit and MaxClients connections to some crazy high number. Remember, there's only so much memory on the server to go around. If you start to swap memory, your server is dead. Additionally, if your server is already CPU bound (85% CPU utilization), setting these numbers higher is not going to help. You'll just increase the overhead of switching between all the processes.
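A quick back-of-the-envelope calculation helps when picking those limits. The sketch below is only an illustration: max_clients is a made-up helper, and the memory figures are hypothetical, so measure the resident size of your own Apache processes before trusting any number.

```python
def max_clients(total_ram_mb, reserved_mb, per_process_mb):
    """Largest process count that fits in RAM without swapping."""
    usable = total_ram_mb - reserved_mb
    if usable <= 0 or per_process_mb <= 0:
        raise ValueError("need positive usable memory and process size")
    return usable // per_process_mb


# Hypothetical numbers: a 4 GB box, 1 GB kept for the OS and other services,
# and 150 MB resident per Apache/Django process.
limit = max_clients(total_ram_mb=4096, reserved_mb=1024, per_process_mb=150)
```

With these example figures the result is 20 processes. If the memory ceiling comes out higher than what the CPUs can keep busy, the CPU is the real limit and the lower number wins.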

Cache Templates
In a production environment, you should cache templates. Here's a great snippet to do that.
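If your Django version ships the built-in cached template loader, the same effect can be had from settings alone. The fragment below is a sketch that assumes the TEMPLATES-style configuration of later Django releases, and the template directory is a placeholder. It is not the snippet referred to above.

```python
# settings.py fragment (sketch). The template directory is a placeholder.
TEMPLATES = [
    {
        "BACKEND": "django.template.backends.django.DjangoTemplates",
        "DIRS": ["/path/to/templates"],
        # APP_DIRS is left out on purpose: it cannot be combined with an
        # explicit "loaders" option.
        "OPTIONS": {
            "loaders": [
                (
                    # Wraps the loaders below and keeps compiled templates
                    # in memory after the first load.
                    "django.template.loaders.cached.Loader",
                    [
                        "django.template.loaders.filesystem.Loader",
                        "django.template.loaders.app_directories.Loader",
                    ],
                ),
            ],
        },
    },
]
```

Cached templates are not reloaded when the file changes, so this belongs in production settings only.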

Note About Load Testing
There are two ways to test how your application is going to perform under load. The more expensive, more time-consuming and less accurate of the two is load testing software. You can start by testing individual pages with software as simple as ab, which comes with Apache. Or you can spend some real money on load testing suites that produce nice reports and allow load testing from multiple clients, so that you can effectively test against a load-balanced farm. Just remember that scripted load testing is the less accurate and more expensive way to find out how an application behaves under load. You can spend lots of time and money building scripts that try to get close to a real-world scenario but will never actually be real. The advantage is that you will get back some useful data, and you can do this before a single user hits a web page. A better approach, if it's possible, is to roll out the application gradually. Gradually increase both the number of users and the number of expensive features. This is not always possible, but usually it is.

Reverse DNS and Mutexes
This might sound obvious, but be sure your DBA has checked this one. MySQL likes to do reverse DNS lookups on the IP when it receives a connection. Either start MySQL with --skip-name-resolve or be sure that reverse DNS is configured properly. Also, if you're going to have a large number of connections (probably one per Apache thread plus a few extra), be sure the mutex count in the OS is set high enough. We've had to raise it to 1000 on a very large installation.

Miscellaneous
Here are a few things covered in other posts that I feel I need to include here because, well, they're very low-hanging fruit. Remember to reduce the number of queries. If you're doing something like team.player.name in a template and you're not using select_related() or building your own object, Django will automatically query the data for you. This is a huge problem if it's in a loop, because every iteration runs its own query. Additionally, try to combine ORM calls if it makes sense. Try to go to the database as little as possible. It's often easy to browse through the queries on a page and see where often-used lookups can be cached or different ORM calls can be combined.
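The cost of the loop is easy to see with a small standalone demonstration. The sketch below uses sqlite3 instead of the Django ORM so it runs anywhere. The team and player tables are hypothetical, modelled on the template expression above, and the run helper just counts queries. The joined query is roughly what select_related("player") asks the database for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE player (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE team (id INTEGER PRIMARY KEY, name TEXT,
                       player_id INTEGER REFERENCES player(id));
    INSERT INTO player VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO team VALUES (1, 'Reds', 1), (2, 'Blues', 2), (3, 'Greens', 3);
""")

counter = {"queries": 0}


def run(sql, params=()):
    """Execute a query and count it, like the per-page query count."""
    counter["queries"] += 1
    return conn.execute(sql, params).fetchall()


# Lazy pattern: one query for the teams, then one more per team in the loop.
lazy = []
teams = run("SELECT id, name, player_id FROM team ORDER BY id")
for team_id, team_name, player_id in teams:
    rows = run("SELECT name FROM player WHERE id = ?", (player_id,))
    lazy.append((team_name, rows[0][0]))
lazy_queries = counter["queries"]

# Joined pattern: a single query fetches both tables at once.
counter["queries"] = 0
joined = run(
    "SELECT team.name, player.name FROM team "
    "JOIN player ON player.id = team.player_id ORDER BY team.id"
)
joined_queries = counter["queries"]
```

Both approaches return the same rows, but the lazy version needs one query per team plus one, while the joined version needs a single query no matter how many teams there are.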

Be sure to monitor your disk space. You probably want to turn off most logging on Apache.

Also, please don't serve your images and static content through Python. It's like using your flat bed to transport a letter.

Conclusion
You've already chosen Django and Python so you know you have room to improve performance. Plan ahead. You don't have to slow down development to optimize the application for those mythical million users, but use managers whenever possible. Keep an eye on that query count at the bottom of the page.

If all else fails and you're under the gun, ask. The IRC channel can really help when things are happening right now.