How startups should design solutions to scale
Scale: the dirty S-word for companies that lack experience designing such systems. It is a word that can strike fear into the hearts of many, and it can be handled badly at either extreme. If you do not consider scale enough, you might fail very early (or at least have a lot of rework to do later). At the other extreme, you can be so heart-set on scale that you waste time and money trying to build a perfectly scalable system. I believe from experience that the ideal is somewhere in between, and I want to lay out some principles below for deciding on your design when it comes to scalability.
- The first and overriding principle is that you should aim to scale, but not too much in one go. If you were buying a new house, you might get one with some extra bedrooms for any children you are planning on, but you wouldn't usually buy a house with 8 spare bedrooms in case you end up with 8 children. Why? Firstly, you don't really know whether you will end up with 8 children. More importantly, buying too much room to grow costs money, and you can always upgrade your house later if you need to. I think the same applies to software. Having some breathing room and knowing that you can cater for the next 6 months to a year of hoped-for growth is great, but you cannot predict the future.
- Technology moves on. You don't know what technology might readily suit your system in a year or two's time. NoSQL databases, new languages, special hardware and new caching systems can all have a massive effect on your system's performance. If you try to build in 50 years of scalability, you will base it on today's technology and spend millions building a house that will look out of date in 5 years' time!
- You don't know how much your system will need to scale. We are producing a system that could ultimately be used by millions of people around the world, but if I plan the system around that, I will be paying for a lot of spare capacity that will not be needed for a few years, or perhaps ever. By imagining a very good case 6 to 12 months out, I can plan for, say, 100,000 users and base my design on that. I don't need to squeeze every millisecond out of my database queries or multi-thread every single part of my system. At the moment, I don't even use a memory cache, because memory is expensive on hosted servers and we wouldn't be able to cache enough to be useful anyway.
- If you succeed, you will rebuild. As I read in an article the other day, Twitter, Facebook and Google have all had to refactor their technology to suit their scale. Languages have changed, back-ends have changed, and parts have been moved around so that the bottlenecks occur in places that are easy to scale up, like web servers. None of these companies could realistically have built their original systems in the languages they now use. This might be because the new tech didn't exist back then, but it might also be that the development overhead just wouldn't have paid back when the user base was small. Ironically, it might have caused them to be failures instead of successes.
- Your design will change! We have a system with relatively few pages, few use cases and not many routes through it, but we have already changed our design in about 4 major ways inside a year. This has had knock-on effects on the parts of the system that do the work. If I had spent ages designing a super-scalable system in the early days, I might already have had to tear it down and start again with the new design.
- If you end up being successful, you can afford to rework it later. Rather than assuming you need the Rolls-Royce before you are viable, buy an Audi and prove that you are a good driver. Once you succeed, take on more developers and start improving the things that need improving.
- Development cycles are much shorter than they used to be. Our system is relatively simple but if I had to pull out SQL Server and put in MySQL, it wouldn't actually take very long, perhaps a few days or weeks. We shouldn't fear rework and replacement systems - this is part of what we employ developers for.
- Try to identify the areas whose performance will degrade linearly and those that might have an avalanche effect, and monitor all of them. A web server will slow down roughly in proportion to the number of connections, which generally relates to the number of users. At the point that performance becomes unacceptable, I can usually add another web server, and this is usually easy enough. Other parts of the system are potentially more fragile. What happens if you exceed your service provider's bandwidth allowance? Do you get throttled, so that one small request over the limit causes a massive drop in performance? You need to know about these hard limits because if performance drops massively, people might start to leave your service.
- Learn what is easy to scale and what isn't. I recommend that all web apps are designed to work in a server farm. This is automatic with many cloud services (PaaS), but even if you have to create the farm yourself with 2 web servers and the farm server, it allows you to increase web connections very easily. Databases are hard to scale, so keep the database as slick and quick as possible to avoid this issue early on. Try not to perform any CPU-intensive operations on the database server. There are ways to split and shard databases, but these are best avoided since there are all kinds of dragons there.
- Don't worry. Stick with what you know, employ people for the bits you don't know and learn from your mistakes. It is much more important that your company deals with issues in a timely fashion than that it never makes mistakes. Learn from your mistakes by asking why each one happened and what can be done to avoid it or make it less likely to happen again (test cycles, checklists, 3rd-party verification, whatever...).
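
To make the point about linear slowdowns versus hard limits a little more concrete, here is a minimal Python sketch of the kind of check I have in mind. Everything in it is an assumption for illustration: the `Resource` class, the `headroom` and `check` functions, the example figures and the 10%/30% thresholds are all hypothetical and not taken from any real monitoring tool. The idea is simply that a resource with a hard limit (like a bandwidth allowance) should raise an alert much earlier than one that degrades gradually (like web server connections).

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """A monitored resource (hypothetical structure for illustration)."""
    name: str
    used: float        # current usage, in whatever unit the limit uses
    limit: float       # capacity or allowance
    hard_limit: bool   # True if going over causes a cliff (e.g. throttling)


def headroom(resource: Resource) -> float:
    """Fraction of the limit still unused; 0.0 when at or over the limit."""
    return max(0.0, 1.0 - resource.used / resource.limit)


def check(resource: Resource,
          linear_threshold: float = 0.10,
          hard_threshold: float = 0.30) -> str:
    """Return "alert" when headroom is at or below the threshold.

    Hard-limit resources use a larger threshold (assumed 30%) so that we
    hear about them well before the cliff; linearly degrading resources
    can run closer to capacity (assumed 10%) because adding a server is easy.
    """
    threshold = hard_threshold if resource.hard_limit else linear_threshold
    return "alert" if headroom(resource) <= threshold else "ok"


if __name__ == "__main__":
    # Example figures are made up for illustration.
    resources = [
        Resource("web connections", used=750, limit=1000, hard_limit=False),
        Resource("monthly bandwidth (GB)", used=750, limit=1000, hard_limit=True),
    ]
    for r in resources:
        print(f"{r.name}: headroom {headroom(r):.0%}, status {check(r)}")
```

With the same 25% headroom, the web server is still fine because it slows down gradually, whereas the bandwidth allowance is already flagged because going over it could mean being throttled. The real thresholds would depend on how quickly you can react and how your provider behaves at the limit.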