Large Databases, Primary Keys, Replication and Pain
So I am thinking about some fairly fundamental issues with the design for my new project, which could scale to millions of users and tens of millions of database rows. I don't know at this point (and don't want to decide yet) how I will handle running at the limits of the system, but I also don't want to spend lots of time and money before I know the system is popular and used enough for that to be a problem.
So I need to make some smart choices now that might help later, even though I cannot tell what will happen. One of those choices is database primary keys.
A database primary key is very powerful: it is by definition unique and provides a way to uniquely identify a row in a database table. You don't have to have one, but in most cases it is a no-brainer. It is often a simple integer value that auto-increments every time a row is inserted and that can be used easily in SQL statements like "UPDATE table_name SET column_name = 'something' WHERE id = 123". Of course, this would work without a primary key, but in many databases the primary key does something significant: by default it becomes the clustered index, defining how the data is physically ordered in memory or on disk. This has an effect on both write time and read time.
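As a concrete sketch (the table and column names here are made up for illustration), this is roughly what that looks like in SQL Server flavoured SQL:

    -- Hypothetical table with an auto-incrementing integer primary key.
    -- In SQL Server the primary key is clustered by default, so rows are
    -- physically ordered by id and new rows append at the end.
    CREATE TABLE users (
        id   INT IDENTITY(1, 1) PRIMARY KEY,  -- seed 1, increment 1
        name NVARCHAR(100) NOT NULL
    );

    -- The key makes single-row lookups and updates trivial:
    UPDATE users SET name = 'something' WHERE id = 123;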
Imagine I am looking for a house number on a road. If the numbers are ordered, as they usually are, it is very easy to find the house (usually!). In the same way, if the database is fetching a record with a given primary key, it can find it very quickly, since the data is ordered by the primary key. Now imagine the house numbers were not ordered. You could still find a given house, but on average you would have to travel half the length of the road, checking every house on the way.
For this reason, it should be easy to see why an auto-increment integer works well. It is a simple value to find, and because it is ever increasing by definition, new rows are always added to the end of the existing data. If this were not the case, the database would have to re-order data to accommodate new rows, and for larger tables that additional write time could run to minutes or hours!
So we are happy? Nope, there is one thing that an integer, for all its predictability, is not good at: replication. If you have multiple master databases (where data can be written to more than one database) and you want that data copied across to the others, you usually have a problem. You end up with two different rows that have the same primary key value in two separate databases. What do you do? You have to copy each unique row to the other database, where it will end up with a different primary key value. Row A could be 1 on database X and 2 on database Y, while row B could be 2 on X and 1 on Y. How confusing! If you have lots of related data in other tables, that also has to be replicated and have its foreign keys rewritten in the process. The upshot is that the master databases are not actually exact copies, and you cannot just swap between them from the web servers because their ids will not match.
Of course, you can sometimes avoid these problems (and if possible you should!) by having either completely separate databases that never replicate (perhaps split by geographic region), or by directing all writes to a single master and copying out to many read-only replicas. That way, replication only ever flows in one direction and there can never be conflicts, but it does mean a lot of work being done by a single database, which might present very real performance problems. You could also mitigate the problem by seeding the primary keys with different starting values on different databases, so that database A starts at 1 and database B starts at, say, 1,000,000 and in theory the ids never conflict. But that is an assumption: if database A ever ended up with a million rows, you would be stuffed.
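Here is a minimal sketch of the seeding idea in SQL Server flavoured SQL (the table is hypothetical; the IDENTITY(seed, increment) syntax is real). It also shows an interleaving variant, which sidesteps the range-exhaustion problem at the cost of only ever supporting two masters:

    -- On database A: ids start at 1.
    CREATE TABLE orders (
        id    BIGINT IDENTITY(1, 1) PRIMARY KEY,
        total DECIMAL(10, 2) NOT NULL
    );

    -- On database B: ids start at 1,000,000. Fine until A grows past it!
    -- CREATE TABLE orders ( id BIGINT IDENTITY(1000000, 1) PRIMARY KEY, ... );

    -- Safer variant: interleave rather than offset, so neither side can
    -- ever collide with the other no matter how many rows it holds.
    -- On A: IDENTITY(1, 2)  -> 1, 3, 5, ...
    -- On B: IDENTITY(2, 2)  -> 2, 4, 6, ...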
This brings us to an interesting alternative to integers: GUIDs (Globally Unique Identifiers). The original version-1 GUIDs combine the date and time with some hardware information to generate theoretically unique values (the now-common version-4 variant is essentially random). In practice they are not guaranteed to be 100% unique, so you still need to handle the edge case where a duplicate occurs, but that might be easier to deal with. Replication becomes easy because, for the most part, different rows will always have different primary key values: they are unique, so they can identify a row across databases. And although some people seem to dislike the idea of guids in URLs (for some reason), they do solve the replication problem.
So we should just use GUIDs? Eh, no! Guids bring their own problem. Although they are unique (more or less!), they are NOT sequential. What does this mean? Each new row's random id lands at an effectively random point in the existing key order, so the database has to split pages and shuffle data on disk to keep the clustered ordering correct. This is slow and fragments the index; it would be like building a new house in the middle of a road and having to renumber every other house to keep the ordering correct! So GUIDs mostly solve replication issues but are not actually primary-key friendly on any table with more than a couple of hundred rows (and not even then if the table is heavily used). Guids also take up more database space (36 characters if stored as a string, or 16 bytes if stored as raw binary, compared to a 4 or 8 byte integer), which adds up over time.
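For illustration (the table is hypothetical; NEWID() is SQL Server's built-in random GUID generator), this is what the problematic version looks like:

    -- Random GUID primary key: every insert lands at an effectively
    -- random point in the clustered index, causing page splits and
    -- fragmentation as the table grows.
    CREATE TABLE sessions (
        id      UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY,
        user_id BIGINT NOT NULL
    );

    -- A GUID looks something like 3f2504e0-4f89-11d3-9a0c-0305e82c3301:
    -- 36 characters as text, 16 bytes as raw binary.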
So are we stuck between a rock and a hard place? Well, there is one other alternative that merits a mention: sequential guids (sometimes called "comb" guids). They are still supposed to be globally unique (more or less!), but the generator puts the date and time in the most significant part of the identifier, so newer ids are always sequentially higher in value. That means they work well as primary keys, while their uniqueness still makes them good for replication.
So we should just use these sequential ids? Not necessarily. If you are never going to replicate in two directions across databases, just stick with the integer; it is fastest and takes up the least space in the database. If you will (or might, gulp) have to replicate in two directions, sequential ids are a good choice, but there are implications. SQL Server, for instance, provides a function (NEWSEQUENTIALID), but it exists in the database layer and can only be used as a column default, whereas one of the cool things about guids is that you can generate them in the business layer or client without having to hit the database first. You could instead generate sequential ids in code, but do your homework: check how unique they really are, and decide how you will detect and repair a duplicate guid. Perhaps you could do something hybrid and use a guid plus an integer, with the primary key spanning both columns, so that the odd duplicate guid does no harm. I'm not sure that's a good idea, but it avoids the need to detect duplicates.
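To make the database-side option concrete, here is a minimal sketch (the tables are hypothetical; NEWSEQUENTIALID() and its default-constraint restriction are real SQL Server behaviour), including a version of the hybrid composite-key idea above:

    -- NEWSEQUENTIALID() can only appear in a default, so the id is only
    -- known once the insert hits the database.
    CREATE TABLE events (
        id      UNIQUEIDENTIFIER DEFAULT NEWSEQUENTIALID(),
        payload NVARCHAR(MAX) NOT NULL,
        CONSTRAINT pk_events PRIMARY KEY (id)
    );

    -- The hybrid idea: a guid plus a tie-breaking integer, with the
    -- primary key spanning both, so a duplicate guid cannot violate
    -- the key. (A sketch of the idea, not a recommendation.)
    CREATE TABLE events_hybrid (
        guid_part UNIQUEIDENTIFIER NOT NULL,
        seq_part  INT IDENTITY(1, 1) NOT NULL,
        payload   NVARCHAR(MAX) NOT NULL,
        CONSTRAINT pk_events_hybrid PRIMARY KEY (guid_part, seq_part)
    );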