Another problem in production

It’s almost a trope of Development that things are “fine” in development and then fail in production but of course, this is true for a number of reasons. Environmentally, things can be different but often it’s simply that something was development on a smaller dataset or smaller tables than a production database and therefore the thing that was “fine” was not actually fine, it’s just that its problems were not amplified enough in development to notice.

In our case, a query on survey results for one customer took 41 seconds, which was longer than the query timeout and he saw an error. Unfortunately, we didn’t have enough logging to definitively work out what was wrong and trying the same thing in our DR system didn’t exhibit the same problem (although that was a reminder that we needed to make sure that the DR system was setup exactly the same and it wasn’t).

It is a little frustrating that we embarrassed ourselves with something that has happened enough times before. Complex query = you must test this in production in a large data set. Part of the problem with agile development happening in independent teams is that it is hard to centralise the things we learn in a way that gives groups autonomy to work in a way that best suits their domain without looking like you want to control things or you don’t trust Devs to do things correctly even if they don’t!

The query problem

Almost always, a query that works “well” in development but not production is probably using some kind of scan or key lookup which works acceptably well in a small scenario but does not scale well. A scan works well enough on tables with relatively small numbers of rows but with scale linearly with table size (e.g. a table with 100K rows vs 50M rows will scan in 500 times less time). A key lookup works in conjunction with an index seek and, again, works OK if the number of matched rows is low but increases significantly if it is more than perhaps 10s of results.

The correct practice for any new or modified query is to get the query plan beforehand (if modifying) and then examine the query plan afterwards. In some cases you can run the old and new queries alongside each other and check for their relative time for execution. If the new one is much more of the relative time or you are creating a new query completely, you need to be able to read and understand a query execution plan. Don’t be fooled! This can be different in development than it is in production but is likely to be the same if the data is spread in a similay way between dev and prod otherwise you should test it in prod. Parameter sniffing for stored procedures can affect this significantly.

What are we looking for? Apart from general complexity and parts of the query which seem to be taking a large part of the percentage (unless they are the bulk of the query), we are looking for scans and key lookups.

Seek = good

If you are querying from a table by an indexed column and all of the returned columns are included in the index, you will get an index seek – super fast and mandatory for querying very large tables.

Scan = bad

If you are querying from a table by an un-indexed column then the analyzer will usually select the smallest index that contains all of the required columns (ultimately the clustered index if no others are suitable) but will need to do a scan since the query column is not in any order, to evaluate where mycolumn = something, I would need to read every single row to resolve that. If I use a top statement then I might be able to finish early but only if I can quickly find the correct number of rows, if there aren’t enough to meet the top statement, I still have to read all of the rows to work this out. Scan is bad unless you have a very small table but even in that case, if it is queried often, why not add relevant indexes, which will also be small and which create good habits!

Seek + Key Lookup - be careful!

The scenario to look out for is when there is an index on the column you are querying but it doesn’t include all of the columns that you want to return. In this case, one of two things will happen: 1) The analyzer will skip that index and simply scan the clustered index or 2) It will use the index for the lookup (a nice quick seek) and then do a key lookup for the other columns which means taking the implicit primary key from the index and doing a single row query on the clustered index for those remaining columns. What determines which one it chooses? Statistics!

Statistics

Statistics in SQL server are a way to give a hint to the query analyzer about the distribution of data in a table by the indexed columns. You automatically get them for any indexes but you can also add your own. Basically think of them as various information about the distribution of data in a table. For example, what is the average number of rows that include a particular column value? What is the relative count of one column with a particular value being in a row with another column having another value. You can read the details in the link but the question the analyzer is asking is: Are there few enough rows that an index seek with an (expensive) key lookup will be faster than a clustered index scan. The key lookup is like another separate query for each matching row so it does quickly get slow.

What is slightly limiting is that the space available for statistics is very small so only a very simple representation is possible.

Our bad query

We had a query that when joined against an empty (or small) table in development, the key lookup didn’t add any appreciable latency and the query was basically instant. The same query run in production where the table being joined only had 6000 rows in it (not crazy right) caused the query to take 41 seconds as mentioned above. This gives you some idea about the potential cost of the key lookup even on a relatively small data set. Note that we were only returning 50 rows of these ultimately but due to the way the where clause worked, the key lookup was done against all rows before they were then filtered out.

Why was it choosing the key lookup? Because the statistics for the index it was using said that the average number of rows per input table row was low and this was true statistically (the “mean” was low). But that is basically like saying that I will assume everyone in the world earns the average of $1,500/month even though a multi-millionaire might earn 100s of 1000s or millions per month. The mean hides any extreme distributions and in our case is what was happening. Some input rows would have no matching rows and others could have 1000s so we can’t really blame the statistics for not being able to work this out.

What could we do?

  1. The first and easiest solution, if possible, is to change the query so that it does not need the key lookup. This means either removing some of the columns you are retrieving so that it can hit the index it is seeking with or in our case, changing the join which was joining unnecessarily on two columns, each indexed in separate indexes. This actually reduced the time from 41 seconds down to pretty much instant by removing the key lookup! Interestingly there is still another key lookup but due to the small number of rows related to that part of the query, it was not a problem.
  2. Secondly, you could consider adding an index that does cover the columns you need and includes what you need to return. It might have a significant cost in terms of storage depending on the size of the table and a generally small performance hit on table updates (which would usually only be an issue on very high traffic tables). You can mitigate the space, if possible, by using a filtered index which essentially means you index on a particular column value e.g. where archived = false. In some cases this is easy where there are a large number of rows that you know will never be returned from your query and could reduce the index size by 10s of percent.
  3. Thirdly, you could do something more clever to make the query analyzer have an easier time. In our case, we have an options table for questions. Each question has 1 or more options and only the text ones might have some sentiment analysis done on them. This means that the sentiment table does not have a normal distribution of optionids. It will have lots for some options and none for others. One option, therefore is to add a column to the options table like has_sentiment which is set to true for any text options and which can form the basis of a filtered index, this won’t necessarily stop the key lookup but will potentially reduce the amount of memory and time to join the tables. Secondly, you could create a joining table of optionid_sentimentoptionid and create a separate foreign key in the sentiment table. Now, only sentimentoptionids are present in the sentiment table which will give a much more accurate distribution statistic since it will only relate to options that can actually have any rows in the sentiment table and not just all rows from the options table. This is obviously more subtle but in some cases might be your only option.
  4. Sometimes you can take a much simpler approach and retrieve your data in two (or more) distinct queries rather than trying to be clever. For example, imagine you have a one to many relationship between users and payments and want 1 row per user with their latest payment date, latest payment amount, greatest payment amount etc. You could do this with sub queries, with groupings or with ctes but it might just be easier to query all users and then have a separate query where you select the last payment for each user and union this with the greatest amount for each user and then either join these separately in sql or do it in the client.

The query analyzer is pretty amazing but it has limited information to work with and sometimes you have to get creative. You can spend a lot of time looking into optimising statistics but ultimately if the distribution of data is too varied, you probably won’t be able to get what you need and an index would be the simplest solution.

But you always need to test your queries in a large dataset and against a large database before accepting it!