Computer Student

Perfection is a Software Engineer’s worst enemy

2024-06-22T15:02:00+00:00

Perfection or Pragmatism

I have heard the quote that the correct people to employ are “smart people who get shit done” attributed to a number of people but I first heard it when reading about Stack Overflow (I think!) so I will attribute it to Jeff Atwood and judging by the search results in his blog, it’s certainly a phrase he uses a lot.

Now I have worked with both extremes. Some people are super practical but lack the intelligence to “work smart”, and perhaps produce a long-winded or inefficient solution to an already solved problem. I have worked with fewer people at the other extreme: super clever but lacking the urgency to get stuff done and out the door. Still, people seem to fall into one camp or the other.

Some people, however, don’t fall into either camp: they are neither particularly clever nor super-pragmatic, and I think these are most of the people that I work with and have worked with. They are kind of reliable, they work on lots of similar Line of Business Apps and can achieve most things to an acceptable but not amazing level. They get stuff out the door but perhaps not as efficiently as they could do and they create solutions that are just “OK”.

However, these other people still suffer from what I think is a Software Engineer’s worst enemy: Perfectionism. It afflicts many other Engineering disciplines and probably other industries but I wanted to keep my scope to an area I am more experienced in and not blanket the entire world in my belief.

Code Detail or Scope Creep

I think this perfection comes in both deep and wide forms. The deep-form of perfectionism is someone spending far too long, for example, polishing a Regular Expression to get every ounce of performance out of it even though it is only used on the sign-up form. You know, someone wanting to save 2 ms on a form that is only posted once per customer and takes a second to actually execute. It’s also like articles on the “right way” to write your CSS even though in my many years of software, I have yet to find someone who has failed because they used pixels instead of rems or ems or whatever else some purist believes in. Most of the time, UX sucks because it hasn’t been thought about properly, not because the CSS didn’t fit a paradigm.

The “wide” form of perfectionism is when something that should be simple and quick turns into a wider and wider stab at covering every possible combination of everything “just in case”. You might have heard of the YAGNI principle, i.e. You Aren’t Going to Need It, but I also see lots of people who don’t understand it. The counter to YAGNI is “While you have the hood open…” but, again, we spend weeks or months building out things that haven’t been tested with our customers yet. I see apps like Slack and Teams releasing new super-complicated features that I struggle to imagine are going to be useful to any more than a tiny percentage of users but have now broken all of our UIs because every change is a breaking change.

Where to draw the line

Of course, if I say, “you’re taking too long”, “you’re making this too complicated”, your argument might be “I need to do it properly” so how do we articulate what is or isn’t the right level of detail or abstraction or scope?

The right amount of work is the minimum work I have to do to solve this problem in a way that performs appropriately well and is maintainable if we need to come back and make further changes

But the “OCD” kicks in easily: “You can’t just use an arbitrary ‘if’ statement to check for a specific account number!”. Why not? The code does what it says on the tin probably with a comment explaining why that account behaves differently. “What if another customer needs it?”. If they do I’ll either come back and refactor it or I will just add a second ‘if’ statement.

This is not saying that you don’t care. Writing something that has not been abstracted is not automatically lazy or lacking in due care. If it is a quick and easy win for a customer, deal with it!

“You can’t just use a string comparison, it is much slower than a regular expression”. If you can easily do the same thing with a regular expression, great, knock yourself out: being pragmatic doesn’t mean you can’t do something well by default. But if that regular expression is going to take you 2 hours and the string compare 5 minutes, get over yourself. The exception is code that is called 1000s of times a second and needs to be slick, where the appropriate level of performance is different.

Business is business

The world moves very quickly. The software industry changes quickly. Most of your code won’t live much longer than 10 years, if that, before someone decides on your behalf that Java is rubbish and Python is good; or server-side is 1990s and front-end code is the way ahead. Of course, all of these are nonsense because most frameworks can handle most scenarios without any particularly sticky problems, so you are better off choosing what your staff are experts at. My own opinion is that server-side apps are MUCH easier to reason about, to debug and to maintain than any front-end framework.

Your business exists to pay bills and salaries and ideally, make enough money to make everyone’s life a bit nicer. Your business succeeding might be a nice pay-off when the company is bought. It might be nice offices, nice free coffees and fruit or the latest hardware to work with. Every time you get stuck in some kind of “only 100% is good enough” trap, then you are eating away at that success and taking away from everyone in the business for a piece of code that no-one cares about.

So get over yourself, stop fretting, and just get stuff done!

Approaching a performance problem

2024-06-01T15:59:00+00:00

Background

One of the common jobs of Developers is debugging something that is slow and hopefully being able to fix it. This is just such a tale, and although the more senior of you might not learn much from it, those who are less experienced might well do!

We have an internal web application at SmartSurvey, which we will call Admin and which we use to manage user accounts including the various resources within their accounts. One of the relatively newer features is a page called Review Surveys, which allows you to see which paid features you would lose if you downgraded your account plan to a cheaper alternative. For example, if you want to use the Net Promoter Score question type, you need to be at least on the Business plan; the Professional and free plans don’t have the feature. If you reviewed a Business customer who wanted to downgrade to Professional, for example, it would flag NPS questions on any surveys that use them and tell you that the minimum plan for this feature is Business.

The problem was that this page is slow, VERY SLOW. In our development environment, we have a large account that has around 870 surveys. Not that many compared to some of our customers but enough to test a page like this that processes a lot of data. If you review surveys on this account in the debugger, it takes 2 minutes not just for the page to load the first 10 items in a listing control but also each time you move to another page of 10. If you try a similar thing in production, it will time out so it is basically useless on any large account.

Why things might be slow

Things like this are a red rag to a bull when I see them. They are just so wrong that I have to fix them even if no-one has asked me to (although I don’t know why they haven’t, it is so bad). The first question is really, can I easily find out why it is slow and if I can, can I benchmark anything that I can improve so I can see how much difference I can make?

There aren’t lots of things that can make something that slow. Ignoring unusual causes like weird lock timeouts on a database, slowness might be number crunching (graphics, hard maths) but we don’t do any of that. It could be processing enormous amounts of data but with network and disk speeds so high, it would have to be enormous amounts to be noticeable in most cases (or maybe if we were doing it many times in parallel). Otherwise the three likely issues are 1) Excessive database query numbers 2) Very slow query performance 3) Poor choice of collection types when processing data in-memory.

For 3), you might find it strange that this is an issue but I once sped up part of our system from 10 minutes to about 20 seconds by recognising that calling 1000s of collection.Where() in Linq was effectively iterating an entire list millions of times and although that work was not strenuous for the computer, it takes a finite amount of time to move through a list, perform a comparison and then store the result. Instead, I converted the data into a Dictionary of key => list and the dictionary lookups being super-fast made a massive difference.
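To make the difference concrete, here is a minimal sketch of the two shapes, not the original code. The data (1,000 parents with 5 rows each) and every name in it are made up for illustration; the point is only that the Where() version walks the whole list for every parent, while the dictionary version walks it once.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical data: 5,000 rows spread evenly across 1,000 parent ids.
var rows = Enumerable.Range(0, 5_000)
    .Select(i => (ParentId: i % 1_000, Value: i))
    .ToList();
var parentIds = Enumerable.Range(0, 1_000).ToList();

// Slow shape: every Where() scans all 5,000 rows, so the loop performs
// 1,000 x 5,000 = 5,000,000 comparisons in total.
var slowTotal = 0;
foreach (var id in parentIds)
{
    slowTotal += rows.Where(r => r.ParentId == id).Count();
}

// Faster shape: one pass to group the rows by key, then a hash lookup per parent.
var byParent = rows
    .GroupBy(r => r.ParentId)
    .ToDictionary(g => g.Key, g => g.ToList());

var fastTotal = 0;
foreach (var id in parentIds)
{
    if (byParent.TryGetValue(id, out var children))
    {
        fastTotal += children.Count;
    }
}

// Both approaches see the same rows; only the amount of work differs.
Console.WriteLine($"{slowTotal} {fastTotal}");
```

Enumerable.ToLookup would do the same job in one call if you don't need a mutable list per key.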

Using the right tools

I have worked with some people who can answer a question like this very quickly, which we should be able to, but also with others who have any number of ad-hoc approaches. Some people just decide what they think the cause is and spend ages working through it, sometimes only to find their assumption was wrong. Others just jump between random things to try and spot something that looks out of place. Clearly neither of these is that useful. Start with an easy way of ballparking something before getting into details. We can see the number and the performance of database queries really easily in the SQL Profiler for SQL Server, which is what I started with since I suspected this was the most likely problem.

It goes without saying but being able to run the debugger locally and get a locally reproducible issue is critical. If it is only slow in production, that can be a problem since then we have to make more guesses, perhaps add logging or something, re-deploy, try again etc. The speed at which you can run up a debugger and spot the slow parts is a critical skill for a Developer.

The benchmarks

Fortunately, the problem was easy to recreate and we had a large enough account in our development database to see the problem from the debugger. 870 surveys on an account and the page took 2m05 to load and made, wait for it, 7,600 database queries. Yes indeedy!

The initial one that actually fetches all of the feature information was quite slow (around 10 seconds) but that was subsequently cached so a relatively small hit to take in the scheme of things. No query was super slow other than that, it was sheer volume. Now you might think that database connections are fast, and they should be, but how fast is fast? 5ms? At 7,600 queries x 5ms, you are still looking at 38 seconds and clearly some of these were taking a little more than 5ms. You simply cannot run that high a number of queries and get a good response. If those queries were unavoidable, you should have an architecture that queues the work in the background and allows you to access the results later.

Caching can be a code smell since caching can solve a performance problem but at a cost of RAM and a risk of stale data. Neither was too significant in this case but it was a bit strange to see only one slow query result being cached and the rest of the slow stuff being executed each time. Even if the whole thing was cached though, the initial hit would still be far too slow and would betray a poor design choice.

I had a very low bar to start with so improving it would hopefully be quite easy. On the way, I get to interpret the mindset of the original author, who no longer works for us (not because of this mind you!).

My approach

When you have such an obvious cause, it is usually easy to make an initial approach. Look for something that looks repetitive or unnecessary and either remove it or optimise it in some way. Fortunately, the SQL Profiler made it really easy to see the worst two offenders for repetitive queries.

1) For every single survey and feature combination (1000s of them) a query is made to the database to fetch the most recent response on that survey for display purposes. If you have 40 features used on a survey that is 40 requests for the same information, all in the form of a DB query. Now this query requires a “TOP 1” and an “ORDER BY” so it is not a trivial query.

2) For every single survey and feature combination, the code would make a call to the database to basically ask, “what is the lowest plan that supports this feature”. In other words, if you had NPS questions on 100 surveys, 100 identical queries would be made to return the same information.

It turns out that with 870 surveys in this example, roughly 7000 database queries are made just for these 2 culprits. If we can reduce these, we can definitely speed up the page load.

Problem 1 - Get latest response

There is one simple solution for this, which escaped the original author: load the most recent response date for all of the surveys we are processing in one query! This does 2 things. Firstly, we avoid repeating a query for the same survey maybe 10 or 20 times. Secondly, even if we weren’t repeating the query, we would reduce the round-trip overhead of, say, 3500 queries to the overhead of 1 query. The actual query work would also take less time due to the lack of repetition.

Another super useful part of using the SQL Profiler is that you can see the exact query that is being made for the original code, even if that query is generated by Entity Framework. In this case, the query was correct: a single column queried after ordering the response dates newest first and then taking the top 1. We therefore have to ask how we can effectively merge all of these into a single query, which is not always obvious (ChatGPT can be quite good at answering natural-language questions if we don’t know the right words to choose). Fortunately, I have used ROW_NUMBER() before: a Common Table Expression creates the ordering and then a select statement takes only row number 1. The query was something like this:

WITH cte AS (
    select
        surveyid,
        endedsurvey,
        row_number() over(partition by surveyid order by endedsurvey desc) as rn
    from responses
    where surveyid IN @surveyids
)
select
    surveyid,
    endedsurvey
from cte
where rn = 1

ROW_NUMBER() is clever. The “partition by” is asking, “what do you want me to split into separate groups?” and the order by is, obviously enough, the order that those groups will be in. surveyid is our partition so each row with the same survey id will be numbered separately and from newest ended date to earliest. We then get a number out, 1 for the first row, 2 for the second etc.

Outside of the CTE, we then select the columns we want but nice and easily, we can limit it only to the first row of each group or partition.

Now we can do this in Linq theoretically but just because you can doesn’t mean you should. Linq hides so much that even if you can get the correct result set, you still have to double check it is behaving properly and not, for example, pulling back all columns of a table and therefore possibly missing an index. Also, you have to be careful that the Linq expression can be translated into SQL so it can make best use of SQL Server’s strength at optimising queries and filtering rows. If not, you can return 1000s of rows (or more) before they are then filtered in memory to perhaps only 10s of results wasting a lot of bandwidth and database time.

We use a combination of Entity Framework Core and Dapper and for explicit SQL, Dapper is pretty easy to use and can automatically bind the results to an object, like EF can. I passed a list of int as the parameter @surveyIds and Dapper turns that into the correct format for IN.

Once this data was returned and put into a dictionary, instead of querying the database in every loop, we simply did a Dictionary lookup for the most recent date. Sure, the query relative to the old query is much larger since, in this case, the “IN” clause has 870 entries but we are still talking < 100ms and a one-off hit up-front.
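The change in shape, from a query inside the loop to one query up front plus dictionary lookups, can be sketched like this. It is only an illustration: FetchLatestForOne and FetchLatestForMany are hypothetical stand-ins that count round trips instead of touching a database, and the figure of 4 features per survey is made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var queryCount = 0;

// Hypothetical stand-in for the original per-item query: one call = one round trip.
DateTime FetchLatestForOne(int surveyId)
{
    queryCount++;
    return new DateTime(2024, 1, 1).AddDays(surveyId);
}

// Hypothetical stand-in for the single batched query (the ROW_NUMBER() CTE).
Dictionary<int, DateTime> FetchLatestForMany(IEnumerable<int> surveyIds)
{
    queryCount++;
    return surveyIds.ToDictionary(id => id, id => new DateTime(2024, 1, 1).AddDays(id));
}

// 870 surveys with an assumed 4 features each, mirroring the shape of the loop.
var work = (from surveyId in Enumerable.Range(1, 870)
            from feature in Enumerable.Range(1, 4)
            select (surveyId, feature)).ToList();

// Before: one round trip per survey/feature combination.
foreach (var item in work)
{
    _ = FetchLatestForOne(item.surveyId);
}
var queriesBefore = queryCount;

// After: one round trip up front, then in-memory lookups inside the loop.
queryCount = 0;
var latestBySurvey = FetchLatestForMany(work.Select(w => w.surveyId).Distinct());
var found = 0;
foreach (var item in work)
{
    if (latestBySurvey.TryGetValue(item.surveyId, out var latest))
    {
        found++;
    }
}
var queriesAfter = queryCount;

Console.WriteLine($"before: {queriesBefore}, after: {queriesAfter}");
```

The loop body still gets the same date for every combination; it just stops paying for a network round trip each time.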

Tip: Using a distinct on the surveyids list ensured I didn’t make the query any larger than necessary. There is a practical limit on the size of the IN list here: Dapper expands the list into one parameter per value, and SQL Server allows a maximum of 2,100 parameters per command.

Tip: When you have worked out your SQL, make sure you run it with the query analyzer to ensure it is not doing any work you don’t expect and it is hitting an index rather than performing a full scan.
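If the list of ids could ever grow past what a single IN clause can take, one option is to de-duplicate and then split it into batches, running the query once per batch. This is a sketch of that idea rather than something the page needed; the ids and the tiny batch size are made up so the output is easy to read, and Enumerable.Chunk needs .NET 6 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical input: survey ids collected in a loop, with repeats.
var surveyIds = new List<int> { 7, 7, 3, 9, 3, 12, 9, 7 };

// Assumed batch size, deliberately tiny for the demo. In real code you would
// pick something comfortably below the database's parameter limit.
const int maxPerQuery = 3;

// Remove duplicates first so no id is sent twice, then split into batches.
List<int[]> batches = surveyIds
    .Distinct()
    .Chunk(maxPerQuery)
    .ToList();

foreach (var batch in batches)
{
    // Each batch would be passed as the IN parameter of one query.
    Console.WriteLine(string.Join(",", batch));
}
```

The results of each batch can then be merged into the same dictionary before the main loop starts.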

This first change reduced the number of database calls to around 4000 (a reduction of nearly half) and halved the execution time from 2 minutes to 1 minute. Still far too long but much better than before and leaving me to start considering problem 2.

Problem 2 - Get plan for feature

The idea of this query is to say, if they are using feature X, what is the lowest plan that supports that feature? Again, the author naively treated it as a simple loop, calling the same thing multiple times for a total of around 3500 queries. Again, it would be trivial to make a single query to get all of the potential feature-to-plan mappings and then to access these as a dictionary during the loop.

I could see this being called in the Profiler and did a little digging to find out where I thought it was being called from. A top tip is to always prove your assumption. Even in this simple case there were multiple overloads of the method and I could have changed one that wasn’t being called! Using the EF extension method TagWith("text"), you can add a comment to the top of the SQL that EF produces, which is an easy way to confirm in the profiler that the code you think is being called really is.

I say trivial but this is, again, where you have to understand what is already happening and try and understand why. The code looks like it was intended to return all plans that the feature is supported on although the query contains a GROUP BY and a First() so it appears that they are only getting the lowest plan and not all plans. Similar code is used in another application so I suspect that the same types are being used even though in this case, it only wants one plan result, not all of them.

I had to do some refactoring here because these were called from a lower loop and therefore I needed to inject the master list into the method in order to avoid that method making its own queries. Again, I didn’t want to make a breaking change so I simply created a new method that was similar but different.

In a similar way to problem 1, this was a CTE using ROW_NUMBER() but it was slightly more complicated!

I already mentioned that the existing type for the results was not flat but had a property that had a list of “plans” instead of a single plan. The result from the query only contained a single plan so how could I project this into my object? Well, I learned that Dapper supports multi-mapping and has an overload that takes two input types and an output type and allows you to use a lambda to set it. This is the code:

// Dapper multi-mapping: each row is split into a FeatureResult and a Plan at the PlanTitle column.
var results = contextAccessor.Query<FeatureResult, Plan, FeatureResult>("The query CTE similar to before",
    (result, plan) =>
    {
        // Create the list on first use, then attach the plan from this row.
        result.PlanList ??= new List<Plan>();
        result.PlanList.Add(plan);
        return result;
    },
    splitOn: "PlanTitle")
    .ToDictionary(result => result.PropertyName, result => result);

If we were returning multiple rows with the same FeatureResult values then the lambda would be invoked for each row, allowing you to populate the list of plans with each iteration (bear in mind that Dapper builds a new FeatureResult for every row, so to accumulate plans you would have to look up the existing object by key inside the lambda). How does it know which columns are FeatureResults and which are Plans? Using the splitOn parameter, we tell it the first column of the next/2nd type (you can have more than 2). In our case there was only one row per FeatureResult so this was only called once and we could have changed the code slightly (PlanList would always be null) but this is fairly belt-and-braces and allows for returning more plans if we needed to.

Replacing the loop query with the up-front data load reduced the number of calls from 4000 down to 450 and the page load down to 12 seconds. Tonnes better than we started with but hopefully I could make it even better.

Problem 3 - Getting translations

This problem looked much smaller than the others but was completely exposed once the other two problems were fixed by the simple use of eager-loading!

In the application, when you try and open a survey that has premium features after downgrading, you get a message telling you that you cannot without upgrading. This feature shows the actual question text if the feature is question-related, in order for the customer to find which question it is really easily. In a quirk of our system, if you use Net Promoter questions, we don’t store the question text in the normal way but simply store an index which relates to “Company”, “Product” etc. and which allows it to build the text consistently and with an optional translation for that question in another language.

Due to this complication, the original author wrote something that was functionally correct but performed terribly when used in a loop. It would hit all NPS questions at which point it would query the survey to find its translation id and then it would load the translation for that question and populate the question text. In other words, 200 NPS questions would be 400 database queries.

At this stage, changing it would break it and eagerly loading it seemed a bit too niche to take the hit on so I did something that worked for my use-case: I created another overload of the method that doesn’t make any database calls but instead simply puts “NPS” and the question id as the question text.

There is often a problem with reusing an existing solution for a different use-case: something that takes a few seconds might be acceptable in a single call, but calling the same thing in a loop hundreds of times simply multiplies that latency and makes it unusable. Also, this listing doesn’t even display the question text but it is still generated as it is needed by the original use-case.

This was a simple change that didn’t require a new DB query and removing these calls reduced the number of DB calls down to a paltry 5 from 7600. Once the initial 5 second cache hit was taken, page load and paging was now 2 seconds, down from 2m05!

Should I do anything else?

At some point, the returns start to diminish and you need to ask whether what you have done is reasonable as it is or whether you should make a bit more effort. At this stage, there were hardly any database calls and what was there was pretty much needed for the code the way it is currently written. I don’t need to win brownie points or Employee of the Month. I think it is now more than usable, but it does raise some questions that the original author should have considered early on. It is shocking that this ever went live in the state it was in. I would rather have not released it at all if the page load was taking even 20 seconds, let alone 2 minutes+.

This page is a listing control that defaults to 10 items (although that should probably be more), and the code that renders the listing is designed to use IQueryable. That is the point of IQueryable: we can further reduce an un-materialised query to a specific page and save the expense of loading everything else. However, in this case, the author did not know how to do this (or didn’t know there was a problem). He simply materialised the entire thing and then returned it as IQueryable, so although the paging was there, it was paging an object that was materialised every single time you changed page! This code is more complicated than a simple listing, but at a minimum the fully materialised list could have been cached for, say, 5 minutes to at least allow paging to be quick. Ideally, they could have refactored the code to work out what each page represents and how to query those objects using IQueryable, and then run the main queries only to populate that page, but this wasn’t happening.
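Here is a rough sketch of the difference between paging a fully built list and paging first. It uses LINQ to Objects rather than a real database, and BuildRow is a hypothetical stand-in for the expensive per-survey feature checks; the counter simply shows how many rows get built to display one page of 10.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var expensiveCalls = 0;

// Hypothetical stand-in for the expensive work done per survey.
string BuildRow(int surveyId)
{
    expensiveCalls++;
    return $"survey {surveyId}";
}

// A cheap query over the ids only; nothing expensive has run yet.
IQueryable<int> surveyIds = Enumerable.Range(1, 870).AsQueryable();
const int pageSize = 10;
const int pageIndex = 3; // zero-based, so surveys 31 to 40

// What the page did: build every row, then page the finished list.
var allRows = surveyIds.AsEnumerable().Select(BuildRow).ToList();
var pageFromFullList = allRows.AsQueryable()
    .Skip(pageIndex * pageSize)
    .Take(pageSize)
    .ToList();
var callsWhenMaterialisedFirst = expensiveCalls;

// Alternative: page the cheap id query first, then build only those rows.
expensiveCalls = 0;
var pageIds = surveyIds.Skip(pageIndex * pageSize).Take(pageSize).ToList();
var pageBuiltLate = pageIds.Select(BuildRow).ToList();
var callsWhenPagedFirst = expensiveCalls;

Console.WriteLine($"{callsWhenMaterialisedFirst} vs {callsWhenPagedFirst}");
```

Both versions produce the same 10 rows; the second one just does 10 rows' worth of work to get them.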

Like I mentioned before, caching is a code smell except perhaps in very high traffic sites, so that sort of demonstrated a lack of understanding of what was happening and how to do it properly. Even a stored procedure can be parameterised in some way to reduce the dataset being worked on.

A more fundamental mistake is in the whole reason for the page in the first place. The original feature checking code was designed to warn the customer about features that would be lost on downgrade and show them which features might prevent a survey being reopened on a lower plan. That is all fine. However, when this page was added to Admin, the intent was to provide information for Sales Managers when dealing with downgrade requests. “Can I downgrade from Enterprise to Business to save £3000 per year?” “Yes you can but you would lose the Blah feature which you have used on 50 surveys, is that OK?”

So if this was the case, the implementation which splits out every single premium feature on every survey is too detailed for that use-case. A detailed view might be helpful if pushed but the basic information would be a listing that says Feature X used on 50 surveys, Feature Y on 70 surveys etc. which would be much easier to communicate to a customer but this was not considered and what was produced was therefore not only too slow on large accounts but was not a very useful feature when it was finally released. In fact, I know it wasn’t used much when I found a feature that caused it to crash since that feature was not on any plan by default and the code assumed every feature had a plan and fell over!

The page already has an Excel export, which would perhaps have been more useful for the detailed view, leaving the page to show a summary.

What did I learn?

Even Senior Engineers sometimes produce garbage and don’t ask for help to understand why.
A couple of hours and some basic tools and you can make something terrible into something good without much risk!

Dotnet core not binding a json request to an action’s model

2024-05-07T19:08:00+00:00

Background

You know when you do something that you are sure should just work out-of-the-box but it doesn’t and you scratch your head?

I was trying to post a json request from a Javascript file to a dotnet core MVC action and expected it to bind to the model automatically. It didn’t and I didn’t know why!

Problem 1 - [FromBody]

Prior to, I think, dotnet 6, json model binding would work out of the box with nothing special. Set the content type, make sure the model matches and boom! However, this was changed “for security reasons” so that now you need to explicitly state that the model is [FromBody] so that it will check the content type and invoke an appropriate binder (Json is configured to work by default).

It is also not possible to pass parameters by body and also by query, the model binder won’t like it and the query parameters will not be set.

Problem 2 - The whole request needs to be valid Json

This is where I tripped up. I had a model like this:

public class MyModel
{
    public int Id { get; set; }
    public List<Response> Responses { get; set; }
}

public class Response
{
    public int Qid { get; set; }
    public int Oid { get; set; }
}

As far as I could see, the json was being created in Javascript (I used the debugger; keyword to step into the code in the browser) but it wasn’t working. I wondered if it was case-sensitive so I changed the properties to be lowercase but no dice. The model was always null.

Logging to the rescue! Dotnet core logging is probably the one biggest improvement over netfx, which had almost no logging. I set the level to debug and ran the code again: “Model binding failed due to invalid Json” or something like that.

I had mistakenly used a string representation of the Responses and passed it to JSON.stringify() which effectively double-encoded it and meant it didn’t work. Even though it was only the Responses property that was broken, this meant the entirety of the json request was not valid and none of the properties would bind correctly!

You also get no errors, just a null model.
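The double-encoding mistake is easy to reproduce outside the browser. This sketch uses System.Text.Json in place of JSON.stringify() and approximates List<Response> with a list of dictionaries so that it needs no extra types; CanBindResponses is a hypothetical helper, not what the MVC model binder actually calls. The wrongly built payload still parses, but the responses property has become a string, so it can no longer be read as a list.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical stand-in for binding the Responses property.
bool CanBindResponses(string json)
{
    using var doc = JsonDocument.Parse(json);
    var element = doc.RootElement.GetProperty("responses");
    try
    {
        // Works for a JSON array of objects; throws for a JSON string.
        return element.Deserialize<List<Dictionary<string, int>>>() is not null;
    }
    catch (JsonException)
    {
        return false;
    }
}

var responses = new[] { new { qid = 10, oid = 20 } };

// Correct: the array is serialised once, as part of the outer object.
var goodJson = JsonSerializer.Serialize(new { id = 1, responses });

// Mistake: the array is turned into a string first, then serialised again
// inside the outer object, so "responses" becomes one long quoted string.
var badJson = JsonSerializer.Serialize(
    new { id = 1, responses = JsonSerializer.Serialize(responses) });

Console.WriteLine(goodJson);
Console.WriteLine(badJson);
Console.WriteLine(CanBindResponses(goodJson));
Console.WriteLine(CanBindResponses(badJson));
```

Printing the two payloads side by side makes the stray quotes around the array obvious, which is a quick check to do in the browser console too.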

So there you go! Make sure your Json is valid, make sure you use [FromBody] and make sure you use dotnet core logging!

NPM in Docker is a dumpster fire

2024-04-24T11:43:00+00:00

Background

Trying to run a gulpfile via node/package.json inside a Dockerfile that is building a dotnet 8 application.

Easy right? Nope.

Ubuntu Jammy 22.04 + aspnet 8

This is the base image for the build/deployment (the sdk image has build tools but not relevant here). Of course, it doesn’t include node or npm out of the box because why would it? It includes the minimal installation to build and run a dotnet core app.

Option 1 = apt-get install node npm

So this works easily enough and installs these packages and although npm installs an astonishing amount of dependencies, they were only needed for building and I might be able to worry about that later. The problem, as you might expect, is that the package versions are very old. Node 12 in fact, which was EOL in April 2022. Not surprising since the LTS version of Ubuntu was released in April 2022 and LTS don’t like updating major versions of things which cause breaking changes. Which, of course, is generally OK except node isn’t like everything else. The release cycle is very short: 18 months active followed by 18 months of maintenance. Anyway, I can moan about these old versions but I can’t really do anything about it.

Ran the rest of the Dockerfile including yarn install and I get an error, basically, one of the npm packages requires a newer version of node (v14, also EOL a long time ago but not installed by Ubuntu) so this option is no good. Hmm.

Option 2 = nvm

nvm seems great but only because node is so bad at handling multiple versions. It is an open source utility that can download and install multiple versions of node side-by-side and you can then select the one you want to use either now or for all future invocations of node. Seems to be a good candidate for updating my node versions.

Except: Docker!

Run the install script in Docker and that works but then you get “nvm command not found”. Why? Because Linux and its weird way of updating paths etc. nvm is not actually a binary but a script pretending to be a binary, which is fine if you are using it interactively. Why? Because it adds some stuff to ~/.bashrc that will make it visible to your shell. If you look at the docs, you can either logout and login again for nvm to appear or you can even source your .bashrc file to reload it without the logout. OK.

RUN <nvminstall script> && source /root/.bashrc

source: not found

Trying to get source to work

As I keep going further down this path, it feels like I am taking a car apart to do something that should be easy, like changing the stereo unit. There are various suggestions to try and get around this issue. What is happening is that Docker is using /bin/sh as the default shell, which is not bash and doesn’t, therefore, support the source command.

Option 1 - Re-link the shell executable

One particularly hacky solution, related to Docker, was simply to remove /bin/sh and then create a symlink from /bin/sh to /bin/bash. Someone else said this was a terrible idea since other scripts would expect /bin/sh to be POSIX compliant and since bash is not, this is likely to cause other problems.

Anyway, this didn’t work

Option 2 - Set the SHELL

You can try things like SHELL ["/bin/bash","-c"] to set the shell for all of the rest of the script. This has a similar effect to the previous hack and ideally you should only set this for the commands that need it and then change it back to SHELL ["/bin/sh","-c"]

Also didn’t work. nvm command not found

Then I read that the whole .bashrc thing is related to an interactive login, which is what you would get with something like ssh but not inside Docker. If you do not use a login shell, .bashrc is not read so nvm never gets set up, but you can pass “-l” in the SHELL command to make bash start as a login shell.

Still not working. It now gets past the NVM section without error but then npm install fails again because “you need version 14 and I found version 12”. Just like nvm hadn’t run at all even though I could see the Docker build log downloading and installing a newer version of node.

Maybe I need to update npm too?

There is a symbiotic relationship between node and npm even though they are supposedly independent. If you download node, you usually get a specific version of npm included and although this is probably just to derisk it, it does create a lot of confusion. Can I use a really old version of npm with a new version of node?

So then another rabbit hole. You can use nvm install-latest-npm which supposedly finds the latest version of npm supported by the currently selected version of node. Sounds good in theory, except you need npm in order to get the latest npm, which wouldn’t be so bad, since I had installed it from apt, remember, but then came the next annoying error: “cannot determine npm version” or something like that. It was like the whole nvm world was separate from the normal world and they were not talking to each other. Of course, I had installed npm as part of the same build process so maybe it was a similar issue to source where the nvm process cannot yet see/read the apt npm version.

Instead, I thought I could use npm to update itself instead of using nvm. It should be able to find itself? Tried that too: “newer version of node required. Require v16, found v12”. npm doesn’t seem to have a method to say, “download the latest one for the node version I have”, although since the apt node was so old, maybe it was already the latest npm version it could handle and npm wasn’t seeing the downloaded node v21 from nvm for whatever reason.

Eventually I got to the end of the build layer

Initially, I didn’t even know if everything was actually working (some simple scss compilation). I just wanted to get the tools running so that I could debug. I thought that if I could get to the end of the build, I would have enough to start working out what was or wasn’t working correctly.

I had ended up setting the bash shell right at the beginning of the build layer since it seemed like each change of shell lost visibility of the previous changes so keeping everything inside the bash shell made it look like it was working.

Except not! A really strange thing was happening on the last step, which was unchanged from before, building the csproj file using dotnet cli. When running this command, it spat out all kinds of random errors related to files that were in the repository but are not referenced in my project (or solution). They are in the repo because they are part of a submodule. How on earth could this be right? I wasted another few hours scratching my head and getting frustrated.

Going back to the start

It’s sad when something is such a mess that you spend a whole day on it but then have to go back to the beginning. If I build the original Dockerfile it’s fine. So let us try and add things in one-at-a-time and see where it goes wrong. We don’t need to run any gulpfiles or whatever, let’s just try and get it building.

So that is what I did. The most likely problem was changing the shell to bash so I decided that I had to work out how to get things working without using source since that was the only thing that was needing the bash shell.

Using some of the techniques listed later, I found what nvm downloaded and added to .bashrc and realised that the easiest (but most horrible) way of working around the issue is to prefix every command that needs node with the code that loads the nvm script, so a line would look like this: RUN . "/root/.nvm/nvm.sh" && nvm install 21.7.3

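It was weird that I needed to add that each time I ran something like yarn install, but each RUN instruction starts a fresh shell, so nothing loaded into the previous one carries over.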

Anyway, after far too long, something that should have been much easier than it was ended up being a big learning experience. I already hate most things about node and npm and this was just more ammunition.

Debugging tips

Apart from the usual tips about using layers and cache correctly and understanding enough about Docker and (in this case) Ubuntu so that you know roughly what is happening, here are a couple of other things.

Debug the state of your container

If you are seeing that commands are failing etc and you don’t know why, get your build to complete to a certain point without running the failing command. You can also use --target to target a particular stage in the Dockerfile to avoid building everything. Once that image has built, if you try and run it from Docker Desktop, it will immediately exit unless it has a CMD/ENTRYPOINT but if you get its hash from the image list in Docker Desktop, you can run it like docker run --rm -it <image hash> bash and get a container that you can go into and then use commands like ls, or even run the failing commands, to see what output you get. Sometimes, errors are simply because you got a working directory or copy command wrong, others are a bit more involved.

If you can get the commands to work in your debug container, you can then update your Dockerfile and potentially build to the next error.

Build your own base images

This is a bit more work but if a lot of your apps are based on e.g. Ubuntu + node 21 then you might as well create a base image and start with that. It will save having to solve the same problems in multiple places as versions change and applications update.

Consider your own apt cache server

I was having such random performance problems with apt-get (>30 minutes to update and upgrade) that I used debmirror to set up my own apt cache for Ubuntu 22.04, which only takes around 380GB of space, but you can manage your own performance then and not have those days where the archives are as slow as mud.

The “you’re over complicating it” meme

2024-04-07T09:10:00+00:00

This Thomas Millar article from 2023 appeared on Hacker News today. It is a bit of a meme in our modern day “Cloud” times that clever people enjoy pointing out that everyone who uses the cloud is over-complicating it and the obvious solution is for people to host bare-metal. It is a tiring meme and although I don’t think I will add much comment to it that is not already in the Hacker News comments, I find debunking these things good practice for critical thought and for writing practice.

The Summary

By all means read the article but it goes something like this:

You could host a “top 1000” web site using 11MB/s bandwidth on a single bare-metal server/VM
Most of you are not top 1000 web sites
Therefore if you are using any kind of cloud infra, you are doing it wrong
And you are probably paying a premium

My rebuttal is:

You mis-represent the top 1000 site’s needs
You avoid various things like RPO, RTO and availability
You do not understand the non-theoretical needs of many sites that are not top 1000 but still can’t use your model
Many of us have hybrid systems that use cloud and bare-metal. We do know why we chose that architecture

The zero-sum fallacy

The article starts from a strange premise. “if you’re getting something for a steal, someone is getting screwed”. Without defining “steal”, it is certainly not a zero-sum world and it is possible to get something for a reasonable price without paying elsewhere. This slightly strange intro then leads into the point he makes “Amazon makes the majority of its profits from AWS, and yet so many people talk about how the cloud is so cheap…”. Do they? I don’t remember anyone saying that the cloud is cheap. It can be VERY convenient, especially for early stage companies who don’t want the cap-ex or maintenance of a production system until they are sure of their product/market fit but cost alone has never been a reason.

That said, the implication that just because Amazon makes billions of dollars of profit from AWS means we are all over-paying is, again, a misunderstanding of how the market works. AWS has over 1M active users according to this article. Now out of the $500B in revenue (estimated from recent articles), it is making around $20B profit, which is not very much of its revenue (4%), and when you consider that 10 companies make up a significantly large amount of AWS revenue (Netflix, Twitch, LinkedIn etc) then how much are we really over-paying? Amazon have a large ability to leverage volume discounts on hardware and are giving jobs to the people who work at those Data Centres, who would otherwise have to be employed by us.

Anyway, I don’t think this is a great foundation for the article which is more about complexity and performance than about whether AWS are taking us all for a ride. Also, other cloud providers are available!

The strawman website

He then chooses Business Insider as an example of a “top 1000” web site to base his starting numbers on. 200M visits a month, 2 pages per visit = 400M HTML documents per month. 75KB compressed HTML = 30TB of bandwidth.
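The arithmetic behind those starting numbers is easy to check. This is only a rough sketch: the 30-day month and the decimal units (1KB = 1,000 bytes) are my assumptions, not something the article states.

```python
# Rough check of the article's starting figures.
# Assumptions (not from the article): 30-day month, decimal units.
VISITS_PER_MONTH = 200_000_000
PAGES_PER_VISIT = 2
HTML_BYTES_PER_PAGE = 75_000  # 75KB of compressed HTML
SECONDS_PER_MONTH = 30 * 24 * 60 * 60

documents = VISITS_PER_MONTH * PAGES_PER_VISIT
monthly_bytes = documents * HTML_BYTES_PER_PAGE
average_mb_per_second = monthly_bytes / SECONDS_PER_MONTH / 1_000_000

print(f"{documents / 1e6:.0f}M documents per month")
print(f"{monthly_bytes / 1e12:.0f}TB per month")
print(f"{average_mb_per_second:.1f}MB/s on average")
```

That lands on roughly the 11MB/s quoted in the summary, but only as a flat average spread over every second of the month.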

The thing that is a little weird about this is that you just have to visit Business Insider to see how much of a mis-representation this is. Like a lot of sites, BI is loading things asynchronously over time and uses lazy loading as you scroll down the page so the first impression sizes are not accurate at all. I just clicked through a link there and these are the stats:

1873 requests (from one page!)
72 of those are documents
52MB transferred, uncompressed to 80MB
Largest load in this page (which is not the page itself) is 249KB
16 CSS files, 209 scripts, 6 fonts and 413 images

Now, these memes would normally shy away from using such a site as an example because the purists would say that this site is garbage and should be optimised. That might be true to an extent but I will defend this site and its right to be like this.

Business Insider (or Insider Inc the company) have 550 employees according to Google and presumably it is a heavy user of advertising which means also a heavy user of metrics to track effectiveness of placement, click-throughs, engagement etc. Now wouldn’t it be nice if all of that just came from a single magic library in 10KB of Javascript but it doesn’t. It comes from multiple libraries, some of which might be more useful than others. Some will be large and some smaller. Some make loads of back-end requests asynchronously and God-forbid, some of them have been forgotten about but no-one knows enough to work out whether to remove them, or, like on the Railways, you leave them there forever in case removing them breaks something important.

So if we look at 400M docs per month, even if we could offload all of the static content to a CDN, we are still talking more like 80TB per month, which is 30MB/s, and remember, this is average, not extreme. The average might be 2 pages per visit, but what about the few thousand people that click into lots of pages? Maybe they have it open all day because they use it in their job. You can’t just take 2 pages and 200M visits and create a figure that is accurate.

He also says that 75KB of HTML is a lot, which is laughable. Maybe in his neat niche world where he is still at a stage where he can easily control his sites, this is true but you will struggle to find many pages on the web less than this. Even the Google homepage, the common example of slick content, is 76KB compressed so maybe if Thomas is so clever, he should go and work at Google and get their page size down because it is “A LOT”.

The timing fallacy

As many quickly pointed out, as well as the average in the article likely being far below what it actually is, the truth is that traffic is unlikely to come in consistently across the day. Business Insider is largely US material so that means probably (I don’t know), a lot of the total traffic will concentrate during the day between EST and Pacific time. What does that mean in reality? Probably much less traffic at weekends and overnight, meaning what? Maybe 60MB/s AVERAGE during the busy hours and potentially much higher than that if they have just released an important story.
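To see how quickly a flat average stops being useful, here is a small sketch. The 80TB per month is my estimate from above. The share of traffic landing in the busy window and the length of that window are made-up illustrative values, as is the 30-day month.

```python
# How a monthly total turns into a busy-hours rate.
# busy_share and busy_hours_per_day are hypothetical, for illustration only.
MONTHLY_BYTES = 80e12  # the 80TB per month estimated earlier in this post
DAYS_PER_MONTH = 30    # assumption


def average_rate_mb_s(total_bytes: float, seconds: float) -> float:
    """Average transfer rate in MB/s (decimal megabytes)."""
    return total_bytes / seconds / 1_000_000


flat_rate = average_rate_mb_s(MONTHLY_BYTES, DAYS_PER_MONTH * 24 * 3600)

busy_share = 0.8         # hypothetical: 80% of traffic arrives in the busy window
busy_hours_per_day = 10  # hypothetical: US working hours across time zones
busy_rate = average_rate_mb_s(
    MONTHLY_BYTES * busy_share,
    DAYS_PER_MONTH * busy_hours_per_day * 3600,
)

print(f"Flat average:       {flat_rate:.1f}MB/s")
print(f"Busy-hours average: {busy_rate:.1f}MB/s")
```

Even this is still an average over the busy window. A big story breaking would sit on top of it.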

I work at SmartSurvey and our surveys are used by 1000s of respondents across 1000s of customers. You might assume, therefore, that we would have a good spread of traffic but we don’t. Some days are quiet (< 1000 on a survey) and other times we have peaks usually > 10K people on a survey. Evenings and nights are quiet (we mainly target UK customers) and weekends also that way.

Each of these objections can be brushed away, “yeah, these are only rough figures” but the problem with these types of articles is that each of these inaccurate assumptions is piled on top of the others so the error margin is multiplied at each stage until it isn’t just a strawman, it is a strawman made of strawmen!
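As a sketch of how the errors multiply, take the rough figures from this post: the article’s 11MB/s, my 30MB/s once real page weight is considered and my guess of 60MB/s once traffic is concentrated into busy hours. None of these are measured values.

```python
import math

# Rough estimates from this post, in MB/s. None of them are measurements.
article_estimate = 11
with_real_page_weight = 30
with_busy_hours = 60

# Each stage is off by some factor and the factors multiply rather than add.
stage_factors = [
    with_real_page_weight / article_estimate,
    with_busy_hours / with_real_page_weight,
]
combined_factor = math.prod(stage_factors)

for factor in stage_factors:
    print(f"Stage is off by a factor of {factor:.1f}")
print(f"Combined, the estimate is off by a factor of {combined_factor:.1f}")
```

Two stages that each look like a forgivable rounding of reality already leave the starting figure several times too small, and that is before any spikes.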

The refactoring fallacy

He also makes another common and naive assumption about the way that larger organisations work. “Reduce the HTML size..with a CDN serving your JS, CSS and images…”. Again, great for him if that is possible but things are not so straight-forward in the real world. Again, at SmartSurvey, we host a number of apps. Some of these are served by Windows Servers and we also use a CDN and some microservices. Can we move all of our scripts to a CDN? nope. They are generated by various sources and might be built as part of our build process. Is it worth investing the time into pushing them to CDN? Well some already are but in other cases not worth it. Why? Because we already pay for these servers, they are bare-metal so unlike the cloud, I can’t hire a cheaper model by moving some of the requests off of them and they are not struggling to keep up. Real business is pragmatic. Spend your time on things that matter.

Could we do it if we were starting from scratch? Possibly but, again, within a year or two most of us are held to a certain amount of restriction by the choices we make, the libraries we use etc (and the things that Marketing tell us we have to embed even though us Developers don’t like them).

And we are a 5th the size of Business Insider and much smaller than other companies who run top-1000 web sites so if it is hard or not worth it for us, is it really worth it for them?

The lazy conclusion

So then he says that newer hardware can easily handle the badly estimated 11MB/s of traffic. Except that’s not all the traffic. We already talked about the fact that the average load was probably significantly higher and the traffic probably spikes significantly higher than that.

These articles I read like to pull some headline figure of a well-configured server and quote it, which is a simplified view of what you are trying to do. What about when your server OS blips or your network has a funny moment and all of those requests back up? What happens when your network card chokes and the errors cascade?

He then makes an even more ridiculous claim that since new servers are now very powerful, “You can serve the world from a single box”. What utter nonsense. You can’t even serve a single Netflix movie across the world with a single box, which is why Netflix don’t have a single box.

“Why do we need Docker, serverless, horizontal scalability again?”. Because you don’t understand all of the reasons people use these, clearly, otherwise you wouldn’t ask this question.

Let us just take Docker as an example. Yes, it provides the means to microservice deployments and yes, microservices are sometimes used where they are not needed but Docker is not just used for microservices. It is useful for setting up Dev environments since the dependencies are in images, not in the host machine. It makes setting up build agents much quicker for the same reason. Getting a set of working applications/services is much easier than trying to set them all up to run manually. I know because I have done it both ways. I know that pre-containerised workloads was very much a lot of time invested in making my machines work like pets. All of the time wasted setting up new machines and servers because something wasn’t documented. We use a microservice for our email sending. It centralises sending from multiple applications into a single place, which is much easier to manage, and it runs in Azure Kubernetes Service for a few hundred pounds per month. Ironically much cheaper than our on-prem services cost to rent. Yes, I did run my own Kubernetes Cluster for a while and that was far too much work and cost much more than AKS in time. I also looked at other options but the simple truth is that this works for us and yes Thomas, we do know there are other ways to do things.

The missing middle

The article also assumes that businesses run on a single web application. Easy to optimise that, right? Well, no it isn’t, but it is easier than the reality that most businesses have 10s, 100s or even 1000s of individual systems and some of them do have to talk to each other. If I put one of my systems onto bare-metal instead of a managed service of some kind, what about all the others? Do the same thing? Do you know what it’s like to do Windows and Linux updates each month on bare-metal servers?

As well as the added complexity that isn’t really acknowledged in the article at all, the other issue with the complexity is that the time that I could be optimising my simple front-end web site is the time that I am not spending on the other much more difficult but important back-end systems, monitoring, server management etc.

The funky Edge data

He seems to dive straight into an objection related to the latency at the edge when running a single beefy bare-metal box from a single origin but this feels awkward because although people do use the cloud for latency reasons, this isn’t an objection that he has previously raised. He already talks about using a CDN so depending on his definition of “cloud”, which he also didn’t define, then he is already using the cloud (the CDN) to get good performance with his bare-metal box. Confused? I was.

Perhaps he is only berating abstracted cloud services, not cloud per-se or maybe CDN providers are good but if they are also cloud providers, they are evil. It is also possible that this article is assuming that people are using SPAs, which although common doesn’t really fit into the previous description of Business Insider serving HTML documents. Also, SPAs are usually served by CDN so not a particularly over-engineered solution warranting this article.

The next paragraph says [assuming your JS, CSS and media coming from CDN], “is if you can shave 300ms off your server processing…you’ve effectively moved your server across the world”. In other words, it sounds like the counter of the argument “we use cloud providers because we can use the edge to reduce latency”. This is another example of a strawman in this article. People use CDNs to reduce the latency of those types of files, perhaps, but we all know that there is still a need to make a call (depending on your app) to the origin server. We all know that this latency is important but the CDN is also to reduce the number of concurrent connections that might be made to the origin server and to remove the need to send cookies. All just good practice, not all just with the desire to reach Australia and appear like we are running locally. If I wanted to do that, I would create a more complex architecture so we could have origin servers in Australia. But still, he mentions the CDN and it is still a cloud product, so I am still not entirely clear what he is arguing against.

The rest of this section seems to double-down on the fallacy that we use edge services, like CDNs, to remove all latency and we are obviously wrong. “Nobody ever thinks about the database” is another ridiculous statement. That is mostly what we think about at SmartSurvey because it is the largest “Singlish point of failure” (not quite single because we use a failover but still a good candidate for a global outage).

Even more weirdly, he then defends his unclear argument by saying that you could get most of the benefits you would get from the edge you could get from Cloudfront! Yes, A CDN by a cloud provider (the one he started the article saying was ripping you off) and, YES, a system that uses edge locations to reduce latency.

Then he flips right back to reducing the international 300ms latency by using SQLite instead of a relational database server. Now, the 300ms latency was introduced as the realistic time to get a request across the world, not the time it takes to load a server-rendered page but now we are using it to mean “general slowness of the application”. Again: VERY confusing what point he is making. The best I can work out is that “cloud is bad, therefore managed postgres (for example), is bad, therefore just use SQLite” but where did that come from? SQLite hasn’t even been mentioned before and all of a sudden, this is a solution to people using cloud? Why not just suggest a locally managed database instance? On both our on-prem SQL Server and our Cloud VM hosted postgresql cluster, we get single-digit ms access times so not really related to anything significant of the 300ms application latency or anything related to global latency.

Ah yes, the cost

Cost is a very common red-herring when people write this kind of meme posts. To people who lack experience, something like $500/month for an AKS cluster is extortionate. You could rent a VM for only $10/month and even if you wanted 6 of them to make a redundant cluster, it would only be $60/month you fools!

When you turn over £4M/year, an extra $400+ per month is nothing. Literally a rounding error on the accounts. Sure, we could save $400/month but at what cost? My time is probably worth $1000/day so who is being the fool? I did try once to manage a Rancher cluster and it was a nightmare. The docs were all over the place since things changed between versions, there were multiple types of cluster but it was never clear which was which and upgrading a major version of Kubernetes, which is point-and-click in AKS, required me to get our Data Centre provider to write some scripts which I had to run manually on the VMs. Of course it is doable but again, it becomes a pet. I probably spent more than $400 a month just keeping the thing running, not to mention regular node draining for Linux updates. Sure, maybe only fools use microservices (or Kubernetes) but I prefer Kubernetes to restart failed pods than being woken at 2 in the morning because our email system has stopped working.
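The break-even is quick to work out. A rough sketch, using the $400/month saving and my $1000/day figure from above. The 8-hour working day is an assumption added for illustration.

```python
# Break-even between a managed cluster and running it yourself.
MONTHLY_SAVING_USD = 400   # what self-hosting might save per month
DAY_RATE_USD = 1000        # rough value of a day of my time
HOURS_PER_WORKING_DAY = 8  # assumption

break_even_days = MONTHLY_SAVING_USD / DAY_RATE_USD
break_even_hours = break_even_days * HOURS_PER_WORKING_DAY

print(f"Break-even: {break_even_days:.1f} days ({break_even_hours:.1f} hours) per month")
```

So if the self-managed cluster eats more than a morning a month, the saving has already gone. One awkward Kubernetes upgrade uses that up on its own.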

Maybe we should have a 24/7 support team to take care of that? Yep, that’s at least another $100K/year and that’s on top of the recruitment time, replacing people who leave or paying the premiums that you don’t want to pay AWS to a Contract Support Business instead?

In a familiar vein, Thomas then uses Hetzner.com (a cloud provider) to demonstrate why AWS is expensive. Is this whole article just a rant about AWS? Or are you trying to dress it up like a cleverly reasoned article with its very selective choice of comparisons? So the providers charge different amounts? Doesn’t sound unreasonable. Every business in the world decides on their pricing model. AWS gives away free tiers and then charges. Some others do, some don’t. SmartSurvey has unlimited responses on all business and enterprise plans, Survey Monkey, instead, gives you all the features for free and charges for responses. That’s OK, it’s up to us and it’s up to them. If you don’t want to use AWS because they are a rip-off, there are others you can choose but you pay your money and take your choice. Do you want AWS’s more numerous data centre locations? Or OVH’s EU based business or Hetzner’s prices? Great.

He also mentions “edge cloud vendors” whatever they are. Vercel charges $200/TB once you pass the free tier. Great, let them charge what they want. If it doesn’t work for you, don’t use them but implying that people who choose the cloud are stupid because some companies charge more than you think is reasonable is disingenuous at best and ignorant at worst.

The reality of the situation

Never trust someone who makes a statement about the “reality” when talking about such a large topic as this. The reality is simply that providers provide services that work for some people at a price they are prepared to pay and not for others. That is fine.

Do some people over-engineer? Of course they do, just like some people always want the latest version of Vue JS, it doesn’t follow though that everyone who uses the latest version of Vue JS is over-engineering or stupid or following a band wagon or whatever.

Saying that you can probably run on a single server is, again, naive at best, and a gross simplification at worst. If you want to run your system on a single server, great. Do it and good luck but please don’t tell me that I can or should.

“Use SQLite locally on the box, you don’t need a managed database”. These are not the only options, this is the false dilemma fallacy. You could use SQL Server or postgresql on the same box, or other boxes nearby, or in your own data centre. You could run it on VMs in a cloud provider or something else. This statement is neither necessary for the overall post (whatever that is) nor even true. He also throws in, “Use Litestream to back it up”. Why? Are you trying to sell Litestream? Why only that option? Why not, “make sure it’s backed up”?

“Your CI can just SCP your code to the box”. How do you know? That’s not how .Net Framework works. Nginx supports zero-downtime. That’s great but not everyone uses Nginx or can use Nginx or wants to use Nginx.

“Don’t bother with docker and virtualization and all that nonsense”. Already mentioned this. The author doesn’t seem to understand all of the reasons why people use docker and virtualization.

“They say you’ll need to scale…” Who’re they? That doesn’t seem relevant at all to this discussion. I can scale with on-prem servers and with cloud services, just another irrelevant sound-bite thrown in to make the article sound more authoritative.

“If you really care about latency, throw a server in Germany, and one in California, route your writes to your primary, use your local read replica for reads.”. So what? Do what everyone is already doing before you told them all to use a Single Server and SQLite?

Conclusion

Like all of these other similar types of articles it is patently nonsense. The calculations are fag-packet at best and potentially far, far, off the actual reality but we already know that because those of us who run large sites already know that. Contrary to popular opinion, a lot of us who make these decisions do understand the trade-offs and although our decisions are not always perfect and maybe we sometimes regret our decisions, I am happy that we run on cloud services as well as on-prem servers. I also look forward to the day that we can run more applications on this “Docker nonsense” which significantly reduces the amount of my time I have to run these systems and even at $50K/year, it is still significantly cheaper than employing people to do all the stuff that Azure or AWS do for me, even if they are charging a premium for it.

I have run small VMs for scenarios where it works, one of them runs a local copy of MySQL and it’s great but it wouldn’t work for what we do at SmartSurvey. Maybe one day, we will get rich enough that we could save a few million by bringing things in-house but I doubt it. These things do repeat every 10 years or so and I am waiting for the first article describing how, “We brought the cloud in-house and it was the worst decision ever”, not because it always is but because some things work for some people and not others.

As with every engineering decision ever, your mileage may vary. It always did and it always will.

Nullable types slow down dotnet core

2024-03-18T13:40:00+00:00

Nullable reference types are a feature added in C# 8 and enabled by default in new project templates since .NET 6 (C# 10). It seems a bit weird because reference types are always nullable so what is this feature about?

The problem with what Microsoft call “oblivious nullable state” or the previous scenario where all reference types could be null or might not be null, is that the compiler cannot help you spot the places that you might be de-referencing a null reference. That means that the programmer is responsible for considering all cases through code where something that should be non-null might be null and will throw if left in-place. On the other hand, the programmer might be adding an excess number of null checks for things that will never be null due to the code, which just adds more noise.

Marking a reference type as explicitly nullable like string? myvariable; tells the compiler that I know this might be null so please help me to find places where I am de-referencing it before checking for null, which by default is a warning but can be elevated to an error if needed. On the other hand, if I do NOT mark it as nullable, it means that I am telling the compiler that this should never be null. If I do that, the compiler will see any places where the variable might not be set before it is used, which would cause a NullReferenceException to be thrown.

It’s all pretty helpful but it has an annoying (but understandable) side-effect in ASP for dotnet core. If your viewmodel has a property that is not marked as nullable, the framework helpfully treats the property as implicitly required! That’s right, even if you haven’t added the RequiredAttribute, the property will be required and will fail modelstate validation if not set on POST. This is annoying for me since porting an MVC 5 app to dotnet core, it has broken a lot of validation and it is hard to find where this will happen without testing every single action. I could switch it off but if I do that, when will I ever switch it on again? Whenever I did, the problem would immediately surface again.

However, although that is annoying, it has a knock-on effect that you might spot on more complicated pages like our feature setting page that has about 300 features listed and which you can enable or disable on a per-customer basis. I was trying to fix an unrelated bug on the page and was testing the validation of the message you type when you make a change by using the required attribute. I added the attribute, ran the app up, left the box blank and clicked “Submit”. It didn’t seem to work….no wait…it is working. What? Let’s try again, leave the page come back, press submit and after about 3 seconds, the validation message comes up and this is Javascript!

You might see where I am going but I did scratch my head for a bit until I looked at the html markup to see what I could see. All of the 300 dropdown lists that contain the feature status were marked as required. This wasn’t affecting the posting of the form because they are all set to their initial value regardless of whether they are changed i.e. there is no “not set” value. However, the application of the data-val-required attribute was causing all 300 of them to be validated on submit and for reasons I do not know about, this takes about 3 seconds before the message for the input appears.

There is always something that makes me uncomfortable about non-obvious problems like these. Implicitly making something required breaks the principle of making errors obvious since it won’t always have an effect and if it does, and you don’t have a validator rendered, the page simply fails to submit and you have to work out how to debug what is going on.

That said, I don’t really know what I would do instead, perhaps just allow us to disable “nullable types = required” in model binding and let us live with the consequences of where we might accidentally forget to set something. Not sure

PostgreSQL Major Cluster Upgrade - Nearly Zero Downtime

2024-03-09T12:28:00+00:00

Upgrading the major version of a cluster in postgresql cannot be done automatically, although an in-place upgrade can be performed requiring the database server to be stopped, the upgrade to be performed, hoping it all works OK and then starting back up. When this is applied cluster wide, there are a number of failure scenarios, which could add up to a long amount of downtime if it doesn’t go well or if the database is large.

To be fair, this is the case with any database system. A major version is likely to include a number of breaking changes, at least at implementation level, which makes it non-trivial to upgrade.

At SmartSurvey, we are 2 major versions of Postgresql behind and although this might not be considered “ancient”, it also requires us to look into the not-too-distant future when we might need to perform an upgrade more quickly. In other words, we need to plan now, hopefully script or document the upgrade process so that we can either upgrade now, or in the future when we need to.

You have two options for upgrading a major version of Postgresql while avoiding downtime. The first is pg_dumpall, which is like taking a backup that can then be restored onto a newer major version. The second is logical replication, which can be set up from the older version to the newer version; once the new cluster is set up and tested in parallel, you can perform a quick changeover, changing back if there are problems.

pg_dumpall has the issue that the backup is not being kept up-to-date, so you would need some plan to get any new data that is added after the backup is taken or you might either stop the database or otherwise possibly block any writes if that is feasible. These sorts of approaches are risky. If they work, they can be problem-free but if they don’t, you could end up in a big mess quite quickly.

Logical Replication

Thanks to Knock who helped me to both understand logical replication and also to just get confidence that there weren’t too many variables and it was all going to be fine! At SmartSurvey, we currently have 1 planned downtime per year on Christmas Eve for 4 hours to re-build major indexes etc. (on SQL Server) so we ideally didn’t want any downtime and it is possible with logical replication.

Logical replication is less efficient than the physical replication that is used for streaming. Rather than replicating the physical contents of the log files, it replicates a logical set of data (imagine something like SQL statements), which is larger and needs to contain information that is not needed when copying physical files between two servers of the same major version.

There are a couple of restrictions on this approach but they are workable still:

The current values of sequences (auto increment columns) are not copied during replication - these need to be managed separately
The schema of the databases is not copied, this needs to be done manually before the logical replication is setup
Large objects are not supported
TRUNCATE requires some care to replicate if you use this
You cannot replicate views etc. only actual tables
If you use partitioning, you must have matching partitions on the subscriber

For us, the only thing of any great consequence concerns sequences.

Essentially, the sequence problem means that if you have replicated e.g. 10 rows and the source database sequence therefore has a current value of 11 for the next row, the destination database, if not modified, will still have a current value of 1. If you subsequently insert a row into the destination after you changeover to the new cluster, it will fail since the sequence number will already exist in the replicated table. If you can, you can simply reseed these after all the data has been copied from the source (if the source is not being actively written to) or otherwise you can reseed the destination to e.g. current source value + 1000 so that the source can still increment its sequence and still hopefully avoid a collision with the destination. Obviously, the 1000 will depend on how much data is added to the source table. In high traffic scenarios, you might be adding 1,000,000 for example.

The Approach

This will depend on how static your databases are but basically:

Plan, think, understand. I cannot over-emphasise that 1 person needs oversight who understands all the moving parts. There might be moving parts that some people do not know so if they are all made visible to a single person, it is up to them to plan and agree the changeover plan
Setup a new major cluster alongside the existing
Setup logical replication of the old databases to the new cluster
Test that the apps can connect to the new cluster and work properly
Stop the replication* and change over at the same time

*You might not need to stop the replication first if you have left room in tables to replicate old data during the changeover.

As much as possible should be done before even considering switchover. For example, setting up at least one new VM to start with, setting up the new databases, configuring backups and monitoring etc. You can also check that the replication continues to work in real-time i.e. it didn’t just work once.

You will need to list all of the apps/systems that depend on these databases and work out for each of them how their connection strings work and how you can quickly change over from old to new. If you are fortunate, you might have some kind of load balancer between all apps and the databases, which can be quickly swapped in and out when needed without changing the apps but if not, you might choose to use a feature gate that all systems will use to decide whether to hit the old or the new databases. Alternatively, just change them one-at-a-time after you have tested them on a staging environment.
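As a rough idea of what a feature gate for this could look like, here is a minimal Python sketch. The flag name USE_NEW_CLUSTER and both connection strings are made up for illustration and are not from our setup. The point is only that each app asks one function which cluster to hit, and that anything other than an explicit "on" keeps it on the old cluster.

```python
import os

# Hypothetical connection strings; the host names are placeholders.
OLD_CLUSTER_DSN = "host=postgres-old port=5432 dbname=mydatabase"
NEW_CLUSTER_DSN = "host=postgres-new port=5432 dbname=mydatabase"


def choose_dsn(env=None):
    """Return the connection string to use, based on a USE_NEW_CLUSTER flag.

    Anything other than an explicit "1", "true" or "yes" keeps the app on
    the old cluster, so a missing or mistyped flag fails safe.
    """
    env = os.environ if env is None else env
    flag = env.get("USE_NEW_CLUSTER", "").strip().lower()
    if flag in {"1", "true", "yes"}:
        return NEW_CLUSTER_DSN
    return OLD_CLUSTER_DSN
```

Changing back is then just a case of flipping the flag again, which is what you want if the new cluster misbehaves.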

Reusing the Current Replica VMs

Although it might seem financially prudent to reuse the existing replica VMs and since postgresql can theoretically run multiple versions alongside each other, there are a number of problems with this approach. Firstly, if you are installing from e.g. Ubuntu packages, they will have a layout that might not necessarily be compatible with side-by-side running. Secondly, to install a newer version of postgresql on an older distribution, you are likely to need to add additional repositories to sources.list, which will then likely have updates for the older version which might or might not be compatible with something that is already running. Thirdly, almost certainly you will need to stop and/or restart the older version during the upgrades so you could easily end up with a broken VM or at least downtime.

The best option is simply to add a completely new VM to set up the initial primary node. This could be as well as or in place of an existing replica depending on how many replicas you need to achieve your desired HA. Honestly, for the price of most VMs, you could have them for a month for the cost of the one day of SRE time that you might need if you break stuff.

Scripting the VM installation

One concern about starting from a fresh VM is that it can take time and you might not remember all the steps you took to build the original replicas that are currently working. This is 100% a reason to script the install as much as possible and document the rest. We use Ansible; I don’t know how it compares to Puppet or Chef but it does the job for us.

When you use Ansible properly, because it is idempotent, you should be able to create an initial script and keep running it against your target, adding in the various extra pieces you need like some support libraries, repmgr, pg_hba.conf and the like. Most of these will already be available in your old replica to copy from and you can and should be constantly testing as you go along that the cluster is working as expected. It is perfectly possible to have it working, then to break it with a change that you didn’t think was risky (maybe a typo in a config file).

Our script installs postgresql and adjusts the data directory to point to an external disk that we add on Azure, which just gives us some performance benefit (by not running it on the OS disk). We add a tweaked postgresql.conf which adjusts some of the memory allocations, which are all quite low by default e.g. shared_buffers. We also add a postgresql Prometheus exporter and a Node exporter for the VM itself for monitoring.

The only thing we need to do up-front, after creating the VM, is to check/configure the external disk (in one of our scale sets, it automatically gets the correctly mounted disk), add a user for ansible with a specific public key and then update our ansible inventory to include the new host. I also ssh from our build agents to the new VM to get past the fingerprint warning message but I suspect I could do something better for that.

Preparing and implementing logical replication

Once the target VMs have been setup with the newer version of postgresql and the logs checked to make sure it is running OK, we can prepare for logical replication.

Big tables

The logical replication will do an initial data sync after which it will replicate any changes. If your database/certain tables are large, then this initial sync could be a significant network or disk hit on your production system so you might want to test it on a smaller table and see what is happening before deciding how to do it. You can choose only a subset of tables for replication initially and then add the others in bit by bit to reduce this risk but for us, the largest DB is only about 2GB so small enough to manage over a data centre network.

wal_level

Firstly, the old cluster nodes need to have their wal_level changed to logical if not already. Before v16, you can only logically replicate from the primary node so potentially this change only needs to be made there but I like to keep the configurations the same so I will change it on all 3 of mine. They need restarting, which is a small amount of downtime so I did this late at night when usage on our cluster was very low. You should do this first although I think that as long as you do it before you start the replication, it should be fine since I think the initial load of data does not use the logs, just the subsequent updates.

“logical” is the highest level of logging and will take up the most space since it includes “logical” information that allows a newer version to read the logs. However, since we will eventually be decommissioning the old server, we shouldn’t need to worry, although if you are tight on disk space you will need to be careful and might want to take a backup and cleanup all old logs after changing over a database.

schema export/import

I then used pg_dump to get the schemas for all of the old databases to run onto the new database. This isn’t done by logical replication so needs to be done manually. You will also need to do the roles, which you can do in pgadmin if you prefer by right-clicking the objects and “CREATE Script”, which you can then run against the new database server. These will be needed if your databases are owned by a specific user other than “postgres” (they should be!) before you create the databases. This is the command I used:

pg_dump --schema-only -U postgres -h 127.0.0.1 -d mydatabase > mydatabase.sql

Note that because of the way my auth is setup, I need to use 127.0.0.1 which allows logins whereas without that, it would attempt to use the unix socket which isn’t setup for local auth and will fail. Your options might be slightly different. Once this is done, uploading the scripts to the new database can also be done from the command line on the new server:

psql -U postgres -h 127.0.0.1 -d mydatabase -f mydatabase.sql

replica identity

When using logical replication, the system needs to know how to identify a source row in the target database. The obvious answer would be “by primary key” but, of course, not all tables have a primary key. If your table does NOT have a primary key, you will need to change the Replica Identity setting for that table in the source database to something that will work for your setup. See REPLICA IDENTITY for details. If you do not do this, updates won’t work during replication.

Create publication

Setting up the source end is simple enough: CREATE PUBLICATION mydatabase_pub FOR ALL TABLES; but see CREATE PUBLICATION if you need to only initially replicate a subset of tables.

Create subscription

This is also quite simple: CREATE SUBSCRIPTION mydatabase_sub CONNECTION 'host=postgres1 user=repmgr port=5432 dbname=mydatabase' PUBLICATION mydatabase_pub; If you don’t have the correct permissions with the specified user, you might need to choose a different user, create a new one or modify pg_hba.conf to allow the user to connect. Fortunately, postgresql will tell you if the problem is e.g. pg_hba.conf rather than the more generic “an error occurred” leaving you guessing!

In my case, the repmgr user only had “replication” access, which is not enough to start the subscription so I needed to add a pg_hba.conf entry to allow repmgr access to all databases. The old server will eventually go so I am happy about this since it is locked down to a private subnet anyway.

Start with a single small database for testing purposes. If you don’t have one, create one temporarily. It is much better to get things working on small tests end-to-end before committing to the larger/more complex databases.

Insert a new row into the old database to ensure it is also replicated across. Of course, there is no reason why it shouldn’t but it is best to check, especially if you have been fiddling and might have somehow disabled or broken the replication process.

Application testing

The drivers for communicating with postgresql probably don’t change much between major versions but even if you have the latest drivers (npgsql for .Net for us), you will obviously need to make sure there are no issues connecting to the new database. If there are, you might need some updates, tweaks to code or whatever.

Another thing that might be easy to forget is that sometimes older features get removed. These are usually subtle or niche features and for us, fortunately, I know that we don’t use anything weird that would be removed. For you, however, you might use all kinds of tips and tricks. I am not sure if there is an easy way to test whether your old database uses any removed features. I guess you might spot them during schema creation or automated tests.

For me, it is a simple case of pointing a local instance of a microservice to the new server and attempting a call that accesses the database. Just to make sure I wasn’t missing anything, I queried something that didn’t exist and then added a new row to the new database to make sure I was definitely pointing at the correct thing. In software, I believe the saying, “measure twice, cut once” is especially true. Something like a stray hosts file entry or a change you forgot you made to DNS could be the difference between success and failure (or at least a very long time spent debugging)!

Sequence setting

Remember, you might want to update the sequence values in the new database to match the old, unless the old one is still having inserts applied, in which case you might want to wait until closer to switchover or simply reset them to a value like the old one + some offset.

To get the current value from the old DB, simply run:

select schemaname as schema, 
       sequencename as sequence, 
       last_value
from pg_sequences

To update it in the new database, run ALTER SEQUENCE seq_name RESTART WITH new_value; for each sequence. There might be a quick way to do this for lots of sequences but I’m not sure - I just used NimbleText to take the output of the query and build the update statements for me. pg_sequences theoretically has the last value but it also returned null for me on some sequences, when the direct select statement didn’t so I had to manually replace these with 1 in NimbleText. I couldn’t work out how to “add 1” to the value in SQL so I had to manually add 1 to all of the numbers for those databases that are not being updated during changeover.
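If you don’t have NimbleText to hand, the same statement-building can be sketched in a few lines of Python. This is only an illustration, not what I actually ran: the function names are made up, it assumes the rows are the (schemaname, sequencename, last_value) output of the query above, and it assumes every sequence increments by 1. A null last_value is treated as 0, so with the default offset it restarts at 1 like my manual fix, and a bigger offset (e.g. 1000) leaves headroom for a source that is still being written to.

```python
def quote_ident(name):
    """Double-quote an identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def restart_statements(rows, offset=1):
    """Build ALTER SEQUENCE statements from pg_sequences rows.

    rows: iterable of (schemaname, sequencename, last_value) tuples, where
          last_value may be None for a sequence that returned null.
    offset: 1 for a static source, larger to leave room for a live one.
    Assumes each sequence increments by 1 (an assumption, check yours).
    """
    statements = []
    for schema, sequence, last_value in rows:
        # Treat a null last_value as 0 so the default offset restarts at 1.
        base = 0 if last_value is None else last_value
        statements.append(
            f"ALTER SEQUENCE {quote_ident(schema)}.{quote_ident(sequence)} "
            f"RESTART WITH {base + offset};"
        )
    return statements


if __name__ == "__main__":
    example_rows = [
        ("public", "mytable_id_seq", 1147864),
        ("public", "unused_seq", None),
    ]
    for statement in restart_statements(example_rows):
        print(statement)
```

The output can be pasted into psql on the new server. It is worth eyeballing the statements first rather than piping them straight in.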

Backups

This is another thing that can be setup before any switchover is required and it is really important to make sure it is up and running for a while before switching over. We use barman to take backups and currently keep 9 from the primary and 5 from a replica on the old cluster. On the new cluster, initially, since it will be a copy of the primary, we can just keep 1 or 2 backups while we set it up and leave it to run for a few days. We don’t want surprises like large backup sizes/poor disk space etc. which might happen.

One thing I will need to do is to reset the passwords on the new server for barman and streaming_barman which are used by the backup server to connect. I can’t remember whether these were created by the script with default passwords or whether they were scripted as part of the schema copy (I can obviously just try and connect from barman and see if I get an error).

You will also need to install the client package e.g. postgresql-client-16 for any new version of postgres you are accessing. If this is newer than the distribution has in their main repos, you will need to add another repo that does contain it. Be very careful here. The external repo (the postgresql one in my case) is likely to contain all kinds of other updates that might not be tested on your distribution. In this case, I added the repo, did an update, installed only the client package for Postgresql 16 and then commented out the repo again!

Although barman can support multiple versions of postgresql, it does not seem to be able to work out automatically which one to use. postgresql-client-common can theoretically select different versions of tools like pg_basebackup, but I don’t think the mechanisms it uses are compatible with barman, at least not the version I have installed. In that case you need to set the path_prefix parameter in your backup configuration to point to the bin directory for the version of postgresql you are using.

If you are testing this with a rarely updated database, no logs will be written for a while and barman check backupname will show “WAL archive: FAILED” for a while (or indefinitely) so you can run barman switch-xlog --force --archive backupname to force a log rotation on the source and get the wal streaming working.

Remember that you should also test that a backup works by restoring to a new server. This is really important because if you document and test what you need to do, you won’t feel under lots of pressure if a server does die and you need to create a new instance from a backup. It is also recommended that you schedule in e.g. a quarterly test to ensure it still works (or find out your backups aren’t running!). Although you can generally test the backup by creating it on a new server and then just connecting to it without needing to bring it online in the cluster, for example, it is also useful to consider what would happen if you somehow lost an entire cluster and wanted to start from scratch with a backup. You might not have this problem though!

New cluster replication

This is another thing that you could setup for a single server initially to keep costs down but with the intention of adding more replicas once the older cluster is closer to being switched off. The replicas are around £30/month so not exactly bank breaking.

This will use physical replication and is something we have done before using repmgr, which allows us to use the barman backups to seed the initial database creation (not that important with our small test database but anyway) and then it will setup streaming replication to the new primary. Again, we then test this by connecting to it and also check that if we add something to the new primary (via the old primary!) then it also appears in the new replica table.

This is also a good chance to test your ansible scripts are up to date and check your documentation is also up-to-date, there are lots of small things that can change over time or that you tweaked manually but forgot to add to the script.

Another test that is easier to make now, before we are live, is to attempt the failover from primary to replica, which I have to do roughly monthly for Linux updates. These failovers do create a small window of dataloss/connection issues and are generally done at night but we can just test our new cluster, that isn’t in production yet, without worrying.

Monitoring

It is a good opportunity now to setup some monitoring for the new cluster. We monitor the number of backups, including failed as well as disk space and replica lag, which is really important to know if the replicas are keeping up or not. This monitoring first alerted Gitlab to a problem, which was caused by deleting a very large account synchronously and the time it was taking to replicate each operation onto the secondary. It could also highlight network problems with your provider etc.
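I haven’t described how the lag number is produced, so treat this as one possible sketch rather than what our exporter does. One way to express lag is the byte distance between the primary’s current WAL position (e.g. from pg_current_wal_lsn()) and the position the replica has replayed (e.g. from pg_last_wal_replay_lsn()). An LSN is printed as two hexadecimal numbers separated by a slash, which can be turned into a single byte offset. The function names here are my own.

```python
def lsn_to_int(lsn):
    """Convert an LSN such as '16/B374D848' into a byte position.

    The part before the slash is the high 32 bits and the part after it
    is the low 32 bits, both written in hexadecimal.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def lag_bytes(primary_lsn, replica_lsn):
    """Return how many bytes of WAL the replica is behind the primary."""
    # Clamp at zero in case the two values were sampled at slightly
    # different moments and the replica appears to be ahead.
    return max(0, lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn))
```

Postgres can do the same subtraction for you with pg_wal_lsn_diff, so this is mostly useful if you are comparing values collected from two different servers by a script.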

Swapping over

If you have setup a new cluster alongside the old one, you can switchover in stages and how easy it is depends on whether the source database is actively receiving updates and how many.

The discipline here is to try and do everything in an ordered way, after planning, so that you don’t break something if you e.g. switched over an app before you have setup its database!

Apps with static database data

We have a database that is not actively used by its microservice currently, except for a healthcheck table. The data is static and therefore, it was easy enough to setup the replication, re-seed the sequences manually (add 1 to the current value, I used nimbletext to make this easier), updating on the new database e.g. ALTER SEQUENCE "mytable_id_seq" RESTART WITH 1147865; and switch the microservice over to point to the new cluster after testing it on our staging environment. Once this was done, I removed the unneeded subscription drop subscription auth_sub which also removes the replication slot on the source database server, which was no longer needed. Effectively, the old database is now dead and could be deleted either now or simply when we delete the entire old database server. Deleting it now helps make it clear that it is not used.

Another database was active but was still static data so was treated in the same way.

Apps with moving data

Only one of our databases is used a lot - the one that sends emails. Although I was able to stop our batch-sending application, there are lots of other scenarios that can trigger emails at any time. In the evenings, the numbers are low, however. All these do is add a message log entry, which we decided was OK if we lost a couple of these but it did make the changeover a little more thought provoking!

We re-seeded the big tables to be “current sequence + 1000” to give us some space.

We then tested the app could communicate via the new database while replication was still running via our staging environment.

I could have left replication running until the changeover but I was nervous about the sequence numbers and what that might mean if two systems were writing to the same table so I disabled it, deployed the update to production, which was about 10 seconds and then checked the old database. We had only lost 3 log entries (we could have copied them over manually but didn’t bother).

Problem apps

We only had an issue with 1 app. Grafana. Grafana can use postgresql as a backend, which is important when running in a container since the sqlite database it uses by default would either need mounting into a volume or it would lose its data when the container was killed. Also, it obviously wouldn’t work across multiple instances.

When grafana starts up, it checks the database and if it thinks it needs to, it performs a set of migrations on the database. You can jump multiple versions and it will simply apply all migrations from start to finish. However, even though the old and new databases were the same database, when I tried to switch over Grafana (a production system but we could lose it for a minute or 2 without big problems), it basically fell over trying to run migrations and seemed to leave the database in a broken state. I tried deploying the latest minor version, the next minor version and even redeploying the same version but no dice.

I am not sure whether I broke something: whether I remembered to reseed the target, whether I left the replication on for too long or whatever. Fortunately, I can leave it where it is for now, pointing at the old database (the only database still there), and start again. I will delete the target instance of grafana, create the schema again, add logical replication and then stop the replication before re-deploying the same instance. Only at that point will I try updating it to a newer version.

Conclusion

So this was all very doable and now I have done it once, I am much less worried about doing this the next time we upgrade from v16 to something else. A nice improvement in v16 is that we can now take logical replicas from non-primary database servers, which is a nice little way of reducing the load on the primary.

I will monitor the apps for a few days (I think we will know fairly quickly if there is a major problem) and then I will delete most of the old replicas, apart from one for Grafana (until we can migrate) and increase the number of v16 replicas. We use automation so we can treat them like cattle: simply delete the old ones completely and setup new ones with v16 instead of v14. This gives the hosting provider a chance to retire old hardware that might be waiting for us to remove VMs and ensures any “damage” we have done manually gets removed.

Barman backup error

2024-03-09T11:14:00+00:00

The scenario

Setting up a new Postgresql 16 cluster in addition to an existing Postgres 14 cluster and now setting up Barman to be able to backup from both places. Barman is setup and barman check postgres-16 comes back all “green” so now I attempt to take a backup.

The error

barman$> barman backup postgres-16

ERROR: Backup failed copying files.
DETAILS: data transfer failure on directory '/sqldata/postgres-16/base/20240307T131134/data'
pg_basebackup error:
pg_basebackup: initiating base backup, waiting for checkpoint to complete
WARNING:  aborting backup due to backend exiting before pg_backup_stop was called
pg_basebackup: error: could not initiate base backup: ERROR:  could not open directory "./main": Permission denied
pg_basebackup: removing contents of data directory "/sqldata/postgres-16/base/20240307T131134/data"

Why it’s a warning and not an error, I am unsure about.

I have not seen this previously despite having setup a number of barman backups. I also can’t find any internet hits on the same problem but one issue is always that e.g. “./main” might not be the same directory name as for other people so I can’t search for an exact match.

The problem

There was a stray folder in the source database data directory and it was owned by root so was not accessible by postgres. Annoyingly, the message shows a relative path which isn’t too helpful since it is not clear where “./main” is relative to. Also, annoyingly, the name “main” is also the same as the name of the cluster.

Debugging

My main problem is usually not reading the error messages carefully enough but there was definitely nothing I could work out so:

I needed to attempt pg_basebackup on the source database server since this is what barman was trying to run. This would immediately tell me whether the problem was in the source server or whether it was the barman/network config that was wrong. This also failed with the same error (and also didn’t tell me where this folder was relative to). This “reduction” technique to a smaller set of variables is a cornerstone of debugging and something that not everyone thinks to do.
I did an ls -l on the data directory for the database. I could see that all the files seemed to be correctly owned by the postgres user except one! One called “main”. I don’t know where it came from, possibly from Ansible, which might explain why it was owned by root and not by postgres. This might also explain why it is not a common error on the internet since it is unlikely to normally happen.

The folder was empty and not required but its presence was enough to trip over pg_basebackup so all I had to do was delete it and everything worked perfectly!
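If you want to check for this sort of thing without squinting at ls -l output, the same check can be sketched in Python. This is just an illustration with a made-up function name: it lists the entries directly inside a directory whose owner is not the uid you expect. You would pass in your own data directory and the uid of the postgres user (pwd.getpwnam("postgres").pw_uid will look it up on a typical Linux box).

```python
import os


def entries_not_owned_by(directory, expected_uid):
    """Return the sorted names directly under `directory` with another owner.

    Only the top level is checked, which mirrors running ls -l on the
    data directory rather than walking the whole tree.
    """
    mismatched = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # follow_symlinks=False reports the link itself, like ls -l does.
            if entry.stat(follow_symlinks=False).st_uid != expected_uid:
                mismatched.append(entry.name)
    return sorted(mismatched)
```

In my case this would have returned just “main”, the stray root-owned folder.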

SPAs are killing the internet

2024-01-30T13:57:00+00:00

What’s wrong with the internet?

Well lots of things are wrong with the internet. Trust is hard to establish, crimes are committed, people are named/shamed/blackmailed online and other difficult problems but one problem is not difficult but has become that way and that is the reliability of online applications.

You know the sorts of things:

You click a button and something doesn’t quite look right, you don’t know if the action was successful
You do something and get some abstract error, perhaps in a toast message
You are using an application that seems like it should be easy e.g. book a Doctor’s appointment, but for some reason, it is really complicated and its performance is terrible

In fact, I would go as far as to say that today, if I use an application that just works as-expected, I am actually surprised because most apps are not like this.

Why are apps so bad?

Of course there are various reasons but I am going to make a bold statement that the cause of most problems with most applications is the industry’s obsession with Single Page Application frameworks or more specifically, an application that consists of front-end code in Javascript driving an API backend. Most applications do not need this complexity so why does everyone do this?

The promise of front-end frameworks

Much more UI-rich applications. This is a lie for the simple reason that all front-end frameworks actually do is html, css and Javascript so everything you can do, you can already do with any web framework
More responsive. Another lie. The only time this is true is in the most simple case where you are doing some fairly lightweight transient changes like folding things up and down, hiding and showing etc. As soon as you add an API call, you either need to wait for it (if it is important) or otherwise continue whether it failed or not. Most responsiveness is easy enough to achieve with a backend framework + some basic JS/CSS using a lightweight framework.
Your backend devs can also work on the front-end: everything is Javascript. Again, not true. Most of our Frontend devs know very little about hosting and databases so even if we did use Javascript on the backend, we would not benefit from this proposed advantage. We are not short of backend devs so we don’t need people to work in both. In fact, some of our C# devs can do a reasonable amount of Vue js so it is possible to know more than one language.
The richness of the library eco-system. Richness or the vast reams of mostly defunct or unmaintained packages? There are plenty of great libraries out there but how do you find them? Google? Stack Overflow? We have chosen certain libraries that have lasted about a year before the maintainer gives up for whatever reason. Why don’t we just develop our own libraries? Ever heard of DRY? We don’t do it because others have learned more than we will.
Removal of workload from the backend. Again, this is only true in a VERY small niche of applications. I can accept that people like Facebook or Google with stateless servers can squeeze worthwhile gains out of this benefit but our system of 3 instances of backend applications behind a load balancer is not even struggling with 10K concurrent users. How many of us really need this? Also, it takes a large amount of discipline to do this since your APIs are having to do work, albeit potentially without processing state, but the backend will have to validate most of the same things that the front-end also has to so not much benefit here.
Easy to knock up new pages based on the existing controls. Ha. Maybe in the most simple scenario like a listing control but in most cases, every page has its own challenges, its own edge cases that require new components or adding even more complexity to the existing components. Honestly, I could knock up most of this stuff with jQuery and a Razor page in a 10th of the time it takes in FE code. I also don’t need to wait for someone to create me an API endpoint that might or might not return exactly what I want.

So should we really just use a backend framework?

OK, so front-end frameworks don’t deliver most of what they promise, but does it really matter? I think it really matters. Honestly, most people should stick to .Net/Java/Ruby/Python or whatever language you like since most of the best frameworks in these languages are already great.

Try debugging a front-end app! I am porting an older app over to Dotnet Core right now and some things don’t work. I can run up the app directly from Visual Studio, I can look at the rendered page and because of model binding in the request, I can see if it should work. If not, I can debug into the Action and see whether the parameters are populated. If not, I am likely to get some error or I can look at the raw request: “Oh yeah, unchecked checkboxes are not sent in a form”. Do the same thing in the front-end? Good luck. The data probably hits at least 10 files in its journey, any of which might not work and you don’t get great debugging in Javascript without peppering your code with debugger statements or console.logs.
Is webpack a great thing? Webpack or whatever you use to bundle your front-end app is REALLY complicated. So complicated that I reckon most people probably copy and paste previous examples and tweak them to suit because, again, crazy numbers of packages and loads of options. It is clever, for sure, but it is not nice.
What about yarn install and build? Our app only has about 3 JS-heavy pages using Vue components but the install and build adds around 10 minutes to our CI build. The .Net app builds in about 30 seconds. Everything about these tools is horrible but no-one has been able to fix them. Eventually someone else just says, “Yarn? Ha. We all use Vite now” as if they have fixed the underlying problem when they haven’t.
I reckon 75% of our problems in production are related to the massive complexity of front-end code. You want some nice UI dependencies? Yeah easy. Oh wait, this breaks that page and the scrolling doesn’t work properly any more etc. They also take much longer to debug and fix on average. Again, too many moving parts in front-end frameworks and most of these are just duplicating a lot of what you already have in the back-end anyway.
You can actually get great response times in a back-end framework too. You can use Ajax to update a single element on a page instead of reloading everything. .Net had this in Web Forms 20 years ago and it worked great because the browser didn’t have to repaint everything. We are seeing single digit ms database access times and page rendering in roughly 100ms. Is that really that bad compared to how snappy JS is supposed to be?

What am I saying?

Simple. I have worked on roughly 10 major (public facing) applications over 20 years and none of them needed a front-end framework. Front-end framework code is ultimately just rendering HTML, CSS and JavaScript so it cannot do anything magic; it just pretends it can.

The joke is that people are already talking about server-side-rendering of front-end code because of these problems! You couldn’t make this up.

Maybe in another 10 or 20 years, people will start to evaluate technology based on pragmatism more than idealism and start to see that in most cases, a decent backend framework is all you need!

Random dotnet core routing errors

2023-12-30T12:42:00+00:00

The scenario

I am working on a migration of an older MVC5 (.Net Framework) app to a newer dotnet core 8 app. The Visual Studio Migration wizard does an OK job of creating the dotnet 8 app alongside the MVC5 app and then you can select individual classes/controllers/views etc. and “Upgrade” them which involves it being copied to the new app and some basic transforms made to it, like replacing System.Web namespaces with Microsoft.AspNetCore or whatever.

There are a number of things it cannot reasonably do automatically so you then need to pay down the debt of poor technical decisions since if you used nasty things like WebBag or various other fiddles with HttpContext then you are likely to be scratching your head or doing endless identical changes to make it compile.

The routing problem

The way the upgrade works is that it adds Yarp, which is a software reverse proxy which can forward any un-mapped routes to the old app via a localhost URL. All you have to do is make sure that logging into the new app also logs into the old app and then it should all be grand.
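As a rough sketch of what that wiring can look like, assuming the Yarp.ReverseProxy package (2.x) and a Startup-style app: the "ProxyTo" configuration key and the default route shown here are assumptions for illustration, not copied from my project. The important part is that the catch-all forwarder is given the lowest priority so it only picks up requests that nothing in the new app matches.

```csharp
using System;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Routing;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    private readonly IConfiguration _configuration;

    public Startup(IConfiguration configuration) => _configuration = configuration;

    public void ConfigureServices(IServiceCollection services)
    {
        services.AddControllersWithViews();
        // Registers the YARP forwarder used by MapForwarder below
        services.AddHttpForwarder();
    }

    public void Configure(IApplicationBuilder app)
    {
        app.UseRouting();
        app.UseEndpoints(endpoints =>
        {
            // Routes handled by the new app
            endpoints.MapControllerRoute(
                name: "default",
                pattern: "{controller=Home}/{action=Index}/{id?}");

            // "ProxyTo" is an assumed key holding the old app's localhost URL
            var oldAppUrl = _configuration["ProxyTo"]
                ?? throw new InvalidOperationException("ProxyTo is not configured");

            // Everything else goes to the old app; Order = int.MaxValue makes
            // this endpoint lose to any other endpoint that matches
            endpoints.MapForwarder("/{**catch-all}", oldAppUrl)
                .Add(static builder => ((RouteEndpointBuilder)builder).Order = int.MaxValue);
        });
    }
}
```

Because the forwarder only sees what the new app does not claim, an over-greedy route in the new app will quietly steal requests that should have matched something more specific or been forwarded.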

It was basically working OK and I had made hundreds of changes, but recently I noticed something weird that was not related to any recent change. Since I was not always running the app up, it wasn’t clear at what point it broke.

The problem was that if I clicked onto various menu items or indeed, the dashboard link (located at /), a random action was invoked and displayed and the Information level logging in the terminal window indeed showed something like, Request: https://localhost:7100/ followed by Executing endpoint: /Reports/ReportTool/Index. What? It’s not even a close match.

Debugging route problems

What I should have realised/tried very early on was to increase the logging level to Debug (using Serilog in my case) since the problem would have been much clearer. I did not, however, and attempted to debug into the framework code instead. That isn’t too hard to do but the code is very complicated and some of it is optimised away, so I only learned what the Debug logging told me later anyway, with no extra insight: there were two routes that matched the request and, as it had “found” the one inside the Reports area first, that was the one that was returned.

Now the debug logging was much more specific (although there is quite a lot of noise too!) and it said essentially, Matches the route: /home/index using the pattern {controller=Home}/{action=Index}/{id?} and also Matches the route Reports/ReportTool/Index using pattern {id?}

Ah OK, I can see why it would match a route that has a single optional part called id but how on earth did this happen? None of my routes in Startup.cs that use MapControllerRoute looked like that.

There are a couple of nuget packages designed to help you debug routing but one is based on the .Net framework Web API and the others don’t really give you very much other than confirming the route that has been chosen. The Debug logging tells you pretty much everything you need to know.
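If the noise is a problem, one option is to turn on Debug for the routing namespace only. This is a minimal sketch, assuming the Serilog and Serilog.Sinks.Console packages and that the host is already set up to use Serilog; the LoggingSetup class name is made up for the example and your own sinks and default level will differ.

```csharp
using Serilog;
using Serilog.Events;

public static class LoggingSetup
{
    // Hypothetical helper: call once at startup, before the host is built
    public static void ConfigureSerilog()
    {
        Log.Logger = new LoggerConfiguration()
            // Everything else stays at Information
            .MinimumLevel.Information()
            // Only the routing namespace is raised to Debug, which is where
            // the route-matching messages come from
            .MinimumLevel.Override("Microsoft.AspNetCore.Routing", LogEventLevel.Debug)
            .WriteTo.Console()
            .CreateLogger();
    }
}
```

The override is keyed on the log source prefix, so it can be removed again once the routing problem is found without touching the rest of the logging setup.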

Two types of routing

Then a small bell rang in my head because this has bitten me before, there are two types of routing in dotnet core: Conventional routing and Attribute-based routing. Attributes can have unexpected effects on Conventional Routing and that must be what happened here.

The problem with the MS documentation is that it either tends to be over-simplified without the details we want, or it is so complicated, you have to read it several times to understand it. Microsoft Learn have helpfully separated out “Routing”, “Controller Routing” and “Attribute Routing” into separate documents but it feels like it could do with a refresh and make the basic points clearer before diving into too many examples and details.

My app did have conventional routing set up correctly for both root-level Controllers and also area-level Controllers so it should have just worked, but I had made a mistake by using the HttpGetAttribute. This is an important attribute (along with its siblings) that helps the router decide which of two actions with the same name is selected, usually a Get/Post pair for MVC views. However, there is a subtle trap which is not obvious: I had used the overload that allows a route to be included in it. So a very abbreviated copy of my class was like this:

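using Microsoft.AspNetCore.Mvc;

[Area("Reports")]
public class ReportToolController : Controller
{
    // The route template passed to HttpGet is the part that matters here
    [HttpGet("{id?}")]
    public IActionResult Index()
    {
        return View();
    }
}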

The AreaAttribute on the class is fine, and is necessary to automatically wire up the area-level controllers, but the HttpGetAttribute, specifically the overload that takes a route template, is the problem because it implicitly behaves the same as this:

[HttpGet]
[Route("{id?}")]

The RouteAttribute is the problem because it has context-dependent behaviour. If the class also had a RouteAttribute then it would be combined with the route on the action, unless the action route starts with ~/ or /, in which case it is treated as an absolute path. That might be helpful for weird endpoints like /login which would otherwise require a customised controller route to be mapped.

However, in this case, there is no RouteAttribute on the controller class so there is nothing to combine the action route with. Even though it does not start with ~/ or /, it is still treated as an absolute route instead of potentially either throwing an exception like “Route is not specified as absolute” or perhaps just producing a plain 404. The result is an additional route with a pattern of {id?} which, of course, is very greedy: it was grabbing most other pages and being very confusing.
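One way to get rid of the stray route is sketched below. It assumes an area-aware conventional route such as {area:exists}/{controller=Home}/{action=Index}/{id?} is already mapped; that pattern and the int? type for id are assumptions for the example. The idea is simply to keep the verb attribute but drop the template, so the action stays on conventional routing.

```csharp
using Microsoft.AspNetCore.Mvc;

[Area("Reports")]
public class ReportToolController : Controller
{
    // No template here, so no attribute route is created and the action is
    // only reachable through the conventional area route,
    // e.g. /Reports/ReportTool/Index/5
    [HttpGet]
    public IActionResult Index(int? id)
    {
        return View();
    }

    // Alternative if an attribute route is really wanted: spell out the whole
    // path with tokens so it cannot match unrelated URLs, e.g.
    // [HttpGet("[area]/[controller]/[action]/{id?}")]
}
```

The optional id is still bound from the conventional route's {id?} segment, so nothing is lost by removing the template from the attribute.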

Conclusion

Understand routing and enable debug logging if you get weird routing errors!
