Not everyone gets fault finding

I think one of the most common causes of frustration for me is when my expectations, expertise or understanding are not aligned with those of the people I work with or for. This isn’t just technical. If we are having a business or product discussion and you see some subtle but important point and explain it, but the other people are either dismissive or just don’t understand, it can breed resentment. It can be really hard when you are a deep thinker who can analyse complex things very quickly and others take more time to catch up (sometimes a lot more).

I realised this today with regard to fault-finding. Some people can just instinctively do fault-finding well and efficiently. They can either narrow down a problem to find the solution or they can quickly get to the limit of their expertise and then defer to some higher authority (“this is a fault in the kernel and I know nothing about that”). Others can’t. They either take a very long time to get there, don’t get there at all, or reach the wrong conclusion.

I remember my first job working on the railways, where one of the modules was fault-finding. I remember the “half-split technique”, where a fault on a train could be located quickly by splitting the train in half each time until you had narrowed the problem to a single carriage. I remember thinking this was really obvious: if you had a problem somewhere along the train, an 8-carriage train would need no more than 3 tests to narrow down the carriage that needed investigating. I didn’t realise that for some people this wasn’t obvious. Some people might have simply tested each coach in turn. On an 8-coach train they would have needed about 4 tests on average (and up to 7 in the worst case) instead of 3, and on a 12-coach train that average would have gone up to about 6, whereas half-split would have taken only 3 or 4.
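As a rough sketch of the half-split idea (my own illustration, not anything from the railway course), here it is in Python next to the test-each-coach-in-turn approach. It assumes exactly one faulty carriage and that you can test any contiguous run of carriages at once. The names `half_split`, `one_by_one`, `count_tests` and the `fault_in` callback are all made up for this example.

```python
from typing import Callable, List, Tuple

# fault_in(lo, hi) answers: "is the fault somewhere in carriages lo..hi-1?"
RangeTest = Callable[[int, int], bool]


def half_split(n: int, fault_in: RangeTest) -> Tuple[int, int]:
    """Find the faulty carriage by halving; returns (carriage index, tests used)."""
    lo, hi = 0, n  # the fault is known to be somewhere in [lo, hi)
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1
        if fault_in(lo, mid):
            hi = mid  # fault is in the first half
        else:
            lo = mid  # fault must be in the second half
    return lo, tests


def one_by_one(n: int, fault_in: RangeTest) -> Tuple[int, int]:
    """Test each carriage in turn; the last one is inferred without a test."""
    tests = 0
    for i in range(n - 1):
        tests += 1
        if fault_in(i, i + 1):
            return i, tests
    return n - 1, tests


def count_tests(n: int, search) -> List[int]:
    """Number of tests needed for each possible position of the fault."""
    counts = []
    for faulty in range(n):
        found, tests = search(n, lambda lo, hi: lo <= faulty < hi)
        assert found == faulty
        counts.append(tests)
    return counts
```

Under these assumptions, `count_tests(8, half_split)` gives 3 tests wherever the fault is and `count_tests(12, half_split)` gives 3 or 4, while `count_tests(8, one_by_one)` needs up to 7.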

For people who are not good at fault-finding, one of the biggest dangers is bad assumption. In order to fault-find, you need to have a hypothesis (experience helps you know what often goes wrong) and then the ability to test that hypothesis. That is good assumption. Bad assumption can be a number of things. Firstly, assumptions made with a lack of experience can eat up time. For example, assuming off the bat that there is a Linux kernel bug is probably unwise. Of course there are kernel bugs, but they are relatively rare and unlikely to affect most people who are using it. Secondly, assumptions that cannot be easily tested can also eat up time. In many cases, this comes down to expertise: if you suspect a network problem but don’t know how to test for it effectively, you end up using very ineffective or clunky measurements. Thirdly, and worse, you fix for your assumption before you have tested whether the hypothesis is correct. At best, this is a waste of time (or maybe an accidental fix) but at worst, it could cover up the actual problem, which might resurface later - a bit like allocating more memory to something that is leaking memory, which appears to fix the problem until later!

I see a lot of times where we need to fault-find in development. It is strange how big the difference is between how often you think standard code should go wrong in an unusual way (not just a simple typo etc.) and how often it actually does. I have seen various problems with build servers, some caused by network problems, some by code changes, some by test database changes, but in some of these cases the fault-finding took too long.

One time, some of our build servers had randomly failing tests. Since another build server ran consistently well, this was enough information to make some reasonable assumptions: 1) it was not the code, otherwise all build servers would fail and the same tests would fail consistently; 2) it was not the build server configuration, since that is shared across all build servers. So the next step was to look at what was actually causing the tests to fail. We mostly use Playwright for UI tests, running in headless mode for performance and stability. It is crucial that you have an easy means to disable headless mode and also to run tests on the build servers directly, so that you can compare the build server with a local run (the local tests always ran fine, like the good build server). Anyway, we couldn’t do that (disable headless) but we noticed that in all visible cases, the failed tests were timing out on some kind of UI loading, e.g. waiting for a page to load. Since there was a good server that worked fine and used all of the same configuration and databases as the broken servers, a good assumption was that there was some kind of network issue. But since the failed tests were fairly random, it was more likely to be an intermittent network issue rather than something like a firewall.
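On the point about having an easy way to disable headless mode, one possible approach is to drive it from an environment variable so that it can be flipped on a build server without a code change. This is only a sketch using Playwright’s Python API, which may not be the flavour of Playwright your tests use. The variable name `UI_TESTS_HEADLESS` and the helpers `headless_from_env`, `launch_browser` and `open_page` are invented for the example.

```python
import os


def headless_from_env(var: str = "UI_TESTS_HEADLESS", default: bool = True) -> bool:
    """Read the headless setting from an environment variable (name is hypothetical)."""
    value = os.environ.get(var)
    if value is None:
        return default  # stay headless unless told otherwise
    return value.strip().lower() not in {"0", "false", "no", "off"}


def launch_browser(playwright):
    """Launch Chromium; `playwright` is the object yielded by sync_playwright()."""
    return playwright.chromium.launch(headless=headless_from_env())


def open_page(url: str) -> None:
    """Example use: open a page with a visible browser if the variable says so."""
    # Imported here so the helpers above work even where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = launch_browser(p)
        page = browser.new_page()
        page.goto(url)
        browser.close()
```

With something like this in place, setting `UI_TESTS_HEADLESS=0` on one build server would be enough to watch a failing test run there with a visible browser.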

So this was an area that I was not particularly experienced in - I can use Fiddler but what about Wireshark? Anyway, I asked our hosting provider whether there was any reason this might be happening, since we certainly hadn’t changed anything that could have affected it. They told me that there was nothing strange. I made the (wrong) assumption that it could not be a network issue if the hosting provider said there were no issues. I also noticed some other things like virtual network adaptors (these are all VMs) and, since it was conceivable these could cause random issues, I had them removed, but it was no better. I fired up Fiddler and could see that certain requests were timing out. For example, it might load something from our CDN and sometimes it worked normally/quickly, sometimes it didn’t work at all and timed out. Weird. Since it was, fortunately, happening across different hosts, it was unlikely that all of the providers (Google, Amazon etc.) had a problem at the same time, although if it had only affected one supplier, that would have been a good thing to test in isolation. This is another thing that people don’t often do - reduce the variables down to the smallest reproducible problem. If you can reproduce the problem by running only a handful of tests, don’t waste time running 25 minutes of tests each time. I still didn’t really have enough information to go back to our provider and tell them that there was a problem!

Enter Wireshark. It is very powerful but it is worth taking a couple of basic lessons, mainly about filtering, to ensure you don’t capture too much data. The other thing is that each of the failures looked the same, so I only needed to capture one of them to have enough information; again, this allowed me to reduce the amount of data captured. Once I had captured my packets, Wireshark allows you to highlight the chain of messages from the same TCP connection, so I compared a good one and a bad one. A good one, as you would expect, has SYN, SYN-ACK, ACK, message(s), FIN, whatever. The bad one had a SYN, which timed out, then a legacy SYN, which also timed out, and then it bailed. This was nice and clear-cut. Sometimes TCP connections were attempted, literally got no response and eventually timed out. The browser would retry and usually be successful the next time, so some tests were slower and some failed. I went back to the provider and insisted, after which (2 weeks I think!) they came back having found some dodgy routing that had been left configured on one of our switches, which was redirecting some traffic (possibly the inbound) to the wrong place!
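Once the symptom is known to be TCP connection attempts that never get an answer, the problem can probably be reproduced without the test suite at all. The following is a minimal sketch of that idea, not what I actually ran at the time: it just opens a TCP connection over and over and records which attempts fail and how long each one took. The function names, the default values and the host in the usage note are all placeholders.

```python
import socket
import time
from typing import List, Tuple


def probe(host: str, port: int, attempts: int = 20,
          timeout: float = 3.0, pause: float = 0.0) -> List[Tuple[bool, float]]:
    """Try to open a TCP connection `attempts` times; record (succeeded, seconds taken)."""
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            # create_connection performs the TCP handshake and raises on timeout or refusal
            with socket.create_connection((host, port), timeout=timeout):
                ok = True
        except OSError:  # covers timeouts, refused connections and DNS failures
            ok = False
        results.append((ok, time.monotonic() - start))
        time.sleep(pause)
    return results


def summarise(results: List[Tuple[bool, float]]) -> str:
    """One-line summary suitable for pasting into a support ticket."""
    failures = sum(1 for ok, _ in results if not ok)
    slowest = max((taken for _, taken in results), default=0.0)
    return f"{failures}/{len(results)} connection attempts failed; slowest took {slowest:.2f}s"
```

For example, `print(summarise(probe("cdn.example.com", 443, attempts=50)))` run on a good server and on a bad server gives two numbers to compare in seconds rather than after 25 minutes of tests. A connection that succeeds but takes unusually long may also be worth a look, since the operating system retries an unanswered SYN by itself.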

Anyway, this is what I have realised:

Some people are instinctively good at fault-finding, some are not (but they can usually learn)
You should work with testable assumptions
You need experience to choose the most likely problems first
You need expertise in the various tools available to developers to effectively test things
You should reduce any complex issues down to the smallest reproducible problem to reduce variables
You should only fix things once you have proven the problem, and then prove it is fixed after your change