When we design software systems, we tend to imagine a perfect world in which every connection succeeds, every service is available, and the network is fast and reliable. Sadly, many of us don't really consider what happens when some of these assumptions fail. Perhaps in our system it is acceptable to simply show an error or timeout and let the user hit refresh (as long as this actually works!), but for some systems, especially ones where many screens of information must be navigated and saved into the session, this is not adequate.
The company I work for writes a system for applying for mortgages. Various screens of information are captured and various service calls are made along the way. We had a defect raised the other day which appears to have occurred because something took slightly too long and the page displayed an error. The problem was that the system was then left in an indeterminate state. The user interface was happy to retry the screens, but the underlying system had already completed what it was supposed to do and therefore would not return a 'success' code on the retry. There was no easy way to fix the problem; it would have required hacking the database directly to force the case through. In the end, the customer simply re-keyed the whole case, but this is not always easy and is not popular if it happens too frequently.
This was a case where we hadn't properly thought through the possible error conditions. In our case, the risk is increased by the number of people using the web app, so it is not surprising that these things happen. Not only does it look bad, it takes time and money to investigate the problem even if we end up not being able to fix it. It is also no fun as an engineer to be left guessing at the cause in the absence of any other discernible evidence.
The solution, as with all good design, is to consider every aspect of the workflow that is outside our control. This means the servers hosting the various parts of the system, the network, and any third-party calls to other services or apps. We need to consider not only the timeouts, which in our case we do, but also what happens if something times out at roughly the same time the underlying system completes successfully. This is especially true if you are using multi-threaded applications (which for web apps you usually will be). Ideally, your system should be able to go back to a known-good state from every page. At the very least, the higher-risk pages (those that call out to third-party services and the like) should be able to cancel and delete all current activity and take you back to a point the system can recover from. This might be slightly annoying for the user, but much less so than a full-on crash. You can also mitigate the annoyance by displaying a user-friendly message such as, "The third-party service was unavailable. Please press retry to re-key the details and try again."
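To make the timeout-versus-success ambiguity concrete, here is a minimal sketch in Python of one way to handle it (the endpoints, case identifiers and status values are hypothetical and stand in for whatever your own back end exposes): on a timeout, the client reconciles with a status check before deciding whether to report success or to cancel back to a known-good state.

```python
import requests  # any HTTP client would do; requests is used for brevity

CASE_SERVICE = "https://example.internal/cases"  # hypothetical endpoint


def submit_case(case_id: str, payload: dict) -> str:
    """Submit a case to the back-end service, reconciling on timeout.

    Returns 'completed' or 'rolled_back'; anything unexpected raises.
    """
    try:
        resp = requests.post(f"{CASE_SERVICE}/{case_id}/submit",
                             json=payload, timeout=10)
        resp.raise_for_status()
        return "completed"
    except requests.Timeout:
        # The call timed out, but the back end may still have finished.
        # Check its actual state before deciding what to tell the user.
        status = requests.get(f"{CASE_SERVICE}/{case_id}/status",
                              timeout=10).json()["state"]
        if status == "submitted":
            # The back end completed despite the timeout: treat this as
            # success so the UI does not retry an operation that cannot
            # succeed a second time.
            return "completed"
        # Otherwise cancel any partial work and return the user to the
        # last known-good screen.
        requests.post(f"{CASE_SERVICE}/{case_id}/cancel", timeout=10)
        return "rolled_back"
```

The same idea applies whatever your stack is: what makes the status check (and any subsequent retry) safe is having a stable case reference or idempotency key, so that "did this already happen?" is a question the back end can actually answer.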