When developing systems that must wait for key things to happen (often, things that are outside their direct control), we need to consider how long we’re willing to wait for an answer.

It’s tempting - and easy - to hard code these wait times into the system, deciding that we’ll always wait that long for it to happen. I’ve done this myself.

However, doing this introduces subtle instabilities and problems. We can - and should - try to do better.

Here are some examples.

Example: waiting for a service to be available

Context: Writing a scripted test that needs to start up a service, wait for the service to finish initializing, and then run a series of tests to ensure the service is behaving properly.

The easy approach would be to hard code a delay - say, 30 seconds - after starting the service, to give it time to start up.

  • Pro: Very easy to code.
  • Pro: Usually works.
  • Con: Every test has to wait for 30 seconds.
  • Con: Sometimes doesn't work.

Result: Unstable tests that sometimes fail because the service hasn’t yet finished initializing.

We could fix this by changing the delay to a longer value, say 60 seconds.

  • Pro: Very simple change.
  • Pro: Works almost all of the time.
  • Con: Every test now has to wait for 60 seconds.
  • Con: Still fails sometimes.

While this improves reliability, we don’t reach 100%; our tests are still flaky. Worse, we’ve slowed down our entire test suite to do it.

A better fix is to find a way to monitor the service to see if it is ready for use. One way that works for many services (if you’re running them on the current machine) is to see if their associated network port is accepting connections.

  • Pro: Only waits as long as required.
  • Pro: Most tests can start in just a few seconds.
  • Pro: Can wait much longer for the P99 case.
  • Pro: Better diagnostics when things go wrong.
  • Pro: No longer have unstable tests.
  • Con: Harder code to write.

Note that the code is harder to write once but gives benefits over and over again. It’s worth the effort.
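As a minimal sketch of the port-polling approach in Python (the host, port, and timing values here are illustrative, not prescriptive):

```python
# Poll a TCP port until it accepts connections, or give up after a deadline.
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 120.0,
                  interval: float = 1.0) -> bool:
    """Return True once (host, port) accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True  # The service is accepting connections.
        except OSError:
            time.sleep(interval)  # Not ready yet; back off briefly and retry.
    return False

# Usage: fail fast with a clear diagnostic instead of a silent flaky test.
# if not wait_for_port("localhost", 8080):
#     raise RuntimeError("Service never started listening on port 8080")
```

In the common case this returns within a second or two of the service coming up, while still allowing a generous ceiling for the slow case.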

Example: waiting for something to be created

Context: Your component needs to wait for something to be created by someone else. Perhaps you’re waiting on a dynamic database table to be created; perhaps a new queue in the cloud; perhaps a message channel in your service bus.

The easy approach is to work out how long it typically takes for that item to be created by the other component of your system, and to go to sleep for that long.

  • Pro: Really easy to set up - just guess a value and test in production to see if it's good enough.
  • Con: Relies on the other system taking a predictable length of time.
  • Con: Not reliable if things go slow (say, due to Storage throttling).

Instead, try testing for the existence of the queue - waiting only until it exists.

  • Pro: Don't need to predict the runtime of the other component.
  • Pro: Can wait much longer if required (say if the other component is running slow).
  • Pro: Faster startup time in the normal case.
  • Con: More complex code to write and maintain.

Once again, note that the code is harder to write once but gives benefits over and over again. There’s a pattern here.
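The pattern generalizes beyond queues. Here’s a sketch in Python where the `exists()` callback stands in for whatever lookup your platform provides (a queue-description API call, a table lookup, and so on - the timings are illustrative):

```python
# Generic "wait until it exists" helper: poll a caller-supplied existence
# check instead of sleeping for a guessed duration.
import time

def wait_until_exists(exists, timeout: float = 300.0,
                      initial_interval: float = 0.5,
                      max_interval: float = 10.0) -> bool:
    """Poll exists() with exponential backoff until True or the deadline passes."""
    deadline = time.monotonic() + timeout
    interval = initial_interval
    while time.monotonic() < deadline:
        if exists():
            return True
        time.sleep(interval)
        interval = min(interval * 2, max_interval)  # Back off between checks.
    return False

# Usage (hypothetical queue client):
# if not wait_until_exists(lambda: client.queue_exists("orders")):
#     raise RuntimeError("Queue 'orders' was never created")
```

The exponential backoff keeps the fast path fast (the first check happens immediately) without hammering the remote API when things are genuinely slow.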

Example: ensuring a process completes

Context: Creating a watchdog to ensure that a separate process doesn’t run for too long. Perhaps you want to detect a runaway process for telemetry purposes; perhaps you need to protect your system against a rogue actor.

You could set an absolute time limit for the process - say 15 minutes.

  • Pro: Easy to code.
  • Con: Takes up to 15 minutes to detect a problem.

As an alternative, actively monitor the process to see what it’s doing - you can then tune your tolerances to only admit behaviour that’s acceptable.

Here are some possible rules:

  • Must output to stdout at least once per minute.
  • Must not accumulate more than 10 minutes of total CPU time.
  • Must not run for more than 30 minutes wall clock time.

  • Pro: Much faster detection of stalled processes without constraining processes that are making progress.
  • Pro: Tighter constraints limit any negative impact on the rest of the system.
  • Pro: Better diagnostics when things are killed by the watchdog (you can report on what rule was violated).
  • Con: Requires more of the process being monitored.
  • Con: Might constrain legitimate use.
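Keeping the rules separate from the sampling mechanism makes them easy to test. A sketch in Python, where the thresholds mirror the example rules above and how you populate the stats (via /proc, a library such as psutil, or platform APIs) is left to you:

```python
# Watchdog rule checks as a pure function over sampled process stats.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProcessStats:
    last_output: float   # monotonic timestamp of the most recent stdout line
    cpu_seconds: float   # total CPU time consumed so far
    started: float       # monotonic timestamp when the process started

def violated_rule(stats: ProcessStats, now: float) -> Optional[str]:
    """Return a description of the first violated rule, or None if all pass."""
    if now - stats.last_output > 60:
        return "no stdout output for over 1 minute"
    if stats.cpu_seconds > 10 * 60:
        return "more than 10 minutes of CPU time"
    if now - stats.started > 30 * 60:
        return "more than 30 minutes of wall clock time"
    return None
```

Because the rule check returns a description rather than a bare boolean, the watchdog can log exactly why a process was killed.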

Conclusions

Hardcoded time frames are often a mistake because they prevent a system from adapting to the environment in which it runs. Dynamic constraints give your applications an opportunity to self-tune, to report better diagnostics, and to achieve greater reliability.
