Error “Timed out waiting for running jobs to complete”
Summary
You may occasionally see the following error in the job activity logs:
Timed out waiting for running jobs to complete
Details
This error occurs when a job cannot start on a cluster because another job is already running on the same cluster. The blocked job waits for up to 3 hours and, if it still cannot start, fails with the message above.
There are a few different scenarios in which this can occur. For example, suppose two different backup jobs, “backup 1” and “backup 2”, are scheduled for the same cluster at different times. If “backup 2” starts before “backup 1” has finished, “backup 2” waits for at most 3 hours before failing with the timeout message.
Note that this is different from the same backup being triggered a second time while its previous run is still in progress. For example, if a backup called “hourly backup” runs every hour and one run takes longer than an hour, the next run is skipped with this message:
Another job, from the same job definition, is already running in cluster <CLUSTER-NAME>
Another factor to consider is the concurrency limit configured for the cluster (see Running multiple jobs concurrently for details). Even when concurrent jobs are allowed, a job may still need to wait, either because of one of the limitations described at that link or because the cluster is already running the maximum number of jobs.
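The following Python sketch models the cases described in this article so that you can reason about which outcome to expect. It is only an illustration, not the product's scheduler. The names RunningJob, outcome, and concurrency_limit are hypothetical; only the 3-hour wait comes from the text above. The sketch also ignores the additional limitations covered in Running multiple jobs concurrently.

```python
from dataclasses import dataclass
from datetime import timedelta

# The 3-hour wait is taken from the article; everything else is illustrative.
WAIT_TIMEOUT = timedelta(hours=3)


@dataclass(frozen=True)
class RunningJob:
    definition: str       # name of the job definition, e.g. "backup 1"
    remaining: timedelta  # time left until this run finishes


def outcome(new_definition, running, concurrency_limit=1):
    """Return what happens to a new run of `new_definition` on one cluster."""
    # A run from the same job definition is already active: the new run is skipped.
    if any(job.definition == new_definition for job in running):
        return "skipped"
    # The cluster has a free slot: the job starts immediately.
    if len(running) < concurrency_limit:
        return "started"
    # Otherwise the job waits until enough running jobs finish to free one slot.
    needed = len(running) - concurrency_limit + 1
    wait = sorted(job.remaining for job in running)[needed - 1]
    return "started after waiting" if wait <= WAIT_TIMEOUT else "timed out"


# "backup 2" starts while "backup 1" still needs 4 hours: it times out.
print(outcome("backup 2", [RunningJob("backup 1", timedelta(hours=4))]))
# A second "hourly backup" run while the first is still going: it is skipped.
print(outcome("hourly backup", [RunningJob("hourly backup", timedelta(minutes=30))]))
```

In this model, raising concurrency_limit to 2 lets “backup 2” start immediately in the first case, which mirrors the point that a timeout only occurs once the cluster is already running its maximum number of jobs.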