Durable Runs: Introducing Automatic Retries in Tower

In a perfect world, every Tower application run completes successfully on the first try. But in reality, networks flicker, third-party APIs time out, and sometimes, things just crash.

We’ve heard from teams like Inflow Systems and Pyne that manual intervention for these transient failures is a major friction point. That’s why today, we’re excited to introduce Automatic Retries.

Why Retries?

Failure happens in two main flavors:

Application Logic: A crash on the developer’s side where a second attempt might solve the problem without a code change.
Infrastructure Issues: A hiccup on the Tower side (the "Tower needs to fix something" scenario).

With this update, Tower can now handle both. If your run hits a bump, Tower can automatically reschedule it based on a policy you define.

The New "Retrying" State

To support this, we’ve updated our run execution lifecycle. While most of this happens under the hood, understanding the states helps you debug more effectively.

A run can now be in any of these states:

Scheduled & Pending: Queued and ready to go.
Running: Currently executing.
Exited: Success! 🚀
Cancelled: Stopped by a user or the control plane.
Crashed: An error on the user’s side (requires a fix).
Errored: An error on Tower’s side. Rare, but it does happen 😜.
Retrying (New!): The run is in a "holding pattern," waiting for its next attempt.
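
To make the lifecycle concrete, here is a minimal Python sketch of how a failed run might end up in Retrying instead of a final failure state. This is an illustration only, not Tower's implementation: the `RunState` enum, its string values, and the `state_after_failure` helper are hypothetical names invented for this example, and treating Scheduled and Pending as two separate states is also an assumption.

```python
from enum import Enum


class RunState(Enum):
    # Names mirror the states listed above; the string values are assumptions.
    SCHEDULED = "scheduled"
    PENDING = "pending"
    RUNNING = "running"
    EXITED = "exited"
    CANCELLED = "cancelled"
    CRASHED = "crashed"
    ERRORED = "errored"
    RETRYING = "retrying"


# The two failure flavors that a retry policy can act on.
FAILURE_STATES = {RunState.CRASHED, RunState.ERRORED}


def state_after_failure(failure: RunState, retries_used: int, max_retries: int) -> RunState:
    """Return RETRYING while retries remain, otherwise the failure state itself."""
    if failure not in FAILURE_STATES:
        raise ValueError(f"{failure} is not a failure state")
    if retries_used < max_retries:
        return RunState.RETRYING
    return failure


# A crash with retries left waits for another attempt.
print(state_after_failure(RunState.CRASHED, retries_used=0, max_retries=2))
# Once the retries are used up, the run stays in its failure state.
print(state_after_failure(RunState.ERRORED, retries_used=2, max_retries=2))
```

In this sketch, seeing Retrying while debugging means the run has failed at least once and is waiting for its next attempt.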

Attempts vs. Retries

Tower counts the original run separately from its retries, so the total number of attempts equals your configured Retries setting plus 1.

The original attempt always runs first. If it fails, Tower performs up to N additional retries according to your configuration. For example, if you configure 2 retries, Tower makes up to 3 total attempts at running your application.
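
The counting rule is easy to get off by one, so here is a small local simulation of it in Python. It is a sketch of the arithmetic only and does not call Tower: `run_with_retries` and the `flaky` task are hypothetical names made up for this example.

```python
from typing import Callable


def run_with_retries(task: Callable[[], None], retries: int) -> int:
    """Call task until it succeeds or attempts run out; return the attempts used.

    Total attempts = 1 original attempt + `retries` additional attempts.
    """
    if retries < 0:
        raise ValueError("retries must be zero or greater")
    total_attempts = retries + 1
    for attempt in range(1, total_attempts + 1):
        try:
            task()
            return attempt
        except Exception:
            # Re-raise only when the final attempt has also failed.
            if attempt == total_attempts:
                raise
    raise AssertionError("unreachable")


calls = []


def flaky() -> None:
    # Fails on the first two calls, succeeds on the third.
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient failure")


# With 2 retries configured, the third attempt is still allowed to run.
print(run_with_retries(flaky, retries=2))
```

With `retries=2`, the task above gets its original attempt plus two retries, which is exactly enough for it to succeed on the third call.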

Setting Your Policy: The Idempotency Rule

You can now define a retry policy at the App level or override it for a single run. You can configure:

Max Retries: Up to 10 retries.
Retry Delay: How long to wait between attempts.
Exponential Backoff: If enabled, Tower will double the wait time after each failure (e.g., 30s, 60s, 120s) to give downstream systems time to recover.
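
To see what a policy means in practice, this short Python sketch computes the wait before each retry from a delay and a backoff flag. It is an assumption-laden illustration: the `retry_delays` function and its parameter names are hypothetical and are not Tower's API or configuration keys. It only reproduces the doubling described above.

```python
def retry_delays(base_delay_s: float, retries: int, exponential_backoff: bool) -> list:
    """Return the wait, in seconds, before each retry."""
    if retries < 0:
        raise ValueError("retries must be zero or greater")
    if exponential_backoff:
        # Double the wait after each failure: base, 2x base, 4x base, ...
        return [base_delay_s * 2**i for i in range(retries)]
    # Without backoff, every retry waits the same amount of time.
    return [base_delay_s] * retries


print(retry_delays(30, 3, exponential_backoff=True))   # [30, 60, 120]
print(retry_delays(30, 3, exponential_backoff=False))  # [30, 30, 30]
```

Summing the list gives a rough lower bound on how long a run can sit in retries before it finally fails, which is useful when choosing a delay for a slow downstream system.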

One critical note: Tower cannot determine whether your application is "safe" to retry, so we recommend enabling retries only for idempotent apps. What is idempotency? An app is idempotent if running it multiple times has the same effect as running it once. If your app sends an invoice each time it runs, a retry might send that invoice twice, so use caution!
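
One common way to make the invoice example safe is to record which invoices have already gone out and skip any repeat. The Python sketch below shows the idea under stated assumptions: `InvoiceSender` is a hypothetical class, and its in-memory set stands in for durable storage such as a database table. A real app would need that record to live outside the run, since a retried run may not share memory with the failed attempt.

```python
class InvoiceSender:
    """Sends each invoice at most once by remembering invoice IDs already sent."""

    def __init__(self) -> None:
        self.sent_ids = set()   # stand-in for durable storage (assumption)
        self.deliveries = []    # records what was actually sent

    def send_invoice(self, invoice_id: str, customer: str) -> bool:
        """Return True if the invoice was sent, False if it was a duplicate."""
        if invoice_id in self.sent_ids:
            return False
        self.deliveries.append((invoice_id, customer))
        self.sent_ids.add(invoice_id)
        return True


sender = InvoiceSender()
sender.send_invoice("inv-001", "customer-a")  # original attempt
sender.send_invoice("inv-001", "customer-a")  # retry: skipped as a duplicate
print(len(sender.deliveries))  # 1
```

Running the send twice has the same effect as running it once, which is the property that makes an app safe to retry.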

Get Started

Retries are available now via the Tower UI and API.

To enable: Head to your App Settings or check out our updated documentation for the API spec.
To monitor: You can view the full attempts history and specific logs for each retry directly in the Run Detail view.

Build more resilient workflows today: let Tower handle the hiccups so you can focus on the code. Join the Tower Discord community to talk to other users and the team, or try Tower for free.