Live. This area is documented as current, user-reliable behavior.

Goal

Troubleshoot stacks as service systems, not just as single containers.

Prerequisites

  • An existing stack

Workflow

  1. Start with the stack detail page, health state, and recent logs.
  2. Check placement and node health when the stack never stabilizes.
  3. Use recovery-state messaging for backup, restore, or template-upgrade issues.

Runtime vs placement

  • A stack that crashes or restarts repeatedly usually has a runtime problem. Read the logs and per-service health first.
  • A stack that never stabilizes often has a placement problem: it is pinned to an unhealthy node, or it uses least_loaded placement and no healthy node matches its selector tags.
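
The placement case above can be pictured with a minimal sketch. It assumes that least_loaded placement filters to healthy nodes carrying every selector tag and then picks the one with the lowest load. The Node fields, the pick_least_loaded helper, and the "all tags must match" rule are illustrative assumptions, not the product's scheduler or schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    # Hypothetical node record; field names are illustrative only.
    name: str
    healthy: bool
    load: int  # assumed load metric, e.g. number of running services
    tags: set = field(default_factory=set)


def pick_least_loaded(nodes: list, selector_tags: set) -> Optional[Node]:
    """Return the least-loaded healthy node that carries every selector tag."""
    candidates = [n for n in nodes if n.healthy and selector_tags <= n.tags]
    if not candidates:
        # Nothing eligible: this is the "placement never lands" situation.
        return None
    return min(candidates, key=lambda n: n.load)


nodes = [
    Node("node-a", healthy=True, load=3, tags={"ssd"}),
    Node("node-b", healthy=False, load=0, tags={"ssd", "gpu"}),
    Node("node-c", healthy=True, load=5, tags={"ssd", "gpu"}),
]

for tags in ({"ssd"}, {"gpu"}, {"arm64"}):
    chosen = pick_least_loaded(nodes, tags)
    print(sorted(tags), "->", chosen.name if chosen else "no eligible node")
```

In this sketch the {"gpu"} selector skips node-b because it is unhealthy, and the {"arm64"} selector finds no node at all, which is the kind of stack that never stabilizes.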

Template drift and upgrades

Template-backed stacks report one of four upgrade statuses: up_to_date, update_available, upgrade_blocked, or unknown. An upgrade_blocked status means the upgrade cannot be applied safely as-is. Preview the upgrade before applying it, and remember that a template upgrade can be rolled back.
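
As a rough illustration of how the four statuses might drive an operator's next step, here is a minimal sketch. Only the status strings come from this page. The UpgradeStatus enum, the next_step helper, and the suggested wording for each action are assumptions made for the example, not a product API.

```python
from enum import Enum


class UpgradeStatus(str, Enum):
    # The four status strings listed above.
    UP_TO_DATE = "up_to_date"
    UPDATE_AVAILABLE = "update_available"
    UPGRADE_BLOCKED = "upgrade_blocked"
    UNKNOWN = "unknown"


def next_step(raw_status: str) -> str:
    """Map a reported upgrade status to a suggested action (hypothetical helper)."""
    try:
        status = UpgradeStatus(raw_status)
    except ValueError:
        # Treat anything unrecognized the same way as "unknown".
        status = UpgradeStatus.UNKNOWN

    if status is UpgradeStatus.UP_TO_DATE:
        return "nothing to do"
    if status is UpgradeStatus.UPDATE_AVAILABLE:
        return "preview the upgrade, then apply it"
    if status is UpgradeStatus.UPGRADE_BLOCKED:
        return "preview the upgrade and resolve what blocks it before applying"
    return "status not determined; start from the stack detail page and logs"


for s in ("up_to_date", "update_available", "upgrade_blocked", "unknown"):
    print(f"{s}: {next_step(s)}")
```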

Backup and restore failures

  • A failed volume archive leaves the backup incomplete. Restore only from a completed backup.
  • An agent that is too old to support the volume endpoints fails backup and restore; upgrade the node agent.
  • Follow the recovery-state messaging for the whole restore instead of assuming the status badge alone means success.
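
The "only restore from a completed backup" rule can be expressed as a simple guard. This is a sketch under assumptions: the VolumeArchive record and both helpers are hypothetical names, and it treats a backup as complete only when it has at least one archive and every archive succeeded.

```python
from dataclasses import dataclass


@dataclass
class VolumeArchive:
    # Hypothetical per-volume archive result; not the product's schema.
    volume: str
    succeeded: bool


def failed_volumes(archives: list) -> list:
    """Names of volumes whose archive did not succeed."""
    return [a.volume for a in archives if not a.succeeded]


def is_restorable(archives: list) -> bool:
    """A backup is safe to restore from only if it is non-empty and fully archived."""
    return bool(archives) and not failed_volumes(archives)


backup = [
    VolumeArchive("db-data", succeeded=True),
    VolumeArchive("uploads", succeeded=False),
]

if is_restorable(backup):
    print("backup complete; safe to restore")
else:
    print("backup incomplete; failed volumes:", failed_volumes(backup))
```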

Expected result

You can tell whether the problem is runtime, placement, recovery, or template-related.

Common failures

  • No healthy node matches the stack selector tags, so placement never lands.
  • A template upgrade reports upgrade_blocked and cannot apply without changes.
  • A volume archive failed, so the backup is not safe to restore from.

Stack logs, health, and placement

Use the stack detail, logs, and placement information to understand how the stack is actually running.

Back up and restore a stack

Use S3-backed named-volume archives to protect and recover stateful stack data.

Recovery states, logs, and troubleshooting

Read the operation state on a resource (its status, current step, attempt count, retryable flag, and last error) together with the logs, instead of treating a single “error” badge as the whole story.
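
To make the operation-state fields concrete, here is a minimal sketch that folds them into a one-line summary. The five fields are the ones named above. The OperationState class, the summarize helper, and the sample values ("failed", "restore_volumes", the error text) are invented for illustration and are not taken from the product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OperationState:
    # Field names mirror the list above; types and values are assumptions.
    status: str
    current_step: Optional[str]
    attempt_count: int
    retryable: bool
    last_error: Optional[str]


def summarize(op: OperationState) -> str:
    """Combine every field into one line so no single badge hides the details."""
    parts = [f"status={op.status}"]
    if op.current_step:
        parts.append(f"step={op.current_step}")
    parts.append(f"attempts={op.attempt_count}")
    if op.last_error:
        parts.append(f"last_error={op.last_error!r}")
        parts.append("retryable" if op.retryable else "not retryable")
    return ", ".join(parts)


op = OperationState(
    status="failed",                 # hypothetical status value
    current_step="restore_volumes",  # hypothetical step name
    attempt_count=2,
    retryable=True,
    last_error="archive download timed out",
)
print(summarize(op))
```

Read this way, a failed but retryable operation on its second attempt looks quite different from one that failed and cannot be retried, even though both might show the same badge.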