Troubleshooting Cisco Catalyst Stack Switch Discovery Issues
Use this when Cisco Catalyst stack members are stuck in discovery or fail to reach Ready state.
Quick Read
- Symptom: One or more stack members are stuck in discovery or fail to reach Ready state.
- Check first: Run `show switch` and confirm each expected member state.
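- Risk: The `show` commands are read-only. The reload step interrupts traffic, and the `write erase` step is destructive.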
Symptoms
One or more Cisco Catalyst stack members do not finish discovery or do not reach a ready state. The stack may show missing members, version mismatch, removed/provisioned members, stack port down state, or repeated discovery messages.
Environment
Cisco Catalyst switch stacks, such as Catalyst 9300 series, running IOS XE with StackWise cabling and multiple stack members.
Most Likely Causes
Stack discovery failures are commonly caused by loose or failed stack cables, an open StackWise ring, member-number conflicts, mismatched or incompatible IOS XE versions, power or hardware faults, or stale provisioned member configuration. Unexpected priority values usually change which member wins the election rather than whether a member joins, but record them anyway. Less common causes include bugs in a specific IOS XE release or a member that cannot pass hardware diagnostics.
What to Check First
- Run `show switch` and confirm each expected member state.
- Run `show switch stack-ports` and confirm stack ports are up and the ring is healthy.
- Check stack cable seating, StackWise port LEDs, and whether the ring is open.
- Review logs for stack member, version, election, or hardware diagnostic errors.
Fix Steps
- Capture stack member state
Start with the current control-plane view of the stack. Record member number, role, priority, MAC, version, and state before making changes.
Example pattern only. Adjust for your environment before running.
show switch
show version
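To make the "record before changing" step repeatable, the captured `show switch` text can be turned into rows offline. The following is a minimal sketch, assuming the common tabular layout of `show switch`; the sample capture, the regular expression, and the function names `parse_show_switch` and `not_ready` are illustrative assumptions, and real output differs between platforms and IOS XE releases.

```python
import re

# Illustrative capture only (made-up MACs); real output varies by platform and release.
SAMPLE_SHOW_SWITCH = """\
Switch/Stack Mac Address : 0011.2233.4400 - Local Mac Address
Mac persistency wait time: Indefinite
                                             H/W   Current
Switch#   Role    Mac Address     Priority Version  State
------------------------------------------------------------
*1       Active   0011.2233.4400     15     V01     Ready
 2       Standby  0011.2233.4480     14     V01     Ready
 3       Member   0011.2233.4500     1      V01     V-Mismatch
 4       Member   0000.0000.0000     0              Provisioned
"""

# One member row: optional "*", number, role, MAC, priority, optional H/W version, state.
ROW = re.compile(
    r"^\s*(?P<local>\*)?\s*(?P<number>\d+)\s+(?P<role>\S+)\s+"
    r"(?P<mac>[0-9a-fA-F]{4}\.[0-9a-fA-F]{4}\.[0-9a-fA-F]{4})\s+"
    r"(?P<priority>\d+)\s+(?:(?P<hw>V\d+)\s+)?(?P<state>.+?)\s*$"
)


def parse_show_switch(text):
    """Return one dict per member row; header and separator lines are skipped."""
    members = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            members.append({
                "number": int(m["number"]),
                "role": m["role"],
                "mac": m["mac"],
                "priority": int(m["priority"]),
                "hw_version": m["hw"] or "",
                "state": m["state"],
            })
    return members


def not_ready(members):
    """Members whose state is anything other than Ready."""
    return [m for m in members if m["state"].lower() != "ready"]


if __name__ == "__main__":
    for member in not_ready(parse_show_switch(SAMPLE_SHOW_SWITCH)):
        print(f"switch {member['number']}: {member['state']}")
```

Saving the parsed rows alongside the raw capture gives a baseline to compare against after any cabling change or reload.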
- Check StackWise port health
Use StackWise-specific port output and physical LED/cable inspection to determine whether the stack ring is closed or broken.
Example pattern only. Adjust for your environment before running.
show switch stack-ports
show interfaces status
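The open-ring check can also be run against saved `show switch stack-ports` text. This is a rough sketch under stated assumptions: the sample layout, the `OK`/`DOWN` spellings, and the helper names are illustrative, and the rule "every stack port on every member reports OK" is used here as a working definition of a closed ring.

```python
import re

# Illustrative capture only; column layout and status spelling differ between releases.
SAMPLE_STACK_PORTS = """\
Switch#    Port1       Port2
----------------------------------
  1         OK          OK
  2         OK          DOWN
  3         DOWN        OK
"""

# A data row is a switch number followed by exactly two status tokens.
PORT_ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\S+)\s*$")


def parse_stack_ports(text):
    """Return {switch_number: (port1_status, port2_status)} with upper-cased statuses."""
    ports = {}
    for line in text.splitlines():
        m = PORT_ROW.match(line)
        if m:
            ports[int(m.group(1))] = (m.group(2).upper(), m.group(3).upper())
    return ports


def down_ports(ports):
    """List (switch, port) pairs whose status is anything other than OK."""
    return [
        (switch, index)
        for switch, pair in sorted(ports.items())
        for index, status in enumerate(pair, start=1)
        if status != "OK"
    ]


def ring_is_closed(ports):
    # Assumption: a closed ring means every stack port on every member reports OK.
    return bool(ports) and not down_ports(ports)


if __name__ == "__main__":
    parsed = parse_stack_ports(SAMPLE_STACK_PORTS)
    print("ring closed:", ring_is_closed(parsed))
    for switch, port in down_ports(parsed):
        print(f"check cable on switch {switch} stack port {port}")
```

In the sample, the two down ports sit on adjacent members, which is the pattern a single bad or unseated cable would typically produce; the physical inspection still decides.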
- Review stack logs
Look for member join, election, version mismatch, StackWise link, or hardware messages around the time discovery stalled.
Example pattern only. Adjust for your environment before running.
show logging | include STACK|Stack|SWITCH|VERSION|DIAG
show logging
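When the full `show logging` output has been saved to a file, the same keyword filter can be applied offline. A small sketch follows; the log lines are hypothetical placeholders written for illustration (including the `%EXAMPLE-...` mnemonics), not messages copied from a device, and `stack_related` is an assumed helper name.

```python
import re

# Same alternation as the `include` filter in this step; matching is case-sensitive.
PATTERN = re.compile(r"STACK|Stack|SWITCH|VERSION|DIAG")

# Hypothetical log lines for illustration; not copied from a real device.
SAMPLE_LOG = """\
Jan 10 02:14:01: %EXAMPLE-6-INFO: Interface Gi1/0/1 changed state to up
Jan 10 02:14:07: %EXAMPLE-4-STACK: Stack port 2 on switch 2 is down
Jan 10 02:14:09: %EXAMPLE-3-VERSION: Switch 3 image does not match the active switch
Jan 10 02:15:30: %EXAMPLE-5-CONFIG: Configured from console
"""


def stack_related(log_text):
    """Return the log lines that contain any of the stack-related keywords."""
    return [line for line in log_text.splitlines() if PATTERN.search(line)]


if __name__ == "__main__":
    for line in stack_related(SAMPLE_LOG):
        print(line)
```

Because the pattern is case-sensitive, a keyword such as `SWITCH` does not match `Switch`; widen the alternation if the messages on the target release use different capitalization.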
- Run supported diagnostics during an approved window
Use platform-supported diagnostic commands for the switch model and IOS XE release. Confirm the exact syntax in Cisco documentation for the target platform before running diagnostics; some diagnostic commands are platform-specific and some tests can be disruptive.
Example pattern only. Adjust for your environment before running.
show diagnostic result switch <switch-number>
diagnostic start switch <switch-number> test <test-id>
- Reload only after evidence points to stale discovery state
A reload interrupts traffic. Use it only when cabling, member state, logs, and maintenance approval support a controlled stack reload, and confirm console or out-of-band access before proceeding.
Example pattern only. Adjust for your environment before running.
reload
- Back up configuration before any destructive reset
If a factory reset or stack configuration reset is being considered, take a current backup first and verify restore access.
Example pattern only. Adjust for your environment before running.
copy running-config startup-config
copy running-config flash:pre-stack-reset-backup.cfg
show startup-config
- Reset configuration only as a last resort
This is destructive. `write erase` erases the startup configuration, so the stack comes up unconfigured after the reload. Use it only with a verified backup, console access, a restore plan, and an approved outage window.
Example pattern only. Adjust for your environment before running.
write erase
reload
Validation
- Run `show switch` and confirm all expected members show Ready state and correct active/standby/member roles.
- Run `show switch stack-ports` and confirm the StackWise ring is healthy with expected ports up.
- Review logs after the fix and confirm discovery, version mismatch, or stack port errors do not continue.
- Confirm downstream links and VLAN trunks are passing traffic after any reload or member recovery.
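The member-state and role checks above can be expressed as a small pass/fail function. This is a sketch only: it assumes the `show switch` rows have already been extracted into dictionaries with the hypothetical keys `number`, `role`, and `state`, and that a healthy multi-member stack has exactly one Active and one Standby.

```python
def validate_stack(members, expected_count):
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    if len(members) != expected_count:
        problems.append(f"expected {expected_count} members, found {len(members)}")

    for member in members:
        if member["state"] != "Ready":
            problems.append(
                f"switch {member['number']} is {member['state']}, not Ready"
            )

    roles = [member["role"] for member in members]
    if roles.count("Active") != 1:
        problems.append(f"expected 1 Active, found {roles.count('Active')}")
    # Assumption: any stack with more than one member should elect one Standby.
    if expected_count > 1 and roles.count("Standby") != 1:
        problems.append(f"expected 1 Standby, found {roles.count('Standby')}")

    return problems


if __name__ == "__main__":
    # Hypothetical post-change snapshot for a three-member stack.
    snapshot = [
        {"number": 1, "role": "Active", "state": "Ready"},
        {"number": 2, "role": "Standby", "state": "Ready"},
        {"number": 3, "role": "Member", "state": "V-Mismatch"},
    ]
    for problem in validate_stack(snapshot, expected_count=3):
        print(problem)
```

An empty result covers only the control-plane view; log review and traffic checks on downstream links still need to be done separately.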
Logs to Check
- Cisco IOS XE `show logging` output around member join/discovery time.
- StackWise port and member state output from `show switch` and `show switch stack-ports`.
- Hardware diagnostic output from `show diagnostic result switch <n>`.
Rollback and Escalation
- Restore the pre-change running configuration if reset or reconfiguration creates service impact.
- Replace a failed stack cable or isolate a failed member if diagnostics point to hardware.
- Roll back IOS XE only through the organization's standard image management process.
Escalate When
- Escalate before running `write erase` or any factory reset command.
- Escalate when stack cabling looks healthy but member diagnostics fail.
- Escalate when a production reload would affect redundant paths, access switching, or uplinks without a maintenance window.
Edge Cases
- A stack can keep running with an open ring, but it has no redundancy left and a second cable or member failure can split it. Fix the physical StackWise ring instead of treating it as only a discovery issue.
- A provisioned-but-missing member may be expected after hardware replacement unless member numbering is cleaned up intentionally.
- Mixed IOS XE versions can keep a member from joining cleanly even when cabling is correct.
Notes from the Field
- The most useful first check is often physical: StackWise cable seating, port LEDs, and whether the ring is open. The CLI confirms what the rack is already trying to tell you.
- Treat `write erase` as a recovery operation, not normal troubleshooting. A stack reset without a config backup turns a discovery issue into an outage.