Backups are not the same as recovery. A business only knows it can recover when restore paths are documented, tested, monitored, and aligned to business priorities. The organisations that recover well from ransomware, hardware failure, cloud outage, or human error are the ones that converted backup assumptions to evidence before the incident, not after.
This is a practical resilience checklist. The questions are deliberately platform-neutral. The same answers are needed whether the environment runs on Veeam, Azure Backup, AWS Backup, or a mix of all of them.
What are your critical services?
List the systems the business cannot operate without. Include identity, email, finance, line-of-business applications, databases, file storage, network connectivity, and customer-facing services.
The answer is rarely a single tier. Most businesses have:
- Tier 1 — mission critical. Without these, the business stops operating within hours. Usually identity, email, primary line-of-business application, payment processing, and customer-facing services.
- Tier 2 — business critical. Significant disruption within a day. Finance systems, HR, internal collaboration, secondary line-of-business applications.
- Tier 3 — important. Disruption tolerable for several days. Reporting, analytics, archived data, secondary departmental tools.
- Tier 4 — low priority. Disruption tolerable for a week or more. Archive content, historical records, departmental utilities.
The exercise itself is valuable. Most organisations discover that the priorities the IT team assumes do not match the priorities the business owners would set. Identity is often under-prioritised; some applications are over-prioritised because they were once critical and nobody has revisited the ranking since.
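One way to make the tiering exercise concrete is to record both views of each service and list the disagreements. The sketch below is illustrative only: the `Service` fields, the `tier_mismatches` helper, and the sample inventory are hypothetical names, not part of any backup product.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    MISSION_CRITICAL = 1
    BUSINESS_CRITICAL = 2
    IMPORTANT = 3
    LOW_PRIORITY = 4


@dataclass(frozen=True)
class Service:
    name: str
    it_tier: Tier        # tier the IT team assumes (hypothetical field)
    business_tier: Tier  # tier the business owner signs off (hypothetical field)


def tier_mismatches(services):
    """Return services where the two views disagree, most critical first."""
    disputed = [s for s in services if s.it_tier != s.business_tier]
    return sorted(disputed, key=lambda s: (min(s.it_tier, s.business_tier), s.name))


# Made-up inventory for illustration.
inventory = [
    Service("identity", Tier.BUSINESS_CRITICAL, Tier.MISSION_CRITICAL),
    Service("email", Tier.MISSION_CRITICAL, Tier.MISSION_CRITICAL),
    Service("legacy-reporting", Tier.MISSION_CRITICAL, Tier.IMPORTANT),
]

for s in tier_mismatches(inventory):
    print(f"{s.name}: IT says tier {s.it_tier.value}, "
          f"business says tier {s.business_tier.value}")
```

The disputed entries are the agenda for the conversation with the business owners; the agreed tier is whatever they sign off.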
What are the recovery objectives?
Each critical system needs a recovery time objective (RTO) and a recovery point objective (RPO). These define how quickly the service must return and how much data loss is acceptable.
| Tier | RTO | RPO |
|---|---|---|
| Tier 1 | 1–4 hours | 15 minutes – 1 hour |
| Tier 2 | 4–24 hours | 1–4 hours |
| Tier 3 | 24–72 hours | 4–24 hours |
| Tier 4 | 72 hours+ | 24 hours+ |
The exact values depend on the business. What matters is the discipline of stating them for every critical system, with executive sign-off. An RTO and RPO without executive sign-off are an IT opinion. With sign-off, they are a contract that can be tested and held to.
The numbers also drive the architecture. A 15-minute RPO requires continuous replication or near-continuous backup. A 4-hour RTO for a complex application requires pre-provisioned recovery infrastructure, not “we’ll spin up a new server and restore”. The architecture cost varies by orders of magnitude across these tiers — which is why categorisation matters before architecture.
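The link between the targets and the architecture can be checked with simple arithmetic: with periodic backups, the worst-case data loss is roughly the interval between backups, and the achieved recovery time is whatever the last restore test measured. The sketch below assumes the upper bound of each range in the table as the target for Tiers 1 to 3; the `TARGETS` table and `meets_targets` function are hypothetical names, and real targets should come from the signed-off values.

```python
from datetime import timedelta

# Illustrative targets: upper bounds of the Tier 1-3 ranges in the table.
TARGETS = {
    1: {"rto": timedelta(hours=4), "rpo": timedelta(hours=1)},
    2: {"rto": timedelta(hours=24), "rpo": timedelta(hours=4)},
    3: {"rto": timedelta(hours=72), "rpo": timedelta(hours=24)},
}


def meets_targets(tier, backup_interval, measured_restore_time):
    """Compare a backup schedule and a measured restore time with the tier targets.

    Worst-case data loss is approximated by the interval between backups.
    """
    target = TARGETS[tier]
    return {
        "rpo_ok": backup_interval <= target["rpo"],
        "rto_ok": measured_restore_time <= target["rto"],
    }


# A Tier 1 system backed up nightly that took three hours to restore in a test.
result = meets_targets(1, timedelta(hours=24), timedelta(hours=3))
print(result)  # {'rpo_ok': False, 'rto_ok': True}
```

In this example the restore is fast enough, but a nightly backup cannot satisfy a one-hour RPO, which is the kind of gap that forces a move to replication or more frequent backups.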
Are backups isolated and protected?
Backups should be protected from accidental deletion, compromised admin accounts, ransomware, and platform failure. Consider immutability, separation of duties, restricted access, and offline or isolated copies where appropriate.
The minimum protections we look for:
- Immutability. Backup data cannot be modified or deleted within a defined retention window, even by an administrator. Azure Backup immutable vaults, AWS Backup Vault Lock, Veeam Hardened Repository, and similar mechanisms enforce this technically.
- Separation of duties. The credentials that operate production systems are different from the credentials that operate backup systems. The backup admin cannot be the production admin and vice versa.
- Restricted access. The backup management plane has its own MFA, conditional access, and ideally lives in a separate identity tenant or a hardened administrative segment.
- 3-2-1 minimum. Three copies of the data, on two different media types, with at least one off-site or in a different cloud region.
- Air-gapped or logically isolated copies. At least one backup copy that cannot be reached from the production environment under normal operating conditions.
- Documented retention. Defined retention periods that match legal and operational requirements, not “as long as the storage holds out”.
Ransomware operators target backups specifically because compromised backups eliminate the safe-recovery option. The most expensive ransomware incidents are those where the backups had been encrypted, deleted, or made inaccessible before the production attack.
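Several of the protections listed above can be expressed as checks over an inventory of data copies. The following is a minimal sketch, assuming the inventory lists every copy including the production one; the `BackupCopy` fields and the `audit_copies` function are hypothetical and do not query any backup platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str
    media: str                       # e.g. "disk", "object", "tape"
    offsite: bool
    immutable: bool
    reachable_from_production: bool


def audit_copies(copies):
    """Return the failed checks; an empty list means every check passed."""
    findings = []
    if len(copies) < 3:
        findings.append("fewer than three copies of the data")
    if len({c.media for c in copies}) < 2:
        findings.append("fewer than two media types")
    if not any(c.offsite for c in copies):
        findings.append("no off-site copy")
    if not any(c.immutable for c in copies):
        findings.append("no immutable copy")
    if not any(not c.reachable_from_production for c in copies):
        findings.append("no copy is isolated from production")
    return findings


# Made-up estate: 3-2-1 is satisfied, but every copy is reachable from production.
copies = [
    BackupCopy("primary datacentre", "disk", False, False, True),
    BackupCopy("on-site backup repository", "disk", False, True, True),
    BackupCopy("cloud object storage", "object", True, True, True),
]
print(audit_copies(copies))  # ['no copy is isolated from production']
```

Separation of duties, restricted access, and retention are not modelled here; they need evidence from the identity and backup platforms rather than a simple inventory.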
Have restores been tested?
A backup that has not been restored is an assumption. Recovery tests should validate data integrity, permissions, application dependencies, DNS, certificates, identity, and user access.
The test cadence we recommend by tier:
| Tier | Restore test cadence | Scope |
|---|---|---|
| Tier 1 | Quarterly | Full restore of representative workload, end-to-end validation |
| Tier 2 | Half-yearly | Sample restore, integrity validation |
| Tier 3 | Annual | Sample restore |
| Tier 4 | Annual | Sample restore |
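The cadence in the table can be tracked mechanically. This sketch assumes quarterly means about 91 days, half-yearly about 182 days, and annual 365 days; `MAX_AGE` and `overdue_restore_tests` are hypothetical names, and the dates are invented for the example.

```python
from datetime import date, timedelta

# Maximum age of the last successful restore test per tier (assumed day counts).
MAX_AGE = {
    1: timedelta(days=91),
    2: timedelta(days=182),
    3: timedelta(days=365),
    4: timedelta(days=365),
}


def overdue_restore_tests(last_tested, today):
    """Return the names of systems whose restore test is missing or too old.

    `last_tested` maps system name -> (tier, date of last successful test or None).
    """
    overdue = []
    for name, (tier, tested_on) in last_tested.items():
        if tested_on is None or today - tested_on > MAX_AGE[tier]:
            overdue.append(name)
    return sorted(overdue)


records = {
    "identity": (1, date(2024, 1, 15)),
    "finance": (2, date(2024, 2, 1)),
    "archive": (4, None),  # never tested
}
print(overdue_restore_tests(records, today=date(2024, 6, 30)))
# ['archive', 'identity']
```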
The “end-to-end validation” for Tier 1 is the part most often skipped. Restoring a database to a recovery environment proves the data file is recoverable. It does not prove the application starts, the integrations work, the certificates are valid, the DNS records resolve, the identity provider authenticates the recovered service, and the users can sign in.
A real Tier 1 restore test simulates the full recovery sequence: restore the application, restart the integrations, validate the data, sign in as a representative user, complete a representative transaction. The first time this is tested it usually fails for reasons unrelated to the backup itself. That is the point — surfacing those gaps in a controlled exercise is much cheaper than discovering them in an incident.
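The recovery sequence described above can be captured as an ordered list of checks that stops at the first failure, so the report shows exactly where the restore broke down. This is a sketch under stated assumptions: `run_restore_test` is a hypothetical helper, and the lambdas stand in for real checks that would be written per application.

```python
def run_restore_test(steps):
    """Run (name, check) pairs in order and stop at the first failing step.

    A check passes when it returns a truthy value and fails when it returns
    a falsy value or raises an exception.
    """
    results = []
    for name, check in steps:
        try:
            passed = bool(check())
        except Exception:
            passed = False
        results.append((name, passed))
        if not passed:
            break
    return results


# Placeholder checks; the sign-in step simulates a failure.
steps = [
    ("restore application", lambda: True),
    ("restart integrations", lambda: True),
    ("validate data", lambda: True),
    ("sign in as representative user", lambda: False),
    ("complete representative transaction", lambda: True),
]

results = run_restore_test(steps)
for name, passed in results:
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
```

Here the data restored cleanly and the test still failed at sign-in, which mirrors the typical first result: the gap is in identity or integration, not in the backup.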
Is there a runbook?
During an incident, people need clear instructions. Document who decides, who communicates, who restores, who validates, and what order systems are recovered in.
A useful incident runbook covers:
- Decision authority. Who declares an incident, who authorises invoking the DR plan, who authorises customer communication.
- Roles. Incident commander, technical lead, communications lead, and business liaison, each with a named primary and backup person.
- Recovery sequence. Identity first, then network, then critical applications, then secondary applications. The sequence is documented, not improvised.
- Communication templates. Pre-written holding statements for customers, staff, and partners. Modified during the incident, but starting from a template.
- External contacts. Vendor support numbers, external legal counsel, regulator contacts (POPIA Information Regulator notification path), insurer details.
- Recovery validation checklist. What “recovered” means for each critical system before declaring service restored.
- Post-incident review. The commitment that every incident produces a documented post-mortem within two weeks.
Runbooks that have not been tested during a tabletop exercise are unproven. The annual DR test should include the runbook, not just the technical recovery.
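A documented recovery sequence can be derived from declared dependencies rather than memory. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the dependency map is a hypothetical example estate that follows the identity-then-network ordering described in the runbook list.

```python
from graphlib import TopologicalSorter

# Each system maps to the set of systems that must be recovered before it.
# Hypothetical estate for illustration.
depends_on = {
    "identity": set(),
    "network": {"identity"},
    "line-of-business app": {"identity", "network"},
    "email": {"identity", "network"},
    "reporting": {"line-of-business app"},
}


def recovery_order(dependencies):
    """Return one valid order in which every system follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


order = recovery_order(depends_on)
for position, system in enumerate(order, start=1):
    print(position, system)
```

Systems with no dependency between them, such as email and the line-of-business application here, may appear in either order, so the runbook still needs a business decision on which goes first. If the map contains a circular dependency, `static_order()` raises `graphlib.CycleError`, which is itself a useful finding before an incident.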
Ransomware-specific considerations
Ransomware has changed the resilience question. Pre-ransomware, recovery was about hardware failure, accidental deletion, and platform outage. Post-ransomware, the threat actor is actively trying to compromise both production and backup environments, often spending weeks inside the network before triggering encryption.
The patterns that improve ransomware resilience specifically:
- Immutable backups in a separately-administered system. The most-targeted asset gets the most-isolated administration.
- Offline or air-gapped copies. A backup that is not reachable over the network when ransomware is active is the safe-recovery point.
- Detection on backup systems. Unusual deletion patterns, mass operations, or admin sign-ins from unexpected locations should alert.
- Tested clean-room recovery. The ability to rebuild a known-clean environment from immutable backups, with the assumption that the production environment cannot be trusted.
- MFA and conditional access on backup admin paths. Same identity discipline as the rest of the privileged access model.
The organisations we have helped recover from ransomware are unanimous in their feedback: the immutable, isolated backup copy was the difference between paying the ransom and not. Everything else was a complication; that one decision was decisive.
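Detection on backup systems, as listed in the patterns above, can start with something as simple as counting deletions in a sliding time window. This is a minimal sketch: the `(timestamp, action)` event shape, the ten-minute window, and the threshold of five are assumptions for illustration, not a format or default from any backup product.

```python
from collections import deque
from datetime import datetime, timedelta


def deletion_bursts(events, window=timedelta(minutes=10), threshold=5):
    """Return timestamps at which deletions inside `window` reach `threshold`.

    `events` is an iterable of (timestamp, action) tuples sorted by time.
    """
    recent = deque()  # timestamps of deletions still inside the window
    alerts = []
    for ts, action in events:
        if action != "delete":
            continue
        recent.append(ts)
        while ts - recent[0] > window:
            recent.popleft()
        if len(recent) >= threshold:
            alerts.append(ts)
    return alerts


start = datetime(2024, 1, 1, 2, 0)
# Six deletions one minute apart, preceded by a routine backup job.
events = [(start - timedelta(minutes=30), "backup")]
events += [(start + timedelta(minutes=i), "delete") for i in range(6)]

alerts = deletion_bursts(events)
print(len(alerts))  # 2
```

A production rule would also cover the other signals mentioned above, such as mass operations and admin sign-ins from unexpected locations, and would send alerts to a system the backup administrators cannot silence.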
Where to start
If your organisation has not yet done a backup and DR readiness assessment, we suggest working through these steps in order:
- List every critical service and assign a tier.
- Define and get executive sign-off on RTO and RPO targets per tier.
- Audit backup coverage, immutability, isolation, and retention against the tiers.
- Run a real restore test on at least one Tier 1 workload.
- Document the runbook and test it in a tabletop exercise.
- Embed the cadence: quarterly for Tier 1, half-yearly for Tier 2, annual for Tiers 3 and 4.
Each step produces measurable improvement. Together, they convert resilience from an assumption into a tested capability.
The takeaway
Resilience is a business capability. The technology matters, but clear ownership, testing, monitoring, and decision-making matter just as much. Most organisations have most of the technology already; what they often lack is the operating discipline that makes the technology trustworthy.
This sits within our backup, disaster recovery and resilience service and overlaps with our cybersecurity and compliance work. Start the conversation if you want a structured assessment of where your resilience maturity is today.