Understanding Cyber Resilience on AWS
Cyber resilience emphasizes the ability to recover workloads to a known-good state after an environment is compromised. Prevention and detection strategies focus on blocking or identifying threats; cyber resilience focuses on restoring a trustworthy environment once those controls have failed. This matters most when backups, credentials, or infrastructure may themselves be compromised and can no longer be trusted. Organizations hosting critical workloads on AWS are increasingly prioritizing such strategies to address ransomware, data extortion, and destructive events.
Recovery plans must account for the possibility that the recovery environment itself could become a target. By creating a design that isolates recovery processes from production environments, organizations can reduce their exposure to cascading failures and more effectively regain control.
Isolating Recovery from Production
The foundation of cyber resilience lies in ensuring that the recovery environment operates independently of the compromised production environment. Identities, keys, and network paths used by recovery systems must not depend on, or extend trust to, their production counterparts. With that separation in place, recovery can proceed even when production identities are compromised.
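One way to make the identity part of this separation checkable is to scan the trust policies of IAM roles in the recovery environment for principals that belong to production accounts. The following is a minimal sketch, not a complete audit: the function names and the example account IDs are hypothetical, and it only inspects `Allow` statements with `AWS` principals in a standard IAM trust policy document.

```python
from typing import Iterable, List, Optional, Set


def trusted_principals(trust_policy: dict) -> Set[str]:
    """Collect AWS principals that a role trust policy allows to assume the role."""
    found: Set[str] = set()
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM permits a single statement object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal", {})
        if principal == "*":  # wildcard principal: anyone may assume the role
            found.add("*")
            continue
        aws = principal.get("AWS", [])
        if isinstance(aws, str):
            aws = [aws]
        found.update(aws)
    return found


def account_id_of(principal: str) -> Optional[str]:
    """Extract the 12-digit account ID from an ARN or a bare account ID."""
    if principal.isdigit() and len(principal) == 12:
        return principal
    parts = principal.split(":")
    if len(parts) >= 5 and parts[0] == "arn":
        return parts[4]
    return None


def isolation_violations(trust_policy: dict,
                         production_account_ids: Iterable[str]) -> List[str]:
    """Return principals that would let production identities into recovery."""
    production = set(production_account_ids)
    return [
        p for p in sorted(trusted_principals(trust_policy))
        if p == "*" or account_id_of(p) in production
    ]


# Hypothetical example: 111111111111 is production, 222222222222 is recovery.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": [
            "arn:aws:iam::111111111111:role/Deploy",
            "arn:aws:iam::222222222222:root",
        ]},
        "Action": "sts:AssumeRole",
    }],
}
violations = isolation_violations(example_policy, {"111111111111"})
```

A non-empty result means a role in the recovery environment can be assumed from production, which is exactly the shared trust the isolation model is meant to rule out.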
A widely used method is to create separate AWS accounts within an organization in AWS Organizations and give each account a distinct role:
Production accounts host operational workloads and are isolated as soon as a cyber event is detected. Recovery activities are not performed in these accounts, to prevent further compromise.
The recovery account owns the AWS Backup logically air-gapped vault, which protects backups from unauthorized deletion.
Leveraging Logically Air-Gapped Vaults
AWS Backup lets you create logically air-gapped vaults. These vaults store deletion-protected recovery points, so neither a malicious actor nor an accidental action can remove them. The result is a repository of backups that can still be relied on during recovery.
Because the backup storage is separated from production environments, compromised production credentials or infrastructure do not put the recovery points at risk. The vault therefore preserves the integrity of recovery points independently of the workloads they protect.
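As an illustration, a vault of this kind could be created from the recovery account along the following lines. This is a sketch under stated assumptions: it assumes the boto3 AWS Backup client exposes `create_logically_air_gapped_backup_vault` with `BackupVaultName`, `MinRetentionDays`, and `MaxRetentionDays` parameters, which should be checked against the current boto3 documentation. The helper names, vault name, and retention values are hypothetical.

```python
from typing import Any, Dict


def build_vault_request(name: str, min_retention_days: int,
                        max_retention_days: int) -> Dict[str, Any]:
    """Build request parameters for a logically air-gapped vault.

    Only basic sanity checks are done here; the service may enforce
    stricter limits on retention values.
    """
    if not name:
        raise ValueError("vault name must not be empty")
    if min_retention_days < 1:
        raise ValueError("min_retention_days must be at least 1")
    if max_retention_days < min_retention_days:
        raise ValueError("max_retention_days must be >= min_retention_days")
    return {
        "BackupVaultName": name,
        "MinRetentionDays": min_retention_days,
        "MaxRetentionDays": max_retention_days,
    }


def create_recovery_vault(backup_client: Any, name: str,
                          min_retention_days: int,
                          max_retention_days: int) -> Any:
    """Create the vault with a client bound to the recovery account.

    Assumption: backup_client is a boto3 "backup" client (or a compatible
    stand-in) created with recovery-account credentials, not production ones.
    """
    request = build_vault_request(name, min_retention_days, max_retention_days)
    return backup_client.create_logically_air_gapped_backup_vault(**request)


# Usage sketch (requires boto3 and recovery-account credentials):
#   import boto3
#   client = boto3.client("backup")
#   create_recovery_vault(client, "recovery-vault", 7, 35)
```

Passing the client in as an argument keeps the account choice explicit at the call site, which makes it harder to create the vault with production credentials by accident.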
Implementing the Rebuild-Restore-Rotate Framework
The Rebuild-Restore-Rotate (RRR) framework is a structured approach to recovery decisions. When facing a cyber event, organizations must determine, for each component, whether to rebuild it from scratch, restore it from backup, or rotate it. This decision is informed by the nature of the compromise and the trustworthiness of available recovery points.
Rebuilding is often applied to code and configurations that can be regenerated from source control. Restoring is used for data stored in protected backups, while rotation is critical for sensitive assets like keys and credentials that may have been exposed during the attack.
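The mapping described above can be expressed as a small planning helper. This is an illustrative sketch only: the asset kinds (`code`, `config`, `data`, `secret`) and the default mapping are assumptions drawn from the examples in the text, and a real plan would also weigh the nature of the compromise.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List


class Action(Enum):
    REBUILD = "rebuild"   # regenerate from source control
    RESTORE = "restore"   # recover from protected backups
    ROTATE = "rotate"     # replace keys and credentials that may be exposed


@dataclass(frozen=True)
class Asset:
    name: str
    kind: str  # hypothetical categories: "code", "config", "data", "secret"


# Assumed default mapping, following the examples in the text.
DEFAULT_ACTIONS: Dict[str, Action] = {
    "code": Action.REBUILD,
    "config": Action.REBUILD,
    "data": Action.RESTORE,
    "secret": Action.ROTATE,
}


def plan_recovery(assets: Iterable[Asset]) -> Dict[Action, List[str]]:
    """Group asset names by the recovery action that applies to them."""
    plan: Dict[Action, List[str]] = {action: [] for action in Action}
    for asset in assets:
        if asset.kind not in DEFAULT_ACTIONS:
            # Unknown kinds need a human decision rather than a silent default.
            raise ValueError(f"no default action for kind {asset.kind!r}")
        plan[DEFAULT_ACTIONS[asset.kind]].append(asset.name)
    return plan


plan = plan_recovery([
    Asset("orders-service", "code"),
    Asset("orders-db", "data"),
    Asset("kms-signing-key", "secret"),
    Asset("vpc-template", "config"),
])
```

Raising on unknown kinds is a deliberate choice in this sketch: an asset that does not fit one of the three actions should be reviewed rather than quietly restored.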
Choosing the Right Recovery Point
Selecting the appropriate recovery point is a pivotal decision during the recovery process. The most recent backup may not always be safe, as it could carry the same threat that triggered the event. Validation pipelines play a key role in this process, examining backups to ensure they are both recoverable and free from malicious payloads.
By integrating automated validation processes, teams can systematically evaluate the integrity of backups before initiating recovery. This reduces the risk of reinfection and accelerates the path to restoration.
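The selection logic can be sketched as "walk back from the newest recovery point until one passes validation". In the example below, `RecoveryPoint`, `select_recovery_point`, the `is_clean` callback, and the optional `compromised_since` cutoff are all hypothetical names introduced for illustration; the validation step itself (restore test, malware scan) is assumed to be provided by the pipeline.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class RecoveryPoint:
    identifier: str        # placeholder for a recovery point ARN
    created_at: datetime


def select_recovery_point(
    points: Iterable[RecoveryPoint],
    is_clean: Callable[[RecoveryPoint], bool],
    compromised_since: Optional[datetime] = None,
) -> Optional[RecoveryPoint]:
    """Return the newest recovery point that passes validation.

    Points created at or after compromised_since (if known) are skipped
    without validation, since they may carry the same threat.
    """
    for point in sorted(points, key=lambda p: p.created_at, reverse=True):
        if compromised_since is not None and point.created_at >= compromised_since:
            continue
        if is_clean(point):
            return point
    return None  # no trustworthy recovery point was found


# Example with a stand-in validator that flags one point as infected.
points = [
    RecoveryPoint("rp-mon", datetime(2024, 1, 1)),
    RecoveryPoint("rp-tue", datetime(2024, 1, 2)),
    RecoveryPoint("rp-wed", datetime(2024, 1, 3)),
    RecoveryPoint("rp-thu", datetime(2024, 1, 4)),
]
infected = {"rp-wed"}
chosen = select_recovery_point(
    points,
    is_clean=lambda p: p.identifier not in infected,
    compromised_since=datetime(2024, 1, 4),
)
```

Here the newest point is skipped because it falls inside the suspected compromise window, the next is rejected by validation, and the helper settles on `rp-tue`.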