If you have never seen or heard the term ‘RTO’ in the context of your business continuity plans or tests, then this will give you a solid next step to ensure that you’re in a good position. Unfortunately, nearly 80% of all SMBs are in the same boat, which has been and continues to be massively exploited by criminal organizations using ransomware to make money. Lots of it.
To paraphrase an old tech adage: “If you can’t recover quickly, then it’s not a backup.”
What is RTO?
Recovery Time Objective, or RTO, is the target time within which business operations must be restored after any downtime event, whether caused by hardware failures, ransomware infections, software errors, human errors, or natural disasters.
Unfortunately, for many businesses, the problems that arise when RTO is not a key component of the plan aren’t realized until it’s too late. Many organizations have found this out over the last few years because of the ever-growing threat of ransomware attacks.
Many businesses with preventive measures and backups in place end up in a bad situation because their plan didn’t factor in the recovery time for restoring production databases or mission-critical applications. Read our Tale of Two Ransomware Victims for more info.
What is business continuity and what role does RTO play?
Business continuity is the ability of a business to remain in operation despite risks, downtime events, and disasters. By the numbers, 80% of businesses experience some type of unplanned downtime. Of this total, some experience catastrophic outages that knock them offline for 3-5 days, and a portion of these never recover and ultimately go out of business as a result of the outage.
Simply put, RTO is Business Continuity.
A proper business continuity plan includes:
- Identifying potential downtime risks
- Evaluating the business impact of those risks
- Identifying ways to prevent those risks
- Identifying ways to recover from downtime
- Regularly testing those methods against specific risks
- Regularly re-evaluating risks and methods
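The testing step in the list above can be made concrete by recording how long each test restore actually took and comparing it with that system's RTO. The following is a minimal sketch, not a prescribed method: the `RecoveryTest` class, the system names, and every duration are hypothetical values chosen purely for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryTest:
    """One recovery drill for one system (all values are illustrative)."""
    system: str
    rto: timedelta        # target recovery time for this system
    measured: timedelta   # how long the test restore actually took

    def passed(self) -> bool:
        # A drill passes only if the restore finished within the RTO.
        return self.measured <= self.rto


def failing_tests(results):
    """Return the names of systems whose test restore exceeded the RTO."""
    return [r.system for r in results if not r.passed()]


# Hypothetical drill results, not taken from any real environment.
results = [
    RecoveryTest("file server", timedelta(hours=4), timedelta(hours=1, minutes=30)),
    RecoveryTest("order database", timedelta(hours=2), timedelta(hours=3)),
]

print(failing_tests(results))
```

Any system that shows up in the failing list either needs a faster recovery method or a more realistic RTO.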
Your prevention and recovery needs are directly based on the evaluation of risks. Such an evaluation is known formally by Project Management Professionals (PMPs) as a “risk register.” Don’t worry, it sounds like more work than it is.
It’ll actually save you time and ensure that all your bases are covered by helping you understand your critical systems and their dependencies.
Evaluating Your Risks
Evaluating risks can start out fairly general and become more specific as you get closer to making buying decisions. For example, the table below was developed by American Precision Industries and focuses on recovery at a system level.
| Application/Data/System | Impact | Chance | Risk Factor | Recovery Plan |
|---|---|---|---|---|
| CAD application server | 99% | 100% | 99% | Infrascale Disaster Recovery replicating from site A to site B. Local boot for testing or individual machines. File recovery readily available from either site. Spare hardware required in the event of hardware destruction. Restore time is less than 20 minutes once hardware is available for recovery. |
| Machine Tools | 100% | <1% | <1% | N/A. These units are closed systems. |
| CAD files | 80% | 100% | 80% | Files are protected by Infrascale Disaster Recovery and replicated to a secondary DR appliance and are available for restore within minutes. Files can be recovered to any USB device to then be fed to the machining tools’ systems. |
| Payroll DB | 60% | 100% | 60% | Infrascale Disaster Recovery replicating from site A to site B. Local boot available for recovery in less than 10 minutes. Production recovery time dependent on available hardware, less than 20 minutes once available. |
| Customer/Order DB | 80% | 100% | 80% | Infrascale Disaster Recovery replicating from site A to site B. Local boot available for recovery in less than 10 minutes. Production recovery dependent on available hardware, less than 20 minutes once available. |
| CAD user endpoints | 70% | 100% | 70% | Systems are backed up centrally and covered in DR backups onsite and replicated to the secondary. Endpoints can be restored within 20 minutes once hardware or a VM is available. |
The table above shows three things for each system: the impact on the business (“how much of the business will be inoperable if this system goes down?”), the chance of that system experiencing downtime (all risks included), and the risk factor, which is the product of Impact and Chance. The rule of thumb is to pay close attention to any Risk Factor over 10%.
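The arithmetic behind the risk factor is simple enough to script. The sketch below multiplies Impact by Chance for each row of the table and flags anything above the 10% rule of thumb. The `RiskEntry` class and `needs_attention` function are illustrative names, and because the table lists the Machine Tools chance only as “<1%”, the value 0.01 is used here as an assumed upper bound.

```python
from dataclasses import dataclass

ATTENTION_THRESHOLD = 0.10  # the 10% rule of thumb


@dataclass
class RiskEntry:
    system: str
    impact: float  # share of the business inoperable if this system is down
    chance: float  # likelihood of this system experiencing downtime

    @property
    def risk_factor(self) -> float:
        # Risk factor is the product of Impact and Chance.
        return self.impact * self.chance


# Values copied from the table; 0.01 stands in for "<1%" (an assumption).
register = [
    RiskEntry("CAD application server", 0.99, 1.00),
    RiskEntry("Machine Tools", 1.00, 0.01),
    RiskEntry("CAD files", 0.80, 1.00),
    RiskEntry("Payroll DB", 0.60, 1.00),
    RiskEntry("Customer/Order DB", 0.80, 1.00),
    RiskEntry("CAD user endpoints", 0.70, 1.00),
]


def needs_attention(entries, threshold=ATTENTION_THRESHOLD):
    """Return entries whose risk factor exceeds the threshold, highest first."""
    flagged = [e for e in entries if e.risk_factor > threshold]
    return sorted(flagged, key=lambda e: e.risk_factor, reverse=True)


for entry in needs_attention(register):
    print(f"{entry.system}: {entry.risk_factor:.0%}")
```

With these inputs, every system except the machine tools lands above the threshold, which matches the table: the closed machine tools are the only row without a recovery plan.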
Once all systems are listed and evaluated, you can begin weighing disaster recovery options and RTO targets. This will ensure that you end up with the plan you need rather than a mix of “too much” or, even worse, “too little.”
You can also add specific uptime goals for specific systems, like this:
| Application/Data/System | Hardware | Operating System | Recovery Time, Uptime Goal |
|---|---|---|---|
| CAD application server | IBM compatible | Windows | <12 hours, 99% |
| Machining Tools | Proprietary | Proprietary | N/A, 99.9% |
| CAD files | IBM compatible | Windows | <12 hours, 99% |
| Payroll DB | IBM compatible | Windows | <24 hours, 99% |
| Customer/Order DB | IBM compatible | Windows | <24 hours, 99% |
| CAD user endpoints | Various | Windows | <12 hours, 99% |
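It can help to translate an uptime percentage into an actual downtime budget and compare it with the recovery time goal. The sketch below assumes uptime is measured over a 365-day year, which the table does not specify, and the function names are illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def allowed_downtime_hours(uptime_target: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted in the period for a given uptime target."""
    return (1.0 - uptime_target) * period_hours


def worst_case_outages_per_year(uptime_target: float, rto_hours: float) -> int:
    """How many outages, each lasting the full recovery time goal,
    fit inside one year's downtime budget."""
    return int(allowed_downtime_hours(uptime_target) // rto_hours)


# Goals taken from the table above.
for system, uptime, rto_hours in [
    ("CAD application server", 0.99, 12),
    ("Payroll DB", 0.99, 24),
]:
    budget = allowed_downtime_hours(uptime)
    fits = worst_case_outages_per_year(uptime, rto_hours)
    print(f"{system}: {budget:.1f} h/year budget, "
          f"{fits} worst-case outages fit")
```

Under that assumption, a 99% goal allows roughly 87.6 hours of downtime per year, so a system with a 12-hour recovery goal can absorb about seven worst-case outages, while one with a 24-hour goal can absorb only three.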
The benefit of this preplanning far outweighs any time you would save by skipping it and “hoping” your setup will be enough. Every year, thousands of businesses discover that their “hope” was indeed a poor plan when something takes their business out of operation and they scramble to get back online.
Unfortunately, when it comes to recovery, there are no second chances.