Written on June 21, 2022
According to 2020 research from Gartner, the average cost of IT downtime is $5,600 per minute. For some companies, an hour or two of downtime can mean hundreds of thousands (if not millions) of dollars in lost revenue.
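To make the scale concrete, the per-minute figure can simply be multiplied out. The snippet below is a minimal sketch: the function name `downtime_cost` is illustrative, and the flat per-minute rate is a simplifying assumption, since real costs vary by company and by time of day.

```python
COST_PER_MINUTE_USD = 5_600  # average per-minute figure cited above


def downtime_cost(minutes: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Estimate the cost of an outage, assuming a flat per-minute rate."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


if __name__ == "__main__":
    for hours in (1, 2):
        print(f"{hours} h of downtime: ${downtime_cost(hours * 60):,.0f}")
```

At the average rate, one hour comes to $336,000 and two hours to $672,000, which is where the "hundreds of thousands" figure comes from.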
As a result, companies are increasingly chasing zero RPO and zero RTO solutions. Let’s take a look at what those terms mean, and how advances in database technology have made what would once have seemed impossible into an achievable goal.
RPO stands for Recovery Point Objective. It refers to how much data loss is considered acceptable when a failure/outage occurs.
RPO is typically measured in units of time. For example, a company with an RPO of ten minutes has decided that in the event of an outage, it can afford to lose up to ten minutes of data (lost transactions, etc.) before the company is seriously harmed.
Zero RPO is how companies describe a setup in which no data loss is acceptable, even in the event of an outage.
RTO stands for Recovery Time Objective. It defines how much time it is acceptable for an application to remain offline in the event of a failure or outage.
RTO is also typically measured in units of time. A company with an RTO of ten minutes has decided that it can afford for its application to be offline for up to ten minutes in the event of an outage.
Zero RTO is how companies describe a setup in which application downtime is never acceptable, even when an outage occurs.
It’s important to note that RPO and RTO work in tandem. In the case of a fast recovery with non-zero RPO, businesses have to either try to manually reconcile their accounts or live with data loss.
On the other hand, a zero RPO solution that recovers slowly would result in significant application downtime.
An optimal disaster recovery plan needs to take into account the required RPO and RTO for a given application. As a result, the two are often discussed together.
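Because both objectives are expressed as durations, a past incident can be checked against them with simple timestamp arithmetic. The following is a rough sketch rather than a prescribed method: the `Objectives` and `Incident` types, their field names, and the `evaluate` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Objectives:
    rpo: timedelta  # maximum acceptable window of lost data
    rto: timedelta  # maximum acceptable time offline


@dataclass
class Incident:
    last_durable_write: datetime  # newest data that survived the failure
    failure_time: datetime        # when the outage began
    service_restored: datetime    # when the application was back online


def evaluate(incident: Incident, objectives: Objectives) -> dict:
    """Compare the data loss window and downtime of an incident to the objectives."""
    data_loss_window = incident.failure_time - incident.last_durable_write
    downtime = incident.service_restored - incident.failure_time
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= objectives.rpo,
        "rto_met": downtime <= objectives.rto,
    }


if __name__ == "__main__":
    objectives = Objectives(rpo=timedelta(minutes=10), rto=timedelta(minutes=10))
    incident = Incident(
        last_durable_write=datetime(2022, 6, 21, 11, 52),
        failure_time=datetime(2022, 6, 21, 12, 0),
        service_restored=datetime(2022, 6, 21, 12, 25),
    )
    print(evaluate(incident, objectives))
```

In this made-up incident, eight minutes of data are lost (within a ten-minute RPO) but the application is offline for 25 minutes (well past a ten-minute RTO), which illustrates why the two numbers have to be planned together.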
Businesses care deeply about ensuring the resiliency and uptime of applications. Any downtime could directly impact top-line revenue, hurt brand perception, and divert valuable resource hours to failure recovery processes. As a result, CEOs, CIOs, and top-level technology executives are focused on meeting application uptime goals and minimizing the cost of infrastructure-level failures. For DBA teams, this means defining and meeting recovery point objectives (RPOs) and recovery time objectives (RTOs) for different tiers of applications.
For mission-critical applications, businesses need to get as close to zero RPO and RTO as possible to minimize the overall risk to both the business and their customers. An application that handles financial transactions with a non-zero RPO could lose deposits or transactions. A reservation system could lose customer reservations. Even worse, losing patient data in real-time healthcare systems could directly impact patient safety.
Is zero RPO/RTO actually achievable? Yes. While it was once impossible, it is now possible to architect zero RPO/RTO systems.
However, it’s not easy! Building a zero RPO and/or zero RTO system is incredibly complex and requires the right choices at every level of your tech stack.
Let’s take a closer look at how zero RPO and RTO are possible.
Getting to zero RPO and RTO is possible, but incredibly complicated.
Several architectural layers contribute to RPO and RTO, including database systems, clustering technology, data replication solutions, and storage replication. Each layer is a separate product that must be integrated, configured, and set up by the company. Each layer needs a team of experts to set up, manage, and maintain the system.
Often, the combination of products and the way the products are configured become unique to a company. There’s no one-size-fits-all formula for getting to zero RPO and RTO.
This entire discussion also assumes that a company’s databases can actually achieve zero RPO and low RTO. In many cases, this is not true. Active-active setups are supposed to continue serving traffic without data loss in the case of datacenter-level failures, but in practice, messages can be lost in transit between datacenters.
Further, these systems rely on timely detection of the failure to trigger recovery. Standby setups can have similar problems with lost messages during detection of failure and recovery.
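The lost-message problem is easiest to see in a toy model of asynchronous replication, where the primary acknowledges a write before the standby has received it. This is a deliberately simplified sketch, not a model of any particular product: the `AsyncPrimary` class and its methods are invented for illustration.

```python
class AsyncPrimary:
    """Toy primary that acknowledges writes before shipping them to a standby."""

    def __init__(self):
        self.log = []          # writes acknowledged to clients
        self.standby_log = []  # writes the standby has actually received

    def write(self, record):
        """Acknowledge the write immediately; replication happens later."""
        self.log.append(record)
        return "ack"

    def replicate(self, max_records=None):
        """Ship pending writes to the standby, optionally only the first few."""
        pending = self.log[len(self.standby_log):]
        if max_records is not None:
            pending = pending[:max_records]
        self.standby_log.extend(pending)

    def fail_over(self):
        """Primary is lost: return acknowledged writes the standby never saw."""
        return self.log[len(self.standby_log):]


if __name__ == "__main__":
    primary = AsyncPrimary()
    for tx in ("tx1", "tx2", "tx3", "tx4", "tx5"):
        primary.write(tx)
    primary.replicate(max_records=3)  # replication lags behind the acks
    print("lost on failover:", primary.fail_over())
```

Every write in the returned list was acknowledged to a client and then lost, which is exactly what a non-zero RPO means in practice.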
In contrast to active-active and standby setups, NoSQL solutions can run on more than two servers, providing higher availability and scalability. NoSQL has built-in replication, which means that businesses don’t need a separate solution to support replication or clustering.
However, NoSQL comes with its own hidden cost. Although it can survive failures, the “eventual consistency” it promises means that stale data and split-brain situations can occur, leaving DBAs with inconsistent data that they have to reconcile. Downtime (RTO) may be reduced, but RPO can be damaged if the data contained in the database is either stale or incorrect.
Further, recovery from disaster scenarios in NoSQL can sometimes take days, since data needs to be bootstrapped and repaired to ensure that it is usable and up to date.
CockroachDB delivers zero RPO by wrapping the complexity of building a highly resilient infrastructure into a single product. It reduces the component complexity of IT resilience by 75% by eliminating the need for separate replication, clustering, and storage solutions in order to achieve fault-tolerance. Instead, everything comes built into the database system software, reducing the cost and complexity associated with purchasing, deploying, and managing multiple solutions from multiple vendors.
Unlike NoSQL, CockroachDB provides ACID guarantees through consensus-based replication, so that data is always consistent and committed transactions are guaranteed to persist. CockroachDB can also be deployed on commodity hardware, since it has built-in resiliency for storage-level failures at the software layer. A more detailed description of each of the layers is available in the CockroachDB architecture documentation.
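The core idea behind consensus-based replication is that a write only counts as committed once a majority of replicas have accepted it. The arithmetic below is a generic majority-quorum sketch, not CockroachDB's actual implementation, and the function names are illustrative.

```python
def quorum_size(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be lost while a majority is still reachable."""
    return replicas - quorum_size(replicas)


def is_committed(acks: int, replicas: int) -> bool:
    """A write is committed once a majority of replicas have acknowledged it."""
    return acks >= quorum_size(replicas)


if __name__ == "__main__":
    for n in (3, 5):
        print(f"{n} replicas: quorum={quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

With three replicas, a write needs two acknowledgements and the group can lose one replica; with five, it needs three and can lose two. Because any two majorities overlap in at least one replica, a committed write is always present on a surviving member.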
Under the hood, CockroachDB intelligently replicates data across the cluster, spreading copies out across different availability zones to provide the highest level of fault tolerance based on the available infrastructure.
This means that for any hardware failure ranging from disk to datacenter-level disasters, CockroachDB can continue to serve client traffic while recovery takes place.
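Why spreading copies across availability zones matters can be checked with a few lines of code. The sketch below assumes majority-based replication and asks whether a given placement still has a majority after any single zone is lost. The function name and the zone labels are hypothetical, and this is not how CockroachDB itself decides placement.

```python
from collections import Counter
from typing import Sequence


def survives_zone_failure(replica_zones: Sequence[str]) -> bool:
    """Return True if losing any single zone still leaves a majority of replicas.

    `replica_zones` lists the zone of each replica, e.g. ["a", "b", "c"].
    """
    total = len(replica_zones)
    if total == 0:
        raise ValueError("need at least one replica")
    quorum = total // 2 + 1
    replicas_per_zone = Counter(replica_zones)
    # Check the worst case for every zone: all replicas in that zone are lost.
    return all(total - lost >= quorum for lost in replicas_per_zone.values())


if __name__ == "__main__":
    print(survives_zone_failure(["zone-a", "zone-b", "zone-c"]))  # one replica per zone
    print(survives_zone_failure(["zone-a", "zone-a", "zone-b"]))  # two replicas share a zone
```

Three replicas in three separate zones survive the loss of any one zone, while placing two of the three in the same zone means that zone's failure takes the majority with it.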
CockroachDB is also architected to support an average RTO of 4.5 seconds. This includes both the time it takes to detect a failure and the time it takes to recover from it.
No other database vendor can provide these guarantees along with the ease of use and operational simplicity of CockroachDB.
IT leaders tasked with the difficult mission of shipping products faster while managing cost and risk have historically had to make trade-offs between protecting their data and the cost of doing so.
With new database technologies like CockroachDB, IT leaders are empowered to make zero RPO a baseline requirement for all core business applications given the high cost and risk associated with data loss. Finally, IT leaders can reduce the complexity of their data architectures while reducing risk, freeing them up to build reliable and innovative products quickly.
The truth is that RPO and RTO only tell half the story. The cost of downtime extends beyond the revenue lost. There is also the time spent solving the problem, which is time not spent improving the core service or product. And there is the experience of the engineers and developers who are working on the solutions and dealing with the anxiety of downtime.
If you want to plan for survival instead of failure, modern distributed SQL databases are the best option. CockroachDB has a free managed database offering that is excellent for experimenting with the database if distributed SQL is unfamiliar territory.