How to be an IT Hero by Reducing Downtime to Minutes

Let me set the stage.

I head up the Marketing department here at Infrascale and am not as technically savvy as I probably should be.  So, when technology breaks I tend to get a little whiny and demand instant help from my IT team.

I’m sure the tech team at every organization must cope with us non-technical Muggles as we navigate new technology and demand always-on availability.

This past week, I was having an impromptu online meeting when the line went dead halfway through the conversation.  We had lost Internet connectivity and, after a few choice “driving words,” I quickly contacted our IT support team pleading for an immediate resolution.

Keep in mind, we’re a disaster recovery company.  This is what we do for a living. Now, most people wouldn’t consider this a disaster, but in my mind, this at the least qualifies as a micro-disaster.

Sergio, our intrepid IT support specialist, discovered that this wasn’t an isolated event (translation: not user error), as other users were also unable to connect to our Wi-Fi network. As Sergio began to tick off the steps of his standard troubleshooting procedure, he realized that our DHCP domain controller had frozen. Perhaps Sergio felt my withering gaze, but he knew that he couldn’t afford to work through his normal troubleshooting protocol. The only way he could quickly get us back online was to boot a virtual machine of the domain controller from a recent backup stored on our local appliance. Finding the root cause of the problem could wait. Job #1 was to get us back online.

Sergio logged into our Disaster Recovery appliance – just as we instruct our partners and customers to do – and booted the domain controller from the most recent backup. This enabled us to get back online within a minute or two. I quickly resumed my online meeting. No harm done.

When everyone was back online, Sergio was able to focus on resolving the root issue, which ended up requiring a forced reboot and 15-20 minutes of rolling back and reconfiguring Windows updates. These were important minutes that I didn’t have to spare.

Because of Sergio’s swift decision making, he reduced our downtime from around 30 minutes to just five minutes – simply by following our own guidance.

Think about how many of these micro-disasters happen to organizations every day. Think about the business costs, lost productivity, and opportunity costs if you aren’t able to open emails for 30 minutes, four hours, or even a day. These micro-disasters add up and significantly impact an organization’s bottom line.

But thankfully, there are now DRaaS solutions that let IT pros like Sergio step in and save the day.  You’ll still have to contend with us non-technical Muggles, but now you have a powerful and affordable tool in your arsenal to combat downtime, data loss, and us whiny executives.

99 problems but Automated Orchestration Ain’t 1

When it comes to your IT responsibilities, being able to recover data and maintain business continuity is among the most critical. But there are tradeoffs to factor in between cost, convenience, and benefit.

That’s why “DR tests” become much maligned – they require too much work, time, and effort to properly execute.  Fortunately, technologies like disaster recovery as a service (DRaaS) and CloudBoot Orchestration can alleviate the time, cost and complexity required to perform system recoveries.

01. Why are some DR tasks dreaded?

It truly comes down to a classic cost-benefit analysis. Dreaded tasks are the ones with high costs and moderate-to-high benefits, especially when those benefits are not immediate—think about how often most people floss. When you look at a routine IT task like ‘run a site-wide recovery test,’ it’s obviously valuable, albeit not immediate, but requires a lot of upfront work and time – two commodities in increasingly short supply.

At Infrascale, we’re trying to make the process of scheduling and conducting DR tests simple. Brain-dead simple.  And our new drag and drop orchestration is the next step in that evolution.

Solution              | Convenience | Cost     | Benefit   | Test Frequency
Backup                | LOW         | HIGH     | VERY HIGH | ANNUALLY
DRaaS                 | HIGH        | LOW      | VERY HIGH | QUARTERLY
DRaaS + Orchestration | VERY HIGH   | VERY LOW | VERY HIGH | MONTHLY


02. Simplifying Recovery

Recovery of a full network can be a complicated affair. The administrator needs to be aware of the dependencies of all the applications and machines in use to get a business back in production.

Usually, this knowledge is shared tribally among the techs, or the expertise is assumed to be in place when disaster strikes. But this requires the system admin to deduce the recovery steps (and the system dependencies) by looking at what’s going on in the network. That’s a pretty tall order, especially when the disaster is highly visible, stressful, and revenue-impacting. This is not a good recovery plan, and it leaves too many opportunities for human error to creep into the equation. A recovery plan should make the steps to recover your systems second nature, with no learning needed.

With DRaaS-based orchestration, the administrator can predefine the order in which machines are recovered, and can add time-delays between specific machines or groups of machines to allow for additional tasks to be performed or to accommodate application/database load times.
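To make the idea of a predefined order with time delays concrete, here is a minimal Python sketch. It is not Infrascale’s orchestration format; the `BootStep` class, the machine names, and the delay values are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class BootStep:
    """One group of machines that is powered on together."""
    name: str
    machines: list[str]
    delay_after_s: int = 0  # pause before the next group starts


def build_schedule(steps: list[BootStep]) -> list[tuple[int, str, str]]:
    """Return (start_offset_seconds, group_name, machine) rows in boot order."""
    schedule = []
    offset = 0
    for step in steps:
        for machine in step.machines:
            schedule.append((offset, step.name, machine))
        offset += step.delay_after_s
    return schedule


# Hypothetical plan: names and delays are assumptions, not product defaults.
PLAN = [
    BootStep("core", ["dhcp-01", "dc-01"], delay_after_s=120),
    BootStep("databases", ["sql-01"], delay_after_s=300),
    BootStep("applications", ["exchange-01", "files-01"]),
]

if __name__ == "__main__":
    for offset, group, machine in build_schedule(PLAN):
        print(f"t+{offset:>4}s  {group:<12}  {machine}")
```

Writing the plan down as data, rather than keeping it in someone’s head, is what lets any technician follow the same sequence under pressure.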

This means that any technician tasked with recovering a full site will be able to reduce many steps that require pre-existing knowledge to a single series of documented steps that is simple, straightforward, and transferable.

03. Audits, SLAs and Trust

When recoveries and test recoveries become a convenient, high-value task, the pain normally associated with audits also falls by the wayside.

Easier testing means you can test more granular scenarios and ensure that your dependencies have all been mapped out and tested successfully. For example, you can simulate a ransomware attack that has infected all of your end users, the local backup files, mission-critical databases, and your file servers. You can even simulate a RAID failure on the hardware running vSphere, or the recovery of a CEO’s laptop that was crushed by TSA when an important presentation has to be restored within one hour. Now you have the time, before an actual disaster strikes, to play out these scenarios so that you can uncover any ‘gotchas’ and maintain your service agreements.

Regular testing and role playing translate into a smoother and more confident approach to real-life disaster scenarios. Click here to see how our Drag & Drop Orchestration simplifies your disaster recovery plans and system recoveries.

A Tale of Two Ransomware Victims

I have often touted how DRaaS should be deployed to mitigate the damage, downtime, and data loss associated with a ransomware attack. But when one of our partners, Pervasive Solutions, had two clients hit by ransomware within 24 hours, it offered powerful real-world evidence of what DRaaS can do.

You can read the entire case study here: Pervasive Solutions Case Study. But, here’s the abridged version.

Pervasive Solutions is a Victor, New York-based managed services provider, information security consulting firm, and Infrascale partner. Just after Thanksgiving last year, two of their clients – one a retailer and the other a local manufacturer – were targeted with ransomware. The actual company names have been obscured since these companies don’t want the fact that they were infected to be on the public record. This is pretty typical, and it is why many experts believe the ransomware threat to be much larger and more pervasive (pun intended) than has been reported.

What makes this story especially interesting is that the manufacturer was protected with Infrascale Disaster Recovery; the other wasn’t.  And this distinction made a huge difference when it came to Pervasive’s ability to recover their clients’ data and systems.

Here’s a quick snapshot of the results:

The key stat in this table is the amount of time required to get each client fully operational post infection. With our DRaaS solution, Jason Miglioratti, Director of Managed Services at Pervasive, was able to restore the manufacturer’s systems in less than an hour. Without a commercial-grade DRaaS solution, it took Jason and his team two full weeks to get the retailer back online and fully operational.

For many companies, being down for a few days would be catastrophic. Thankfully, the retailer was able to weather this storm. Still, it’s clear that all businesses need to bake operational resiliency into their IT infrastructure. Without a data protection and recovery strategy in place, organizations are leaving themselves wide open to significant financial and reputational loss.

Thankfully, Pervasive Solutions is part of a new and emerging breed of MSPs that are going well beyond reselling IT solutions. They’re educating their clients that data protection requires a four-pronged approach which includes user education, strong security systems (e.g., AV, firewall, email-filtering, application white-listing, etc.), cloud-based disaster recovery, and regular DR testing.

Just as importantly, SMBs need to wake up and start asking important questions. How soon could your company get back up and running if it were infected by ransomware? How quickly can you isolate and halt the spread of an infection? What would you do if your production database got encrypted? Would you be willing to pay the ransom?

If you want to start getting some answers to those questions, talk to us or one of our trusted partners.

 

How Ransomware is Beating your Backup

Traditional approaches to backup and DR simply don’t work against ransomware

It’s been over 2 hours since ransomware hit your business and you still have no update from your techs and none of your employees can work.

After what seems like an eternity, your technician emerges with a not-so-confident look and sheepishly admits “the problem is that the ransomware has infected your backups. I’m doing what I can to see how far back we can recover, but it doesn’t look good. We should begin setting up a bitcoin account in case we can’t recover from the backups within the next 15 hours, which is the amount of time we’ve been given to pay or they’ll delete the encryption key for good.”

You’re overcome with mixed emotions. You’ve been violated. You’re mad as hell. You’re unsure whether you’ll get your data back even if you pay the ransom. As you go through the phases of grief, you start to dwell on how the damage could spread beyond the business into your personal life. Eventually your head clears enough for you to start asking yourself how this came to be.

You did everything you thought was going to keep you safe, didn’t you?

  • You paid for a business-grade backup system
  • Your backups were regularly tested to make sure that they were working properly
  • Your backup drives were refreshed to protect against hardware failure

Why then? How did ransomware beat the system that was supposed to save you?

This is not uncommon. In fact, in North America alone, over $1 billion USD was paid in ransoms over the course of 2016 due to this very common scenario. 2017 is predicted to be worse. Much worse.

Here are four reasons why your backups didn’t save you:

One. These are criminal organizations and attacks are not random.

They have purposefully designed their viruses and exploit kits to increase the success rate of collecting ransom payments. They use social media and even your own website to figure out how to best penetrate your business. Who works there? What servers and services are your users and business using?

Two. Ransomware attacks are increasingly targeting your critical applications.

Previous viruses were largely covert, quietly stealing data for as long as possible without being discovered. In 2015, ransomware targeted users by encrypting files on individual machines before presenting clear instructions for payment.

By 2016, ransomware firms began targeting businesses by using your employees as entry points before accessing and encrypting critical applications (e.g., your Exchange server, SQL servers, Oracle database, etc.) on your network, locking you and your users out via strong encryption algorithms.

Any application, service, or network location with heavy traffic becomes a major target because the impact of downtime is heightened, increasing the value of the data being held hostage and, therefore, the likelihood that you’ll pay the ransom.

Three. Backup systems are their kryptonite, and are their top priority.

They know that a business’s ability to recover data and critical systems is directly related to the chance to collect a ransom payment. Therefore, these firms target backup files as a top priority before triggering their virus to encrypt files and display a ransom notice.

If backup and/or DR files are stored on a network-accessible drive, the ransomware viruses will be able to locate them.

Typical backup programs write files in a proprietary or common format. Known file-types are easy to search and discover once network access is gained.

In addition to file-type searches, ransomware kits will look at Volume Shadow Copy Service (VSS) logs as an easy way to find where backups are being written, since many backup services use VSS to create backups of databases and other open files.

Once the location is discovered, only a short time stands between the virus and your critical applications and files.
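From the defender’s side, the same file-type search can be turned into a quick exposure audit. The Python sketch below walks a share and lists backup-like files that the current account is able to modify. The extension list is illustrative only and is an assumption, not a catalog of what any particular backup product writes; run it with the same kind of account an intruder might compromise.

```python
import os
from pathlib import Path

# Illustrative extensions only (an assumption); adjust to what your backup product writes.
BACKUP_EXTENSIONS = {".bak", ".vhd", ".vhdx", ".vmdk", ".tib"}


def find_exposed_backups(root: str) -> list[Path]:
    """List backup-like files under root that the current account can modify."""
    exposed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = Path(dirpath) / filename
            # A file that is both recognizable and writable is what ransomware looks for.
            if path.suffix.lower() in BACKUP_EXTENSIONS and os.access(path, os.W_OK):
                exposed.append(path)
    return sorted(exposed)
```

If this returns anything for a network-accessible location, that backup set is reachable by whatever compromises the account.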

Four. Backup systems typically store files on administratively accessible drives/locations.

Gaining network administrative access is a primary objective because it allows ransomware variants to read/write data on the most critical locations on the network. With this access, they can encrypt the backup files themselves, which means you can’t even run a test recovery to check for infected files: the backup file itself is completely useless. This situation leaves a single option to recover the data: pay the ransom.

What can you do?

Get a cloud backup/DR system.

By moving backup/DR files to the cloud, you can at least recover a previous version before the infection took place, since the virus will not be able to access and infect files already stored in the cloud.

You still have to download and recover the files to a safe location and test recoveries for individual file infections before moving to a production environment. This can take time, but at least you haven’t lost valuable information.
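Picking “a previous version before the infection took place” comes down to a simple selection rule. Here is a minimal Python sketch of it; the timestamps are hypothetical, and in practice the infection time is itself an estimate, which is why the test recovery described above still matters.

```python
from datetime import datetime
from typing import Optional


def latest_clean_version(versions: list[datetime], infected_at: datetime) -> Optional[datetime]:
    """Return the newest backup taken strictly before the infection time, if any."""
    clean = [v for v in versions if v < infected_at]
    return max(clean) if clean else None


# Hypothetical nightly backups and an estimated infection time.
BACKUPS = [
    datetime(2017, 1, 9, 1, 0),
    datetime(2017, 1, 10, 1, 0),
    datetime(2017, 1, 11, 1, 0),
]
INFECTED_AT = datetime(2017, 1, 10, 14, 30)

if __name__ == "__main__":
    print(latest_clean_version(BACKUPS, INFECTED_AT))
```

The function returns `None` when every stored version postdates the infection, which is the case a longer retention window is meant to prevent.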

Get an enterprise grade Disaster Recovery as a Service (DRaaS) solution.

A proper DRaaS solution will lock administrators and intruders out of the storage used for the backups and DR files, even though that storage still sits on the network. Management access to these files is granted only through the software/portal given to you by the solution provider, and no level of network administrative access will allow viruses like ransomware to infect the actual backup files.

A cloud-DRaaS solution wherein all backups are replicated offsite will allow a much faster recovery via cloud-based recovery of entire machines from which your users can continue work while a production environment is prepared for final recovery.

What a ransomware experience should be…

It’s been roughly 30 minutes since your tech began investigating. All critical servers have already been failed over to the cloud and verified to be virus free. You’ve been given an estimate of roughly 1 hour before your users will be reconnected and ready to work. You tell your staff to take an executive lunch but to be ready for work upon their return.

Infrascale + Google Cloud: Faster Failover Starts with a Faster Cloud

Today, we’re announcing a partnership with Google that pairs our cloud backup and disaster recovery-as-a-service solutions with the Google Cloud Platform. I know it’s a cliché with any partner announcement (one plus one equals three), but I think in this case there’s real proof in the pudding.

The Problems We’re Trying to Solve

Let’s start with some jaw dropping stats that are impacting organizations of all sizes, but are especially devastating to SMBs who often lack the IT resources and manpower to defend and quickly recover from prolonged periods of downtime.

  • Ransomware: Ransomware is the number one cyber threat on the planet. 70% of businesses hit by ransomware paid the hackers to regain access to systems and data. Of those attacked, 20% paid over $40,000 to retrieve data, while more than half paid more than $10,000. Source: IBM X-Force’s Ransomware, December 2016.
  • Downtime: At an estimated $700 billion in losses per year, the average cost of IT downtime is about $9,000 per minute for most midmarket enterprises (obviously this estimate will vary by the size and type of organization). Consider that complete unplanned outages, on average, last 66 minutes longer than partial outages. You can do the math and the impact is scary. Source: Ponemon Institute, Cost of Data Center Outages report, January 2016.
  • Data Loss: Data loss statistics can be chilling. Studies suggest that nearly 3 out of 4 companies lose critical data every year, from lost mission-critical software applications to lost virtual machines to lost critical files. Adding insult to injury, more and more organizations are being breached by cyber criminals and no location, industry or organization is immune from attack. Source: The Cost of Server, Application, and Network Downtime: North American Enterprise Survey and Calculator, IHS Inc., January 2016.
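For anyone who does want to do the math from the Downtime bullet, here is a short Python sketch. It uses only the two figures quoted above and assumes, as a simplification, that the per-minute cost is constant for the whole outage.

```python
COST_PER_MINUTE_USD = 9_000   # figure quoted above for midmarket enterprises
EXTRA_OUTAGE_MINUTES = 66     # how much longer a complete outage lasts, on average


def downtime_cost(minutes: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Estimated cost of an outage, assuming a flat per-minute rate."""
    return minutes * cost_per_minute


if __name__ == "__main__":
    extra = downtime_cost(EXTRA_OUTAGE_MINUTES)
    print(f"Extra cost of a complete outage: ${extra:,.0f}")  # $594,000
```

Swap in your own per-minute estimate to see what those extra 66 minutes would mean for your organization.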

To help address these threats to uptime, we’re integrating our data protection solutions with Google Cloud.

Backup to the Google Cloud

Infrascale has always offered broad OS support and a variety of cloud targets – whether they be public or private clouds. Now, Infrascale customers and the partners who serve them can replicate to Google Cloud.  Google Cloud Platform lets you focus on what’s next for your business and frees you from the overhead of managing infrastructure, provisioning servers and configuring networks. This powerful infrastructure improves the performance of backup and recovery for everyone.

Disaster Recovery in the Google Cloud

Infrascale Disaster Recovery protects your organization against server failures, site-wide disasters, and even ransomware attacks. It delivers guaranteed 15-minute failover of mission-critical applications in the event of a minor or major crash. In fact, our own testing for VM failover within the Google Cloud is lightning fast and measured in seconds, not hours.

You are probably skeptical, and you should be. But this is the type of performance you have to experience for yourself. Given the opportunity, we’ll combat your skepticism with real results, real performance, and real proof. Pairing our leading-edge CloudBoot technology with the Google Cloud Platform delivers eye-popping boot speeds. You can read about our failover technology, but a big part of the performance gains is the faster cloud that Google Cloud offers.

How can a public cloud deliver that kind of failover performance? 

There’s a lot of innovation that Google has baked into its data centers and worldwide fiber network that give it a leg up on other cloud infrastructures, including:

  • Sub-second Archive Restore: Google Cloud delivers sub-second data availability and provides high throughput for prompt restoration of data. Competing systems take 4-5 hours to do the same data archiving tasks, offer substantially lower throughput, and often charge confusing and expensive fees for restore.
  • Global load balancers that scale to 1 million+ users instantly: Google Cloud’s built-in load balancers are part of a worldwide distributed system for delivering enterprise-class infrastructure to organizations, big and small — the same system that supports Google Maps, Gmail, and YouTube.
  • Faster boot times: Google Cloud Compute Engine instances boot in the range of 40-50 seconds, roughly 1/5th of the time required by competing clouds.
  • Reduced Latency: Google’s global network footprint, with over 75 points-of-presence across more than 33 countries, ensures you receive the same low latency and responsiveness customers expect from Google’s own services.

Continued Cloud Evolution

In the early days, cloud platforms like Google focused their efforts on attracting start-ups and young, agile companies that were ripe for the cloud. This made sense, as their platforms offered these companies a quick and easy alternative to conventional, on-prem IT—as well as the ability to scale their operations without a lengthy procurement process.

But now the tables are turning. Midmarket companies and enterprises are waking up to the benefits of on-demand, pay-as-you-go infrastructure and following in the footsteps of the early adopters. That’s why this partnership is so exciting to us. We’re combining the speed and innovation of our own failover services with the power and performance of the Google cloud to protect an organization’s most valuable assets – uptime and data.

This partnership gives us the opportunity to equip organizations of all sizes with much improved operational resiliency and ransomware insurance that’s affordable, simple, and secure.

Will Ransomware Force You to Fire Your Customers?

Ransomware’s effect on IT service providers can be just as damaging as its effect on the businesses hit.

The imminent danger of ransomware is real.  Even those that don’t typically follow or cover tech news have probably heard of it and are rightfully concerned. In 2016, ransomware surpassed $1 billion in ransoms collected and inflicted $70 billion in downtime losses.  In fact, 7 out of 10 executives said they’ve been willing to pay up to get company data back (according to a recent IBM survey).

Sadly, 2017 is projected to be even worse. However, most SMBs need experts to lead them to safety.

Enter the IT pro.

IT professionals have a unique relationship with the businesses they serve. On one hand, IT pros are the only ones who can solve their problems: the expert saviors. On the other, they may come across as used car salesmen despite their best efforts to put the customers’ needs first.

Ransomware not only impacts the infected customer, but it also imposes a heavy tax on the MSPs that serve them. That’s why MSPs should re-evaluate their role with their clients and mandate certain data protection processes or risk their own financial well-being.

Here’s why:

  1. You can’t afford to spend limited resources on avoidable situations
  2. Everyone loses in an ‘I told you so’ moment
  3. You have more than one business to maintain

You can’t afford to spend limited resources on avoidable situations

When it comes to ransomware, a poorly designed or maintained disaster recovery (DR) solution can force IT professionals to spend collective man-weeks instead of mere hours to resolve the problem.  In some cases, organizations may not be able to even recover their data at all.

Here’s a real-life example.

In the first week of December 2016, we had a partner (let’s call them “ACME Solutions”) that had two different customers (A and B) hit with ransomware within the same week. Customer A did not have a DR solution at all, but did have some network backups. Customer B had recently deployed our DR as a Service (DRaaS) solution.

 

Customer | Systems Impacted            | DR Solution        | Man-Hours Involved
A        | Fileserver, database server | Traditional backup | 1,000
B        | Fileserver, database server | DRaaS              | 3

 

Customer A’s fileserver, database server, and network backups were all infected. Within three days, a partial recovery of the fileserver was complete. After seven days, the database server was still offline and unrecoverable. By day 10, less than 50% of the database had been recovered. ACME had two techs, a service manager, an account rep, and even their President engaged in the recovery of the data and the customer relationship. That’s five people spanning five departments for over a week.

Customer B’s fileserver and database server were infected. Within a matter of hours, a single technician recovered all the data and had the business back up and running. Let’s round it to three man-hours total, one person, one department.

Everyone loses in an “I told you so” moment

IT professionals understand that they’re responsible for all things IT, even when their clients don’t always heed their guidance. The situation with Customer A consumed ACME’s resources from sales, the technical team, and executive-level management. Even if ACME charged for every hour, which they certainly could not, they would have clearly lost money. In fact, it will probably take another year of service to break even with Customer A, assuming they remain a customer after the ransomware episode.

Beyond that, there’s also the word-of-mouth problem.  Will Customer A be a raving fan of ACME? Doubtful.  It’s unlikely Customer A will be tossing any referrals their way any time soon (even if they knew that they shared in some of the responsibility).

You have more than one business to maintain

Many IT service companies have 20-50 active customers at any given time. In the case of ACME, they had two different businesses struck with ransomware simultaneously. Imagine the impact if both customers had been set up like Customer A. This is what we call kicking a man when he’s down.

By the numbers, more than 50% of all businesses in the US have already been targeted by ransomware. And being targeted once doesn’t mean you won’t be targeted again. What if half of your customers were hit with ransomware in the same 6 months? Same month?

It’s time for a rethink. It’s time to mandate that clients adopt DRaaS as part of their monthly subscription.

Each customer should be set up so they can quickly fail over systems and recover data in the event of a server outage or ransomware infection. If all customers are configured like Customer B, partners and their clients can sleep at night. If all customers are set up like Customer A, how deep in the red is ACME willing to go before needing to downsize?

Ransomware doesn’t have to be scary

When organizations have an adequate backup and DR solution in place, everyone wins.

Cloud-based DRaaS solutions have come a long way in the last year in terms of functionality and affordability – in fact, cloud failover solutions are now affordable for most organizations.

The economics dictate that IT professionals take a hard look at their client portfolio. They can’t afford to let ransomware-susceptible customers put their own financial future at risk. It’s time for tough love. It’s time for IT pros to fish or cut bait, and this just might mean firing a few customers who are unwilling to adopt DR best practices.

There’s just too much at stake.

The Importance of Orchestration

The global DRaaS (Disaster Recovery as a Service) market has witnessed significant growth in recent years, and is predicted to grow at a compound annual rate of 36% between 2016 and 2024. This growth has predictably attracted new market entrants and changed the market dynamics.
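To put that compounded figure in perspective, here is a quick back-of-the-envelope calculation in Python. It assumes the 36% is an annual rate applied over eight yearly periods (2016 through 2024); the function name is only illustrative.

```python
def compound_growth(rate: float, years: int) -> float:
    """Growth multiple after compounding `rate` once per year for `years` years."""
    return (1 + rate) ** years


multiple = compound_growth(0.36, 2024 - 2016)

if __name__ == "__main__":
    print(f"Market size multiple by 2024: about {multiple:.1f}x")  # about 11.7x
```

Under those assumptions, the market would end the period at roughly eleven to twelve times its 2016 size.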

Uptime is an operational imperative. Any form of downtime, from an Exchange crash to a site-wide disaster (tornado, hurricane, flood) to a ransomware infection, can cost an organization dearly in terms of lost revenue and productivity. The right DRaaS solution with well-tested orchestration can dramatically reduce the amount of downtime and stress associated with these incidents.

That’s why it’s increasingly important to find objective measures to separate the contenders from the pretenders. One of the key differentiators is how solution providers deliver orchestration – the orderly recovery of a server environment during an outage. Orchestration ensures that critical servers, applications and their dependencies come online without incident. It’s important to understand exactly how your vendor plans to failover your applications, and then failback, in addition to how much customization and control you have in the orchestration process.

When it comes to unplanned downtime, an ounce of prevention is worth several pounds of cure.

When disaster strikes or critical systems crash, IT administrators have to be thoughtful about how — and in what order — they restore applications.  The order of operations is crucial for a seamless system restoration.  For example, if your environment utilizes a DHCP server to manage leases on your machines, this server would be among the first applications to be brought online, because of the importance of assigning IP addresses and providing configuration information.  You may also want your AD server to come online shortly thereafter, if not concurrently, to automate network management of user data, security, and distributed resources.

After you resuscitate these core systems you will want to restore your production workloads such as SQL Server, Exchange, and other mission-critical apps.  Then, you can boot your secondary applications. Order clearly matters, and orchestration sequencing is the means by which DRaaS solutions restore applications in a predetermined order.
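The sequencing described above is essentially a dependency ordering, which can be sketched with Python’s standard `graphlib` module (Python 3.9+). The machine names and the dependency map are hypothetical; in particular, making Active Directory wait for DHCP is one assumption, and an environment that boots them concurrently would simply drop that edge.

```python
from graphlib import TopologicalSorter

# machine -> set of machines it depends on (hypothetical environment)
DEPENDS_ON = {
    "dhcp": set(),
    "active-directory": {"dhcp"},
    "sql-server": {"active-directory"},
    "exchange": {"active-directory"},
    "intranet-wiki": {"sql-server"},
}


def boot_waves(depends_on):
    """Group machines into waves; everything in a wave can boot in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(boot_waves(DEPENDS_ON), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

With this map, the core services come up first, the production workloads follow together, and the secondary application boots last.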

Not all vendors treat orchestration equally; you have to uncover if, and how, your DRaaS vendor can deliver on this functionality. There are four core components of orchestration:

  • Runbooks: Most cloud recovery providers offer a simple disaster recovery runbook that describes the order in which your systems (VMs) should recover. The runbook defines a group of machines that are powered on (simultaneously) with a single command.  The real power of orchestration, however, is the ability to determine the actual order (not just a group of apps that boot simultaneously).  This is where scripting comes into play.
  • Scripting: The other half of orchestration is scripting. IT can create simple, customized scripts (basic commands) that execute more complex configuration for their runbooks. This includes everything required to execute a complete recovery. Scripts can also be used to ensure that machines without DHCP servers can be rebooted with their proper network configuration (such as IP and MAC addresses).
  • Testing: Another key component of orchestration is the ability to test the failover process and ensure the runbook and scripts work as expected. Unfortunately, many DRaaS vendors charge for DR tests and/or require formal disaster declarations to perform these tests. Increasingly, IT administrators are looking for a self-service failover solution that puts the control back in their hands. You’ll want to test your orchestration periodically after the initial setup. System variables continuously change (e.g., when you deploy new service packs), so it’s not a one-and-done activity.
  • Failback: After your production servers are running virtually, IT is freed up to rebuild your hardware in anticipation of application failback. Once the hardware has been properly configured (post disaster), then it’s time to restore applications and their operating systems. If it’s a physical machine, then you can use a USB drive or disk to recover from a preinstallation environment (PE). If it’s a virtual machine, you can simply push the guest back to its corresponding host. All of this can be done while capturing any changes made by users while working with the ‘booted’ image (during the outage).
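One inexpensive check to run before a full failover test is to confirm that the runbook’s stage order respects every known dependency. The Python sketch below does that on paper only; it does not boot anything, and the `stages` and `depends_on` structures are hypothetical stand-ins rather than any vendor’s runbook format.

```python
def validate_runbook(stages, depends_on):
    """Return a list of problems; an empty list means every dependency boots in an earlier stage."""
    stage_of = {machine: i for i, stage in enumerate(stages) for machine in stage}
    problems = []
    for machine, deps in depends_on.items():
        if machine not in stage_of:
            problems.append(f"{machine} is missing from the runbook")
            continue
        for dep in sorted(deps):
            if dep not in stage_of:
                problems.append(f"{machine} depends on {dep}, which is missing from the runbook")
            elif stage_of[dep] >= stage_of[machine]:
                problems.append(f"{machine} does not boot after its dependency {dep}")
    return problems


# Hypothetical example: the SQL server is grouped too early.
DEPS = {"dhcp": set(), "ad": {"dhcp"}, "sql": {"ad"}}
BAD_RUNBOOK = [["dhcp", "sql"], ["ad"]]

if __name__ == "__main__":
    for problem in validate_runbook(BAD_RUNBOOK, DEPS):
        print(problem)
```

Re-running a check like this whenever the environment changes is a lightweight complement to periodic failover tests, not a replacement for them.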

At Infrascale, we’ve invested in orchestration to make ours the easiest and most customizable DRaaS solution on the market. We enable runbooks to boot up specific VMs and groups of VMs, as well as custom/canned scripts to manage the boot sequence of applications, all based on your specific environment. But we’re taking this a step further. We’ve even built a simple “drag and drop” interface that lets you build out your orchestration sequencing. Users drag and drop applications and custom/canned scripts from a network tree view to create the desired workflow. We also offer unlimited testing so you can test and retest your orchestration with impunity.

As you give DRaaS solutions a closer look, it’s imperative to ask any prospective vendor how they manage the orchestration process. It’s important to go beyond simple DR runbooks to create a more comprehensive disaster recovery playbook.  When orchestration is well planned, coordinated, and tested, it can have a dramatic impact on reducing the amount of downtime for any type of micro- or macro-disaster.  And just as important, it will have a dramatic impact on your stress level, by giving you the confidence of knowing that you can recover from anything thrown your way.

You Can’t Have Business Agility without Operational Resiliency

These are turbulent times.

So, organizations must increasingly stay agile to identify and capture opportunities faster than their rivals.

In fact, nine out of ten executives ranked organizational agility both as critical to business success and as growing in importance over time. The benefits of enhanced agility include higher revenues, more satisfied customers and employees, improved operational efficiency, and a faster time to market. To survive and thrive these days, companies must quickly adapt, pivot, and course-correct.

Few business owners would disagree with this premise.

But, what these same owners may not fully grok is the connection between business agility and operational resiliency. They may not realize how business agility is tethered to system uptime and data protection. So, let me connect the dots and explain why this connection is so crucial in today’s age of data breaches, ransomware, and system downtime.

Let’s start with a definition of business resilience. Business resilience is the ability an organization has to quickly adapt to disruptions while maintaining continuous business operations and safeguarding people, assets and overall brand equity.

So, what are the biggest threats to operational uptime? Here are five of the most common factors that can bring a company’s operations to its knees.

  1. Ransomware. Ransomware is on the rise (almost half of US companies have been victimized) and can bring your systems to their knees. Servers, applications and databases may become infected by ransomware, viruses or malware and render the tools that your team needs inoperable.
  2. Failed Hardware. These types of failures often occur when defective or old hardware fails due to standard wear and tear.
  3. Improperly Scheduled Downtime. Planned network downtime is a business necessity and required to install new hardware, software, and updates. But, poorly timed or too frequent system updates can be costly and hinder team productivity.
  4. Human Error. Predictably, the biggest hazard to your network is human beings: human error is by far the most common culprit of network downtime. Accidentally shutting down the network, overloading circuits, and pulling out the wrong cord are common causes that you simply can’t plan for.
  5. Mother Nature. Earthquakes, hurricanes, tornadoes, fires and other acts of Mother Nature are always a real threat to your network uptime. But, these acts only account for about 5% of all system downtime.

Despite these growing threats to uptime and data loss, most organizations still just pay lip service to operational resiliency.  If resiliency is, in fact, a top IT priority then companies should be investing in the systems to protect uptime and safeguard against data loss. However, this does not appear to be the case. Consider these stats:

  • More than a third of companies don’t even test their backups
  • 42% of attempted recoveries from tape backups fail
  • More than 50% said their current backup solutions do not meet their needs
  • 21% say their backup was not current, reducing the likelihood of retrieving relevant data

There’s a clear disconnect between the need to stay agile and the requisite systems/processes needed for operational resiliency.  A disconnect that may not be obvious until an organization is hit with significant unexpected downtime.

The good news is that modern DRaaS (disaster recovery as a service) solutions are making it easy and affordable to bake operational resiliency into your procedures. These solutions are simple to implement, easy to use, and automate the backup process. They enable IT to recover from system outages in seconds vs. days or weeks and cost a fraction of traditional hardware-centric availability solutions.

As 2016 comes to an end, it’s time to make DRaaS a new year’s resolution.

Why is DRaaS a DRaaS-tically better defense against ransomware?

If your production database or mission-critical application gets infected, how long would it take you to recover?

When dealing with today’s ransomware threats, time is your worst enemy. The faster you can detect the encryption, the more time you have to take actions to restore your database or mission-critical application. Simple backup procedures will let you restore your production database, but it will take significantly more time than a modern disaster recovery as a service (DRaaS) solution.

Compare the process of restoring a production database from a cloud backup vs. a modern DRaaS solution:

[Figure: comparison of the restore process from a cloud backup vs. a modern DRaaS solution]

Here are three tangible ways DRaaS can pay big dividends and quickly restore systems in the wake of a ransomware attack:

  1. Dramatically Faster RTOs. DRaaS solutions equip you with the ability to quickly fail over production systems by spinning up VMs or images in the cloud (or on a local appliance) in minutes. Restoring your files from a clean backup will take 4-5 hours, and that’s if the stars align. That’s a big difference, minutes vs. hours, and that difference can be catastrophic depending on the business and the transactions feeding your production databases.
  2. Quickly Pinpointing the Time of Infection. With a cloud backup, it takes a while to determine if your application has been corrupted. Admins must download the application files from the cloud (based on your most recent backup), rebuild, and then compile the database or application. If the application runs, then you know you have restored a clean copy; otherwise, you need to go back to your next most recent backup and recompile. This can take hours. With DRaaS, admins can boot a production server and immediately verify whether the application is infection-free. If it successfully boots, then you have a clean image. This takes the guessing game out of “Is this a clean backup?”
  3. Built-in Orchestration. Restoring applications and production databases from a backup requires some planning and coordination. Leading DRaaS solutions include built-in failover orchestration that lets you create predetermined failover plans for a group of replicated VMs, which can be set to boot simultaneously or in a specific order.
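
The “go back to your next most recent backup” routine from the second point can be sketched in a few lines of Python. This is an illustration only, not a vendor API: the backup records, the INFECTED_AT timestamp, and the simulated_boot_check function are made up, and in practice the check would be whatever boot or application test you actually run against a restored image.

```python
from datetime import datetime


def latest_clean_backup(backups, boots_cleanly):
    """Walk backups from newest to oldest; return the first that passes the check."""
    for backup in sorted(backups, key=lambda b: b["taken_at"], reverse=True):
        if boots_cleanly(backup):
            return backup
    return None  # no clean backup found


# Made-up example data: three nightly backups and an assumed infection time.
BACKUPS = [
    {"id": "mon", "taken_at": datetime(2016, 12, 5, 23, 0)},
    {"id": "tue", "taken_at": datetime(2016, 12, 6, 23, 0)},
    {"id": "wed", "taken_at": datetime(2016, 12, 7, 23, 0)},
]
INFECTED_AT = datetime(2016, 12, 7, 9, 30)


def simulated_boot_check(backup):
    """Stand-in for a real boot test: anything taken before the infection is clean."""
    return backup["taken_at"] < INFECTED_AT


if __name__ == "__main__":
    clean = latest_clean_backup(BACKUPS, simulated_boot_check)
    print(clean["id"] if clean else "no clean backup")
```

The logic is identical for a cloud backup and for DRaaS; what changes is the cost of each check, which is hours of download and rebuild in one case and a quick boot in the other.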

Want to learn more about how Infrascale is helping organizations cost-effectively defend against ransomware? Learn more here. If you’re a channel partner looking for more information about our ransomware capabilities, we encourage you to check out www.AreYouASoftTarget.com.

 

 

After 10 years free of tape backup abuse, I’m a changed man

A recovery story

For nearly 10 years, I used tape backup and archival systems.

When I was first introduced to tape, it was uncomfortable. It was clunky and complicated. It felt like there had to be something better. I suggested moving to a more efficient disaster recovery (DR) system, but, at the time, all the options were far too expensive and my budget requests were denied. Eventually, I grew to accept that tape backups (and time-consuming recoveries) were just the way IT worked.

In fact, after the first year, I became dependent on the tape system as my main responsibility, my precious.  Every day was consumed by the same monotony—remove, label, store, replace. My entire value to IT had been defined by my tape use and management. At first, the pitfalls of tape affected my work, but everyone seemed to painfully accept that this was the way IT worked. Then, the tape addiction began intruding on my personal life—late hours, broken plans, sleepless nights—but I reassured myself that tapes weren’t the problem. That is, until I lost my job. I hit rock bottom. I sought help. As soon as I got tape-free, my life turned around—almost immediately.

Life on Tapes

If you’re like me, responsible for managing your tape backup system, this is what a typical week looks like:

  • Monday. This is the official tape drive cleaning and offsite day. You grab your favorite cup of coffee and hope that you have green lights from the backup that ran on Friday night, so all you have to do is take it out and put in the cleaner tape before beginning your day.

    This is done to clear out dust and keep your tape drive hardware running at its 20% reliability rate. This isn’t very hard; you’re only away from your desk for an hour or so. You return to your desk by mid-morning to begin going through unanswered requests while checking vitals on the rest of your systems.

  • Monday – Friday. Just about every day, a coworker, manager or boss calls fretting about a lost file, typically an email. The sky is falling and you’re supposed to drop everything to recover it. Your first, unintentional, response is to let out a long, drawn-out sigh. Coworkers love this. How important is it? You spend the next 3-4 hours in the server room, recovering the file. You miss lunch and settle for some empty calories found in a candy bar you keep at your desk for such occasions. Meanwhile, other, more pressing requests are piling up at your desk.
  • Beginning of Each Day. When the business day starts, it’s time to verify last night’s backup, remove and order it into your labeled tape area, then replace it with a tape next set to be overwritten.  We used a tray stored in a cabinet, near the tape drives in the server room. If you’re lucky, you convince the office manager to help you out by doing the tape switch so you can respond to other requests. Hopefully, they will preserve the organization you’ve established as part of the tape routine.
  • Friday. Friday is full-backup day, when you switch the tapes as usual, but this time you’re going to change the backup to run a full instead of an incremental. This also means you need to take the last full backup offsite. We stored tapes in a safety deposit box. Because Fridays are always busier, you have to leave early to beat the traffic, again leaving important requests unanswered. Sometimes, there’s just no time in the schedule to take the tape offsite, so it sits safely in your car until Saturday.
  • Occasional Saturday Night Panic. You get a distress “9-1-1” call from the boss saying that everything is down. There’s been a hardware failure on the company’s email and SQL server. You ditch whatever personal plans you had and drive to the safety deposit box to pick up the last full backup tape, then drive onsite to get to work on the system recovery. You arrive in an hour or two. If you dodge the near 40% failure rate of tape recoveries, and have spare hardware available, you just might be done by midday Sunday. Throughout the night you update your friends or family that you won’t be home tonight and aren’t sure when you’ll be finished. If they’re good friends, maybe they’ll come visit you with a coffee and snack in the morning to keep you company.

The glaring problem here is that the whole system blocks productivity—for IT, for coworkers, for the business. From the slow response and recovery times, to the manual offsite storage and retrieval, to the requirement to be away from your desk just to check statuses.  Your system’s limitations have nothing to do with you. You’re just working with what you have.

How IT Ends

The final straw comes when that Saturday night panic call happens, you go through the motions—but there’s nothing you can do except send the tapes to an expensive and slow, physical data recovery firm.

Of course, you help users with other problems like troubleshooting systems, deploying patches, running updates, managing the network—but this single, critical system is what defines your entire value; the backups, tape backups.

When IT gets better

At some point, you realize that this lifestyle and reliance on yesterday’s technology is not smart, efficient, or soul-enriching.  It shouldn’t take 4 hours to restore a file — it should take 4 seconds.  At this point, you start hearing some noise about the cloud and more automated, efficient approaches to DR.

All the experts are marching in the same direction and recommending modern DR solutions. Many suggest leveraging the cloud to make backups and DR more affordable, more reliable and easier to use.

When you first start working with this new paradigm, your life changes immediately and dramatically:

  1. Monday. A day like any other. You check the status of all your backup systems from your laptop.  You don’t even need to be onsite. Your focus is on productive activities…
  2. Monday-Friday. This part doesn’t change. Every day, coworkers still request email and file recoveries. Except this time, they’re not as stressed because they know you can easily recover their file(s) in a few clicks. You don’t even need to put down your morning cup of joe.
  3. End of Each Day. Every day, your backups are replicated offsite on a schedule of your choosing. You don’t need to do anything. Instead, you focus on productive, meaningful work.
  4. Saturday Night Panics. Oh, these still happen. But they don’t cause the same night sweats. You just log in from your laptop, see that your backups are there, and either boot up your machines in less than 15 minutes or recover the necessary files to the appropriate machines. You feel good, and so does your boss. Now, go enjoy the rest of your night, Mr. Hero.

Since finding the right tool for the job, you spend more time being an asset to the company than ever before. You have more time to help those around you. You have more time to find other areas of the business that can be improved. Your “wish list” of things to get done starts getting done. You’re breaking fewer personal engagements and sleeping better. Suddenly, you’re called into business meetings to talk about efficiency: new tools, new training, new productivity. Your voice is heard and respected. You are not simply a cost center. You’re happier.

Who would have thought that being dependent on tapes would have such a profound effect? I mean, everyone was doing tape, and tape was normal.  IT was the way IT had always been, until IT wasn’t.

Here’s to the new normal.