Tag Archives: disaster recovery

Nashville Bombing Part 2

As I said last week, while the bombing is a horrible event, it does point out how brittle our telecommunications world is. That being said, for most companies, the rest of the IT infrastructure is probably more brittle.

Companies should use this as an opportunity to review their situation and see if they can make improvements at a price that is affordable.

While AT&T was able to strike a deal with the City of Nashville to commandeer Nissan Stadium, home of the Titans, to set up a replacement central office, you probably would not get the same treatment if you asked.

AT&T was also able to deploy 25 tractor trailers of computer equipment to replace the equipment that was damaged.

Finally, AT&T was able to temporarily reassign personnel with every skill that they might possibly need from fiber techs to computer programmers. Again, you likely would not be able to do that.

The question for you to “game out” is: who are your critical vendors and what would you do if they had a meltdown? I don’t mean a 30-minute outage, I mean a meltdown. We have seen, for example, tech companies that have gotten hit by ransomware.

Perhaps, like many companies, you use a managed service provider or MSP. A number of MSPs have been hit by ransomware and when they are, often so are their customers. Does your MSP have the resources to defend all (or most) of its customers from a ransomware attack at once? How long would it take your MSP to get you back up and working? Even large MSPs (which equals many customers) likely don’t have the resources.

If that were to happen to you – and of course, they have the only copies of your data, right? – what would they do and what would you do?

Maybe your servers are hosted in your office. There are a lot of possible events that could occur.

Even if your servers are in a colo, things can occur that can take you down.

Here is one thing to start with –

For each key system, from personnel systems to public websites, both internal and at third parties, document your RECOVERY TIME OBJECTIVE or RTO. The RTO is the maximum acceptable downtime before the system is recovered. For example, for payroll, it might be 24 hours. But what if the outage happens at noon on the day that payroll must be sent to your bank? So, think carefully about what the maximum RTO is and remember that it will likely be different for different systems.

Then, for each system, document the RECOVERY POINT OBJECTIVE or RPO. The RPO is the point in time, counting backward from the event, up to which you are willing to lose data. For example, if this is an ecommerce system, maybe you are willing to lose 30 minutes’ worth of orders. Or maybe 5 minutes. If it is an accounting system, maybe it is 8 hours (rekeying one day’s worth of AR and AP may be considered acceptable). Again, each system will likely be different.
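If you want to keep this inventory somewhere more structured than a spreadsheet, here is a minimal sketch of what it might look like. The systems, owners and numbers below are made up for illustration; yours will be different.

```python
from dataclasses import dataclass

@dataclass
class SystemObjective:
    """One row of the recovery objectives inventory."""
    name: str         # the system being protected
    owner: str        # line of business that owns the decision
    rto_hours: float  # maximum acceptable downtime
    rpo_hours: float  # maximum acceptable data loss, measured in time

# Hypothetical entries only -- replace with your own systems and numbers.
objectives = [
    SystemObjective("payroll", "HR/Finance", rto_hours=24, rpo_hours=8),
    SystemObjective("ecommerce site", "Sales", rto_hours=1, rpo_hours=0.5),
    SystemObjective("accounting (AR/AP)", "Finance", rto_hours=48, rpo_hours=8),
]

for o in objectives:
    print(f"{o.name:20} owner={o.owner:12} RTO={o.rto_hours}h  RPO={o.rpo_hours}h")
```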

Then get all of the lines of business, management and the Board (if there is one) to agree on those times. Note that shorter RTOs and RPOs mean increased cost. The business units may say that they are not willing to lose any data. If you tell them that you can do that, but it will cost them a million dollars a year, they may rethink that. Or management may rethink that for them. The key point is to get everyone on the same page.

Once you have done that, make a list of the possible events that you need to deal with.

  • Someone plants a bomb in an RV outside your building and there is severe physical damage to your building.
  • Or maybe the bomb is down the block, but the force of the blast damages the water pipes in your building.
  • Or, the bomb is down the block and there is no damage to your building, but the city has turned off water, power and gas to the building. And the building is inside a police line and will be inaccessible while the police try to figure out what is going on.
  • In the case of AT&T, they had to pump three FEET of water out of the building. Water and generators are not a good mix. Neither are water and batteries. While AT&T lost their generators as a result of the blast, their batteries were distributed around the building so they did not lose ALL of their batteries.

Note that you do not need to think up all the scenarios yourself. You can look at the news reports and after-action reports from other big, public meltdowns. Here is another article on the Nashville situation.

Now create a matrix of events and systems for your RTO and RPO numbers. In the intersection box, you can say that you already can meet those objectives or that it will cost $1.29 one time to meet it or a million dollars a year. You need to include third party providers if they run and manage any systems that are critical to you.
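As a sketch of what that matrix might look like in practice, the fragment below pairs hypothetical events with hypothetical systems; any cell not yet filled in shows up as “not yet assessed.” The events, systems and cost figures are placeholders for illustration, not recommendations.

```python
# Hypothetical events and systems -- substitute your own lists.
events = ["building inaccessible", "MSP hit by ransomware", "cloud region outage"]
systems = ["payroll", "ecommerce site", "accounting (AR/AP)"]

# Each cell: either "already meets objectives" or an estimated cost to close the gap.
matrix = {
    ("payroll", "building inaccessible"): "already meets objectives (hosted payroll)",
    ("payroll", "MSP hit by ransomware"): "$15k one time (offline copy of exports)",
    ("ecommerce site", "cloud region outage"): "$100k/year (multi-region failover)",
}

for system in systems:
    for event in events:
        status = matrix.get((system, event), "not yet assessed")
        print(f"{system:20} | {event:25} | {status}")
```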

Once you have done all that, you can go back to management and the lines of business and tell them here is the reality – what risk are you willing to accept? This is NOT an IT problem. This is a business problem.

The business will consider the likelihood of the event – even after Nashville, an RV filled with explosives is an unlikely event and the cost to mitigate the problem is likely high. For some systems the cost may be low enough and the risk high enough that management says fix it. For other systems, probably not.

The key point is that everyone from the lines of business to management to the Board all understand what the risks are and what the mitigation costs are. From this data, they can make an informed BUSINESS decision on what to do.

If you need help with this, please contact us.

The Cloud Adds New Security Risks

Yesterday’s double trouble outage should remind businesses that planning for outages and continuing to operate is not optional.

The first outage was at Microsoft, where its Active Directory services had problems. Active Directory is used to “authenticate” users and services, so if it doesn’t work, not much else does.

The good news is that it happened towards the end of the work day (around 5:30 PM Eastern time, for about 3 hours or so), so some of the pain was deflected. This particular type of outage is hard to build redundancy for because it affected the behind-the-scenes infrastructure.

The second trouble was when 911 services in many communities in 14 states went down around 4:30 PM Mountain time. There was some question about whether the two outages were related, but based on what we are hearing, that is not the case. Losing 911 services is slightly more important than, say, losing access to Twitter, even though the current occupant of the White House might disagree with that.

Like many companies, Public Safety Answering Points or PSAPs, which is the technical name for 911 call centers, have outsourced some or all of their tech. Both companies involved with yesterday’s 911 outage have recently changed their names – likely to shed the reputations they had before. The company that the PSAPs contract with is Intrado, formerly known as West Safety Communications. Intrado says their outage was the fault of one of their vendors, Lumen. Many of you know Lumen as the company formerly known as CenturyLink (actually, it is a piece of CenturyLink).

The bottom line here is that whether you are a business selling or servicing widgets or a 911 operator, you are dependent on tech and more and more, you are dependent on the cloud. You are also dependent on third parties.

You need to decide how long you are willing to be down and how often. In general, cloud services are reliable. Some more than others. But you have lost some insight into tradeoffs being made by virtue of moving to the cloud and using third party vendors. These vendors are trying to save money. While you might agree with their decisions, you are never consulted and likely never informed.

You may be okay with this, but it should be a conscious decision, not something that happens accidentally.

Do you have a disaster recovery plan? Or a business continuity plan? When was it last tested? Are you happy with the results?

These outages were relatively short-lived. For most people the Microsoft outage affected them for around 3-4 hours. For the 911 outage, it lasted for around 1-2 hours. But many of these outages have lasted much longer than that.

Have you asked your vendors (cloud or otherwise) about their plans? Do you believe them? Are there meaningful penalties in the contract to cover your losses and your customers’ losses? Are you okay with the inevitable outages?

Consider this outage an opportunity. Credit: Brian Krebs

Who Wants to Hear Fiction About System Recovery Time

A survey of small and medium size businesses asked executives about their Recovery Time Objectives or RTOs. A company’s RTO represents the maximum amount of time a system, such as a web site, can be down after an incident. The incident could be a software error, hardware failure, ransomware attack or many other things. Here are some of the answers they got.

  • 92% of SMB executives said they believe their businesses are prepared to recover from a disaster.

My first question for these executives is when was the last time you TESTED that preparation and what was the result? My guess is that the primary answer will be that it has never been tested.

  • 20 percent say that they do not have a data backup or disaster recovery solution in place.

If so, how are they prepared to recover?

  • 16% of executives say that they do not know their own recovery time objectives, but 24% expect to recover in less than 10 minutes and 29% expect to recover in less than an hour.

So, while 20% don’t even have a data backup solution in place, more than half expect to recover in less than an hour.

The results are from a survey of 500 SMB execs, 87% of whom were CEOs.

  • Of those who said they knew what their RTOs are, 9% said it was less than one minute, 30% said it was under an hour and 17% said it was under a day.

Compare that to recent ransomware attacks. Atlanta took several months to recover. Travelex was down for over a month.

How do all of these SMB execs figure they are smarter than these guys who took weeks and months to recover?

Another problem is that people don’t agree on what the definition of a disaster is. Is it recovering from a data loss or recovering from a malware attack or the ability to become operational quickly or what?

Bottom line – executives need to understand this recovery thing because experience tells me that it takes way longer to recover than people seem to think it does. And, for most companies, if their systems are down, they are not making money and are spending money.

If executives think they have a handle on this – conduct a mock disaster drill and see how long recovery takes. For most companies it will not be 10 minutes or an hour.
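One way to keep the drill honest is to record the start and restore times for each system and compare them against the RTOs the business agreed to. Here is a minimal sketch, with made-up system names and targets:

```python
import time

# Hypothetical RTOs (in hours) that the business agreed to, per system.
agreed_rto_hours = {"payroll": 24, "ecommerce site": 1}

def score_drill(system: str, started: float, restored: float) -> None:
    """Compare the measured recovery time from a drill against the agreed RTO."""
    measured_hours = (restored - started) / 3600
    target = agreed_rto_hours[system]
    verdict = "within RTO" if measured_hours <= target else "MISSED RTO"
    print(f"{system}: recovered in {measured_hours:.1f}h (target {target}h) -> {verdict}")

# Example drill result: the ecommerce site took 3.5 hours to restore.
start = time.time()
score_drill("ecommerce site", started=start, restored=start + 3.5 * 3600)
```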

Need some help figuring this out? Contact us. Credit: Help Net Security

Country of Georgia Hacked

Well, it seemed like the whole damn country.

Over 15,000 websites were hacked, including, not surprisingly, newspapers, government offices and TV stations.

After the sites were defaced by the hackers, they were taken offline.

Newspapers said it was the biggest attack in the country’s history, even bigger than the 2008 attack by Russia.

This attack even affected some of the country’s courts and banks.

Needless to say, and based on the history with Russia, there was some panic around.

However, a web hosting company, Pro-service, admitted that its network was attacked.

By late in the day more than half of the sites were back online and they were working on the rest.

The hackers defaced the sites with a picture of former president Mikheil Saakashvili, with the text “I’ll be back” overlaid on top.

Saakashvili is now in exile in Ukraine but was generally thought to be anti-corruption, so it is unlikely that Russia did it this time; it does seem to be politically motivated, though.

At least two TV stations went off the air right after the attack.

Given that Georgia (formerly known as the Republic of Georgia) is not vital to you and me on an everyday basis, why should we care?

The answer is that if it could be done there, it could be done here too.  Oh.  Wait.  They already did that (see here).  In that case, it was the Chinese and the damage was much greater.

The interesting part for both the Chinese attack on us and the <whoever did it> attack on Georgia is that one attack on a piece of shared infrastructure can do an amazing amount of damage.

Think about what happens when Amazon, Microsoft or Google go down – even without a cyberattack.

The folks in DC are already planning how to respond to an attack on shared infrastructure like banking, power, water, transportation and other critical infrastructure.  You and I don’t have much ability to impact that part of the conversation, but we do have impact on our own infrastructure.

Apparently this attack was pretty simple and didn’t do much damage, but that doesn’t mean that some other attack will also be low tech or do little damage.  What if an attack disabled one or a few Microsoft or Amazon data centers?  Microsoft is already rationing VMs in US East 2 due to lack of capacity.  What would happen if they lost an entire data center?

This falls under the category of disaster recovery and  business continuity.  Hackers are only one case, but the issue of shared infrastructure makes the impact much greater.  If all of your servers were in your office like they used to be, then attacks would be more localized.  But there are many advantages to cloud infrastructure, so I am not suggesting going back to the days of servers in a closet.

Maybe Microsoft or Amazon are resilient enough to withstand an attack (although it seems like self inflicted wounds already do quite a bit of damage without the help of outside attackers), but what about smaller cloud providers?

What if one or more of your key cloud providers had an outage?  Are you ready to handle that?  As we saw with the planned power outages in California this past week, stores that lost power had to lock their doors because their cash registers didn’t work.  Since nothing has a price tag on it any more, they couldn’t even take cash – assuming you could find a gas station to fill your car or an ATM to get you that cash.

Bottom line is that shared infrastructure is everywhere and we need to plan for what we are going to do when (not if) that shared infrastructure takes a vacation.

Plan now.  The alternative may be to shut the doors until the outage gets fixed and if that takes a while, those doors may be locked forever.

The Cloud is NOT Disasterproof – Are You?

Over the weekend, Google suffered an outage that lasted about 4 hours. (See Google Appstatus Dashboard)

The good news is that the outage happened on a Sunday afternoon, which reduced its impact.  Next time it could happen on a Monday morning instead.

The outage took down virtually every Google service at some point during the outage.

But worse than that, it took down all of those companies that depend on one Google service or another.  Examples include Snapchat, Shopify and Discord; even a number of Apple services went down because Apple is not in the data center business.  iCloud mail and drive and iMessage were all affected.

This is not to beat up on Google.  Both Amazon and  Microsoft have had similar meltdowns and so have much smaller providers.

And they will again.  Human beings design computers, build computers and operate them.  And, after all, humans are, well, just human.

One more time, this is a lesson for users of cloud services.  

Maybe you can deal with a 4 hour outage on a Sunday.

But can you deal with an 8 hour or 24 hour outage on a Wednesday (like Microsoft had recently)?

What is the cost in lost productivity when users can’t get to their email or their office documents?

What is the impact to your customers if they can’t get to your service?  Will they move to a competitor?  And stay there?

I am not proposing any solution.  What I am proposing is that you consider the impact of an outage like this.  Impact on both YOU and your CUSTOMERS.

Then you need to consider what the business risk is of an inevitable outage and what your business continuity plan is.  Will your BC plan sufficiently mitigate the risk to a level that is acceptable to your company?

Finally, you need to look at your Vendor Cyber Risk Management program.  

Apple’s systems went down on Sunday NOT due to anything Apple did, but rather something their vendor (Google) did.

At this point Google has not said what happened, but they said they will provide an after action report soon.  But, remember, this is not, ultimately, a Google problem, but rather a problem with cloud consolidation.  When there are only a handful of cloud providers hosting everything (3 tier one providers — Google, Microsoft and Amazon) and a slightly larger handful of tier two providers, if one of them burps, a lot of companies get indigestion.

Source: Vice 

 

News Bites for the Week Ending November 30, 2018

Microsoft Azure and Office 365 Multi-Factor Authentication Outage

Microsoft’s cloud environment had an outage this week for the better part of a day, worldwide.  The failure stopped users who had turned on two-factor authentication from logging in.

This is not a “gee, Microsoft is bad” or “gee, two factor authentication is bad” problem.  All systems have failures, especially the ones that businesses run internally.  Unfortunately cloud systems fail occasionally too.

The bigger question is: are you prepared for that guaranteed, sometime-in-the-future failure?

It is a really bad idea to assume cloud systems will not fail, whether they are a niche, industry-specific application or a generic platform like Microsoft or Google.

What is your acceptable length for an outage?  How much data are you willing to lose?

More importantly, do you have a plan for what to do in case you pass those points of no return and have you recently tested those plans?

Failures usually happen when it is inconvenient, and planning is critical to dealing with them.  Dealing with an outage absent a well thought out and tested plan is likely to be a disaster. Source: ZDNet.

 

Moody’s is Going to Start Including Cyber Risk in Credit Ratings

We have said for a long time that cyber risk is a business problem.  Business credit ratings represent the overall risk a business represents.

What has been missing is connecting the two.

Now Moody’s is going to do that.

While details are scarce, Moody’s says that they will soon evaluate organizations’ risk from a cyber attack.

Moody’s has even created a new cyber risk group.

While they haven’t said so yet, likely candidates for initial scrutiny of cyber risk are defense contractors, financial, health care and critical infrastructure.

For companies that care about their risk ratings, make sure that your cybersecurity is in order along with your finances.  Source: CNBC.

 

British Lawmakers Seize Facebook Files

In what has got to be an interesting game, full of innuendo and intrigue, British lawmakers seized documents sealed by a U.S. court when the CEO of a company that had access to them visited England.

The short version of the back story is that the Brits are not real happy with Facebook and were looking for copies of documents that had been part of discovery in a lawsuit between app maker Six4Three and Facebook that has been going on for years.

So, when Ted Kramer, founder of the company, visited England on business, Parliament’s Serjeant-at-Arms literally hauled Ted into Parliament and threatened to throw him in jail if he did not produce the documents sealed by the U.S. court.

So Ted is between a rock and a hard place;  the Brits have physical custody of him;  the U.S. courts could hold him in contempt (I suspect they will huff and puff a lot, but not do anything) – so he turns over the documents.

Facebook has been trying to hide these documents for years.  I suspect that Six4Three would be happy if they became public.  Facebook said, after the fact, that the Brits should return the documents.  The Brits said go stick it.  You get the idea.

Did Six4Three play a part in this drama in hopes of getting these emails released?  Don’t know but I would not rule that out.  Source: CNBC.

 

Two More Hospitals Hit By Ransomware

The East Ohio Regional Hospital (EORH) and Ohio Valley Medical Center (OVMC) were both hit by a ransomware attack.  The hospitals reverted to using paper patient charts and are sending ambulances to other hospitals.  Of course they are saying that patient care isn’t affected, but given that they have no information available regarding patients currently in the hospital, their diagnoses, tests or prior treatments, that seems a bit optimistic.

While most of us do not deal with life and death situations, it can take a while – weeks or longer – to recover from ransomware attacks if the organization is not prepared.

Are you prepared?  In this case, likely one doctor or nurse clicked on the wrong link;  that is all it takes.  Source: EHR Intelligence.

 

Atrium Health Data Breach – Over 2 Million Customers Impacted

Atrium Health announced a breach of the personal information of over 2 million customers, including Social Security numbers for about 700,000 of them.

However, while Atrium gets to pay the fine, it was actually the fault of one of their vendors, Accudoc, which does billing for their 44 hospitals.

Atrium says that the data was accessed but not downloaded and did not include credit card data.  Of course if the bad guys “accessed” the data and then screen scraped it, it would not show as downloaded.

One more time – VENDOR CYBER RISK MANAGEMENT.  It has to be a priority.   Unless you don’t mind taking the rap and fines for your vendor’s errors.   Source: Charlotte Observer.