Category Archives: Business Continuity

Nashville Bombing Part 2

As I said last week, while the bombing is a horrible event, it does point out how brittle our telecommunications world is. That being said, for most companies, the rest of the IT infrastructure is probably more brittle.

Companies should use this as an opportunity to review their situation and see if they can make improvements at a price that is affordable.

While AT&T was able to strike a deal with the City of Nashville to commandeer Nissan Stadium, home of the Titans, to set up a replacement central office, you probably would not get the same treatment if you asked.

AT&T was also able to deploy 25 tractor trailers of computer equipment to replace the equipment that was damaged.

Finally, AT&T was able to temporarily reassign personnel with every skill that they might possibly need from fiber techs to computer programmers. Again, you likely would not be able to do that.

The question for you to “game out” is: who are my critical vendors, and what would I do if one of them had a meltdown? I don’t mean a 30-minute outage, I mean a meltdown. We have seen, for example, tech companies that have gotten hit by ransomware.

Perhaps, like many companies, you use a managed service provider or MSP. A number of MSPs have been hit by ransomware, and when they are, often so are their customers. Does your MSP have the resources to defend all (or most) of its customers from a ransomware attack at once? How long would it take your MSP to get you back to working order? Even large MSPs (which means many customers) likely don’t have the resources.

If that were to happen to you – and of course, they have the only copies of your data, right? – what would they do and what would you do?

Maybe your servers are hosted in your office. There are a lot of possible events that could occur.

Even if your servers are in a colo (a colocation facility), things can occur that can take you down.

Here is one thing to start with –

For each key system, from personnel to public web sites, both internal and at third parties, document your RECOVERY TIME OBJECTIVE or RTO. The RTO is the maximum acceptable downtime before the system must be recovered. For example, for payroll, it might be 24 hours. But what if the outage happens at noon on the day that payroll must be sent to your bank? So think carefully about what the maximum acceptable RTO really is, and remember that it will likely be different for different systems.

Then, for each system, document the RECOVERY POINT OBJECTIVE or RPO. The RPO is how much data, measured as time counting backward from the event, you are willing to lose. For example, if this is an ecommerce system, maybe you are willing to lose 30 minutes’ worth of orders. Or maybe 5 minutes. If it is an accounting system, maybe it is 8 hours (rekeying one day’s worth of AR and AP may be considered acceptable). Again, each system will likely be different.
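
If it helps to make this concrete, here is a minimal sketch in Python of one way to write these numbers down. Every system name, owner and value here is a hypothetical placeholder to be negotiated with the business, not a recommendation.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class RecoveryObjective:
    system: str
    rto: timedelta   # maximum acceptable downtime before the system must be restored
    rpo: timedelta   # maximum acceptable window of lost data, counting back from the event
    owner: str       # line of business that has to sign off on these numbers

# Hypothetical starting point -- placeholders, not recommendations.
objectives = [
    RecoveryObjective("payroll",     rto=timedelta(hours=24), rpo=timedelta(hours=8),    owner="HR/Finance"),
    RecoveryObjective("ecommerce",   rto=timedelta(hours=1),  rpo=timedelta(minutes=30), owner="Sales"),
    RecoveryObjective("accounting",  rto=timedelta(hours=48), rpo=timedelta(hours=8),    owner="Finance"),
    RecoveryObjective("public_site", rto=timedelta(hours=4),  rpo=timedelta(hours=24),   owner="Marketing"),
]

for o in objectives:
    print(f"{o.system:12} RTO={o.rto} RPO={o.rpo} owner={o.owner}")
```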

Then get all of the lines of business, management and the Board (if there is one) to agree on those times. Note that shorter RTOs and RPOs mean increased cost. The business units may say that they are not willing to lose any data. If you tell them that you can do that, but it will cost them a million dollars a year, they may rethink that. Or management may rethink that for them. The key point is to get everyone on the same page.

Once you have done that, make a list of the possible events that you need to deal with.

  • Someone plants a bomb in an RV outside your building and there is severe physical damage to your building.
  • Or maybe the bomb is down the block, but the force of the blast damages the water pipes in your building.
  • Or, the bomb is down the block and there is no damage to your building, but the city has turned off water, power and gas to the building. And the building is inside a police line and will be inaccessible while the police try to figure out what is going on.
  • In the case of AT&T, they had to pump three FEET of water out of the building. Water and generators are not a good mix. Neither are water and batteries. While AT&T lost their generators as a result of the blast, their batteries were distributed around the building so they did not lose ALL of their batteries.

Note that you do not need to think up all the scenarios yourself. You can look at the news reports and after-action reports from other big, public meltdowns. Here is another article on the Nashville situation.

Now create a matrix of events and systems with your RTO and RPO numbers. In each intersection box, you can note that you can already meet those objectives, or that it will cost $1.29 one time to meet them, or a million dollars a year. You need to include third party providers if they run and manage any systems that are critical to you.
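
Here is a minimal sketch, again in Python, of what that matrix could look like as data. The systems, events and dollar figures are invented placeholders; the point is the shape of the exercise, not the numbers.

```python
# Each cell answers: can we already meet the agreed RTO/RPO for this
# system under this event, and if not, what would it cost to get there?

systems = ["payroll", "ecommerce", "public_site"]
events = ["building inaccessible", "MSP hit by ransomware", "colo power loss"]

matrix = {
    ("payroll", "MSP hit by ransomware"): {"meets_objectives": False, "one_time_cost": 25_000, "annual_cost": 5_000},
    ("ecommerce", "colo power loss"):     {"meets_objectives": True,  "one_time_cost": 0,      "annual_cost": 0},
    # ... fill in the remaining cells the same way
}

for system in systems:
    for event in events:
        cell = matrix.get((system, event), {"meets_objectives": None})
        status = {True: "OK", False: "GAP", None: "TBD"}[cell["meets_objectives"]]
        print(f"{system:12} | {event:22} | {status}")
```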

Once you have done all that, you can go back to management and the lines of business and tell them here is the reality – what risk are you willing to accept? This is NOT an IT problem. This is a business problem.

The business will consider the likelihood of the event; even after Nashville, an RV filled with explosives is an unlikely event, and the cost to mitigate it is likely high. For some systems the cost may be low enough and the risk high enough that management says fix it. For other systems, probably not.

The key point is that everyone from the lines of business to management to the Board all understand what the risks are and what the mitigation costs are. From this data, they can make an informed BUSINESS decision on what to do.

If you need help with this, please contact us.

Security News for the Week Ending Jan 1, 2021

Happy New Year. May 2021 be more sane than 2020.

Microsoft Says Goal of SolarWinds Attack Was Your Cloud Data

Microsoft says that the objective of the SolarWinds hackers was to get into a number of organizations and then pick and choose which ones to attack, leaving the back door in place at the others for future operations. One way to do that was to generate SAML login tokens from inside the network and then use those tokens to gain access to cloud resources, including, but not limited to, email. The article also provides more information on how to detect whether the hackers compromised your cloud resources. Credit: Bleeping Computer

“Swatting” Moves to the Next Level

Swatting, the practice of calling in a fake 911 report, such as a fake kidnapping, so that SWAT teams are deployed to the victim’s location, is moving to the next level. As if having the police show up unexpectedly with lots of guns and break down your door isn’t bad enough, now the perpetrators are taking advantage of the fact that people choose crappy passwords and are watching and listening to the police assault on the victim’s own smart devices. On occasion, the situation becomes deadly when the police, not really knowing what to believe, shoot the victim. On rare occasions, the swatters, as they are called, are caught and prosecuted. Credit: Bleeping Computer

I Think The Wasabi Got a Little Too Hot

Wasabi, the cloud storage competitor to Amazon S3 that claims it is significantly cheaper than Amazon and 99.999999999% reliable, just got a little less reliable. Their domain registrar notified them of some malware hosted on one of their domains. Unfortunately, the registrar sent the notice to the wrong email address. The registrar, following normal procedures, then suspended the domain for not responding to an email Wasabi never got, knocking many of their customers offline.

After Wasabi discovered they had been DDoSed by their domain registrar, they suspended the offending customer and asked to get their domain back. That process took over 13 hours. Are you ready for this kind of attack from your suppliers?

That attack probably knocked several of those 9’s off their reliability, depending on how they do the math.

Credit: Bleeping Computer

SolarWinds’ Troubles Are Not Over

A second piece of malware called SUPERNOVA, and a zero-day vulnerability that it exploited, make it look like there may have been a second attack against SolarWinds. This appears to be separate from the Russian attack, and the attack vector is different too – this is not an attack on SolarWinds’ code base. This spells additional trouble for SolarWinds. Credit: Security Week

The Cloud Adds New Security Risks

Yesterday’s double trouble outage should remind businesses that planning for outages and continuing to operate is not optional.

The first outage was at Microsoft, where its Active Directory service had some problems. Active Directory is used to “authenticate” users and services, so if it doesn’t work, not much else does.

The good news is that it happened towards the end of the work day (around 5:30 PM Eastern time, for about 3 hours or so), so some of the pain was deflected. This particular type of outage is hard to build redundancy for because it affected behind-the-scenes infrastructure.

The second trouble was when 911 services in many communities in 14 states went down around 4:30 PM Mountain time. There was some question about whether the two outages were related, but based on what we are hearing, that is not the case. Losing 911 services is slightly more important than, say, losing access to Twitter, even though the current occupant of the White House might disagree with that.

Like many companies, Public Safety Answering Points or PSAPs, which is the technical name for 911 call centers, have outsourced some or all of their tech. Both companies involved with yesterday’s 911 outage have recently changed their names – likely to shed the reputations they had before. The company that the PSAPs contract with is Intrado, formerly known as West Safety Communications. Intrado says their outage was the fault of one of their vendors, Lumen. Many of you know Lumen as the company formerly known as CenturyLink (actually, it is a piece of CenturyLink).

The bottom line here is that whether you are a business selling or servicing widgets or a 911 operator, you are dependent on tech and more and more, you are dependent on the cloud. You are also dependent on third parties.

You need to decide how long you are willing to be down and how often. In general, cloud services are reliable, some more than others. But by moving to the cloud and using third party vendors, you have lost some insight into the tradeoffs being made. These vendors are trying to save money. While you might agree with their decisions, you are never consulted and likely never informed.

You may be okay with this, but it should be a conscious decision, not something that happens accidentally.

Do you have a disaster recovery plan? Or a business continuity plan? When was it last tested? Are you happy with the results?

These outages were relatively short-lived. For most people, the Microsoft outage lasted around 3-4 hours and the 911 outage lasted around 1-2 hours. But many outages like these have lasted much longer than that.

Have you asked your vendors (cloud or otherwise) about their plans? Do you believe them? Are there meaningful penalties in the contract to cover your losses and your customers’ losses? Are you okay with the inevitable outages?

Consider this outage an opportunity. Credit: Brian Krebs

Who Wants to Hear Fiction About System Recovery Time

A survey of small and medium size businesses asked executives about their Recovery Time Objectives or RTOs. A company’s RTO represents the amount of time a system, such as a web site, can be down after an incident. The incident could be a software error, hardware failure, ransomware attack or many other things. Here are some of the answers they got.

  • 92% of SMB executives said they believe their businesses are prepared to recover from a disaster.

My first question for these executives is when was the last time you TESTED that preparation and what was the result? My guess is that the primary answer will be that it has never been tested.

  • 20% say that they do not have a data backup or disaster recovery solution in place.

If so, how are they prepared to recover?

  • 16% of executives say that they do not know their own recovery time objectives, but 24% expect to recover in less than 10 minutes and 29% expect to recover in less than an hour.

So, while 20% don’t even have a data backup solution in place, more than half expect to recover in less than an hour.

The results are from a survey of 500 SMB execs, 87% of whom were CEOs.

  • Of those who said they knew what their RTOs are, 9% said it was less than one minute, 30% said it was under an hour and 17% said it was under a day.

Compare that to recent ransomware attacks. Atlanta took several months to recover. Travelex was down for over a month.

How do all of these SMB execs figure they are smarter than these guys who took weeks and months to recover?

Another problem is that people don’t agree on what the definition of a disaster is. Is it recovering from a data loss or recovering from a malware attack or the ability to become operational quickly or what?

Bottom line – executives need to understand this recovery thing because experience tells me that it takes way longer to recover than people seem to think it does. And, for most companies, if their systems are down, they are not making money and are spending money.

If executives think they have a handle on this, they should conduct a mock disaster drill and see how long recovery actually takes. For most companies it will not be 10 minutes or an hour.

Need some help figuring this out? Contact us. Credit: Help Net Security

Country of Georgia Hacked

Well, it seemed like the whole damn country was.

Over 15,000 websites were hacked, including, not surprisingly, newspapers, government offices and TV stations.

After the sites were defaced by the hackers, they were taken offline.

Newspapers said it was the biggest attack in the country’s history, even bigger than the 2008 attack by Russia.

This attack even affected some of the country’s courts and banks.

Needless to say, given the country’s history with Russia, there was some panic.

However, a web hosting company, Pro-service, admitted that its network was the one attacked.

By late in the day more than half of the sites were back online and they were working on the rest.

The hackers defaced the sites with a picture of former president Mikheil Saakashvili, with the text “I’ll be back” overlaid on top.

Saakashvili is in exile in Ukraine now but was generally seen as anti-corruption, so it is unlikely that Russia did it this time, though the attack does seem to be politically motivated.

At least two TV stations went off the air right after the attack.

Given that Georgia (formerly known as the Republic of Georgia) is not vital to you and me on an everyday basis, why should we care?

The answer is that if it could be done there, it could be done here too. Oh. Wait. They already did that (see here). In that case, it was the Chinese and the damage was much greater.

The interesting part for both the Chinese attack on us and the <whoever did it> attack on Georgia is that one attack on a piece of shared infrastructure can do an amazing amount of damage.

Think about what happens when Amazon, Microsoft or Google goes down – even without a cyberattack.

The folks in DC are already planning how to respond to an attack on shared infrastructure like banking, power, water, transportation and other critical infrastructure.  You and I don’t have much ability to impact that part of the conversation, but we do have impact on our own infrastructure.

Apparently this attack was pretty simple and didn’t do much damage, but that doesn’t mean that some other attack will also be low tech or do little damage. What if an attack disabled one or a few Microsoft or Amazon data centers? Microsoft is already rationing VMs in US East 2 due to lack of capacity. What would happen if they lost an entire data center?

This falls under the category of disaster recovery and  business continuity.  Hackers are only one case, but the issue of shared infrastructure makes the impact much greater.  If all of your servers were in your office like they used to be, then attacks would be more localized.  But there are many advantages to cloud infrastructure, so I am not suggesting going back to the days of servers in a closet.

Maybe Microsoft or Amazon are resilient enough to withstand an attack (although it seems like self-inflicted wounds already do quite a bit of damage without the help of outside attackers), but what about smaller cloud providers?

What if one or more of your key cloud providers had an outage? Are you ready to handle that? As we saw with the planned power outages in California this past week, stores that lost power had to lock their doors because their cash registers didn’t work. Since nothing has a price tag on it anymore, they couldn’t even take cash – assuming you could find a gas station to fill your car or an ATM to get you that cash in the first place.

Bottom line is that shared infrastructure is everywhere and we need to plan for what we are going to do when (not if) that shared infrastructure takes a vacation.

Plan now.  The alternative may be to shut the doors until the outage gets fixed and if that takes a while, those doors may be locked forever.

Is Your DR Plan Better Than London Gatwick Airport’s?

Let’s assume that you are a major international airport that moves 45 million passengers and 97,000 tons of cargo a year.

Then let’s say you have some form of IT failure.  How do you communicate with your customers?

At London’s Gatwick airport, apparently your DR plan consists of trotting out a small white board and giving a customer service agent a dry erase marker and a walkie-talkie.

On the bright side, they are using black markers for on-time flights and red markers for others.

Gatwick is blaming Vodafone for the outage.  Vodafone does contract with Gatwick for certain IT services.

You would think that an organization as large as Gatwick would have a well planned and tested Disaster Recovery strategy, but it would appear that they don’t.

Things, they say, will get back to normal as soon as possible.

Vodafone is saying:

“We have identified a damaged fibre cable which is used by Gatwick Airport to display flight information. Our engineers are working hard to fix the cable as quickly as possible. This is a top priority for us and we are very sorry for any problems caused by this issue.”

But who is being blasted on social media with phrases like “absolute shambles”, “utter carnage” and “huge delays”? Not Vodafone.

Passengers are snapping cell phone pictures and posting to social media with snarky comments.

Are you prepared for an IT outage?

First of all, there are a lot of possible failures that could happen. In this case, it was a fiber cut that somehow took everything out. Your mission, should you decide to accept it, is to identify all the possible failures. Warning: if you do a good job of brainstorming, there will be a LOT.

Next you want to triage those failure modes. Some of them will have a common root cause or a common possible fix. For others, you won’t really know what the fix is.

You also want to identify the impact of each failure. In Gatwick’s case, the failure of all of the sign boards throughout the airport, while extremely embarrassing and sure to generate a lot of ridicule on social media, is probably less critical than a failure of the gate management software, which would basically stop planes from landing because there would be no way to assign those planes to gates. A failure of the baggage automation system would stop them from loading and unloading bags, which is a big problem.
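
To make the triage step concrete, here is a toy Python sketch of a failure-mode register ranked by a simple likelihood-times-impact score. The failure modes, scores and mitigations are made-up placeholders loosely inspired by the Gatwick story, not a real assessment.

```python
# A toy failure-mode register: every failure mode, score and mitigation
# here is an invented placeholder for illustration only.
failure_modes = [
    {"name": "fiber cut to campus",     "likelihood": 2, "impact": 5, "mitigation": "second carrier, diverse path"},
    {"name": "sign boards offline",     "likelihood": 3, "impact": 2, "mitigation": "whiteboards + announcements"},
    {"name": "gate management down",    "likelihood": 2, "impact": 5, "mitigation": "warm standby instance"},
    {"name": "baggage automation down", "likelihood": 2, "impact": 4, "mitigation": "manual sortation procedure"},
]

# Rank by a simple risk score so the worst combinations float to the top.
for fm in sorted(failure_modes, key=lambda f: f["likelihood"] * f["impact"], reverse=True):
    score = fm["likelihood"] * fm["impact"]
    print(f"risk={score:2}  {fm['name']:28} mitigation: {fm['mitigation']}")
```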

Once you have done all that, you can decide which failures you are willing to live with and which ones are a problem.

Then you can brainstorm ways to mitigate the failure.  Apparently, in Gatwick’s case, rounding up a few white boards, felt tip markers and walkie talkies was considered acceptable.

After the beating they took today on social media, they may be reconsidering that decision.

In some cases you may want an automated disaster recovery solution; in other cases, a manual one may be acceptable; and in still others, living with the outage until it is fixed may be OK.

Timing may also play a factor in this answer. For example, if the payroll system goes down but the next payroll isn’t for a week, it MAY not be a problem at all, but if payroll has to be produced today or tomorrow, it could be a big problem.
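
A tiny Python sketch of that timing point, using hypothetical dates and a made-up payroll_at_risk helper: the same outage is a non-event a week before payroll and an emergency the day before.

```python
from datetime import datetime, timedelta

def payroll_at_risk(outage_start: datetime, estimated_repair: timedelta,
                    next_payroll_deadline: datetime) -> bool:
    """Return True if the estimated repair time pushes past the next payroll deadline.

    A toy illustration only; all names and values are hypothetical.
    """
    return outage_start + estimated_repair > next_payroll_deadline

outage = datetime(2021, 1, 4, 12, 0)   # outage starts at noon on Jan 4
repair = timedelta(hours=24)           # estimated 24 hours to recover

print(payroll_at_risk(outage, repair, datetime(2021, 1, 11, 9, 0)))  # False: a week of slack
print(payroll_at_risk(outage, repair, datetime(2021, 1, 5, 9, 0)))   # True: deadline is tomorrow morning
```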

All of this will be part of your business continuity and disaster recovery program.

Once you have this disaster recovery and business continuity program written down, you need to create a team to run it, train them and test it. And test it. And test it. When I was a kid there was a big power failure in the northeast. A large teaching hospital in town lost power, but, unfortunately, no one had trained people on how to start the generators. That meant that for several hours, until they found the only guy who knew how to start the generators, nurses were running heart-lung machines and other critical patient equipment by hand. They fixed that problem immediately after the blackout, so the next time it happened, all people saw was a blink of the lights. Test. Test. Test!

If this seems overwhelming, please contact us and we will be pleased to assist you.

Information for this post came from Sky News.