Category Archives: Business Continuity

The Cloud Adds New Security Risks

Yesterday’s double dose of trouble should remind businesses that planning for outages, and for continuing to operate through them, is not optional.

The first outage was at Microsoft, where its Active Directory services had problems. Active Directory is used to “authenticate” users and services, so if it doesn’t work, not much else does.

The good news is that it happened towards the end of the work day (around 5:30 PM Eastern time, for about three hours), so some of the pain was deflected. This particular type of outage is hard to build redundancy for because it affected behind-the-scenes infrastructure.

The second problem came when 911 services in many communities across 14 states went down around 4:30 PM Mountain time. There was some question about whether the two outages were related, but based on what we are hearing, that is not the case. Losing 911 service is slightly more important than, say, losing access to Twitter, even though the current occupant of the White House might disagree with that.

Like many companies, Public Safety Answering Points, or PSAPs – the technical name for 911 call centers – have outsourced some or all of their tech. Both companies involved in yesterday’s 911 outage have recently changed their names, likely to shed the reputations they had before. The company the PSAPs contract with is Intrado, formerly known as West Safety Communications. Intrado says the outage was the fault of one of its vendors, Lumen. Many of you know Lumen as the company formerly known as CenturyLink (actually, it is a piece of CenturyLink).

The bottom line here is that whether you are a business selling or servicing widgets or a 911 operator, you are dependent on tech and, more and more, on the cloud. You are also dependent on third parties.

You need to decide how long you are willing to be down and how often. In general, cloud services are reliable – some more than others. But by moving to the cloud and using third-party vendors, you have lost some insight into the tradeoffs being made. Those vendors are trying to save money. You might agree with their decisions, but you are never consulted and likely never informed.

You may be okay with this, but it should be a conscious decision, not something that happens accidentally.
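
Part of making it a conscious decision is knowing what “how long” and “how often” actually mean in practice. Here is a minimal back-of-the-envelope sketch (the availability targets are examples, not any vendor’s actual SLA numbers) showing how an availability percentage translates into the downtime it actually allows:

    # Sketch: translate an availability target into an allowed-downtime budget.
    # The targets below are illustrative, not any particular vendor's SLA.
    MINUTES_PER_YEAR = 365 * 24 * 60

    def downtime_budget(availability_pct):
        """Return the downtime (in minutes) allowed per year and per month."""
        down_fraction = 1 - availability_pct / 100
        per_year = MINUTES_PER_YEAR * down_fraction
        return per_year, per_year / 12

    for target in (99.0, 99.9, 99.99):
        per_year, per_month = downtime_budget(target)
        print(f"{target}% availability allows about {per_year / 60:.1f} hours of downtime "
              f"per year ({per_month:.0f} minutes per month)")

Even a 99.9% target still allows almost nine hours of downtime a year. Decide whether that fits your business before an outage makes the decision for you.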

Do you have a disaster recovery plan? Or a business continuity plan? When was it last tested? Are you happy with the results?

These outages were relatively short-lived. The Microsoft outage affected most people for around 3-4 hours; the 911 outage lasted around 1-2 hours. But many outages like these have lasted much longer than that.

Have you asked your vendors (cloud or otherwise) about their plans? Do you believe them? Are there meaningful penalties in the contract to cover your losses and your customers’ losses? Are you okay with the inevitable outages?

Consider this outage an opportunity. Credit: Brian Krebs

Who Wants to Hear Fiction About System Recovery Time?

A survey of small and medium size businesses asked executives about their Recovery Time Objectives, or RTOs. A company’s RTO is the maximum amount of time a system, such as a web site, is allowed to be down after an incident. The incident could be a software error, a hardware failure, a ransomware attack or many other things. Here are some of the answers they got.

  • 92% of SMB executives said they believe their businesses are prepared to recover from a disaster.

My first question for these executives is when was the last time you TESTED that preparation and what was the result? My guess is that the primary answer will be that it has never been tested.

  • 20% say that they do not have a data backup or disaster recovery solution in place.

If so, how are they prepared to recover?

  • 16% of executives say that they do not know their own recovery time objectives, but 24% expect to recover in less than 10 minutes and 29% expect to recover in less than an hour.

So, while 20% don’t even have a data backup solution in place, more than half expect to recover in less than an hour.

The results are from a survey of 500 SMB execs, 87% of whom were CEOs.

  • Of those who said they knew what their RTOs are, 9% said it was less than one minute, 30% said it was under an hour and 17% said it was under a day.

Compare that to recent ransomware attacks. Atlanta took several months to recover. Travelex was down for over a month.

How do all of these SMB execs figure they are smarter than these guys who took weeks and months to recover?

Another problem is that people don’t agree on what counts as a disaster. Is it recovering from a data loss, recovering from a malware attack, the ability to become operational again quickly, or something else?

Bottom line – executives need to understand this recovery thing because experience tells me that it takes way longer to recover than people seem to think it does. And, for most companies, if their systems are down, they are not making money and are spending money.

If executives think they have a handle on this – conduct a mock disaster drill and see how long recovery takes. For most companies it will not be 10 minutes or an hour.
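
If you want something concrete to capture during such a drill, here is a minimal sketch (the systems, RTOs and timestamps are made up for illustration) that compares how long each system actually took to recover against its stated RTO:

    from datetime import datetime

    # Illustrative drill log: (system, stated RTO in minutes, down_at, restored_at).
    # All names, RTOs and times below are hypothetical.
    drill_results = [
        ("web site",        60, "2020-10-01 09:00", "2020-10-01 10:45"),
        ("order database", 120, "2020-10-01 09:00", "2020-10-01 14:30"),
        ("email",          240, "2020-10-01 09:00", "2020-10-01 11:15"),
    ]

    FMT = "%Y-%m-%d %H:%M"
    for system, rto_minutes, down_at, restored_at in drill_results:
        actual = (datetime.strptime(restored_at, FMT)
                  - datetime.strptime(down_at, FMT)).total_seconds() / 60
        status = "met RTO" if actual <= rto_minutes else "MISSED RTO"
        print(f"{system}: RTO {rto_minutes} min, actual {actual:.0f} min -> {status}")

Run this against real drill results and the gap between the stated RTO and reality tends to show up very quickly.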

Need some help figuring this out? Contact us. Credit: Help Net Security

Country of Georgia Hacked

Well it seemed like the whole damn country.

Over 15,000 websites were hacked, including, not surprisingly, newspapers, government offices and TV stations.

After the sites were defaced by the hackers, they were taken offline.

Newspapers said it was the biggest attack in the country’s history, even bigger than the 2008 attack by Russia.

This attack even affected some of the country’s courts and banks.

Needless to say, and based on the history with Russia, there was some panic around.

However, a web hosting company, Pro-Service, admitted that its network was the one that was attacked.

By late in the day more than half of the sites were back online and they were working on the rest.

The hackers defaced the sites with a picture of former president Mikheil Saakashvili, with the text “I’ll be back” overlaid on top.

Saakashvili is now in exile in Ukraine but was generally seen as anti-corruption, so it is unlikely that Russia did it this time, but the attack does seem to be politically motivated.

At least two TV stations went off the air right after the attack.

Given that Georgia (formerly known as the Republic of Georgia) is not vital to you and me on an everyday basis, why should we care?

The answer is that if it could be done there, it could be done here too.  Oh.  Wait.  They already did that (see here).  In that case, it was the Chinese and the damage was much greater.

The interesting part for both the Chinese attack on us and the <whoever did it> attack on Georgia is that one attack on a piece of shared infrastructure can do an amazing amount of damage.

Think about what happens when Amazon, Microsoft or Google go down – even without a cyberattack.

The folks in DC are already planning how to respond to an attack on shared infrastructure like banking, power, water, transportation and other critical infrastructure.  You and I don’t have much ability to impact that part of the conversation, but we do have impact on our own infrastructure.

Apparently this attack was pretty simple and didn’t do much damage, but that doesn’t mean the next attack will also be low tech or do little damage.  What if an attack disabled one or a few Microsoft or Amazon data centers?  Microsoft is already rationing VMs in US East 2 due to lack of capacity.  What would happen if they lost an entire data center?

This falls under the category of disaster recovery and  business continuity.  Hackers are only one case, but the issue of shared infrastructure makes the impact much greater.  If all of your servers were in your office like they used to be, then attacks would be more localized.  But there are many advantages to cloud infrastructure, so I am not suggesting going back to the days of servers in a closet.

Maybe Microsoft or Amazon are resilient enough to withstand an attack (although it seems like self-inflicted wounds already do quite a bit of damage without the help of outside attackers), but what about smaller cloud providers?

What if one or more of your key cloud providers had an outage?  Are you ready to handle that?  As we saw with the planned power outages in California this past week, stores that lost power had to lock their doors because their cash registers didn’t work.  Since nothing has a price tag on it anymore, they couldn’t even take cash – assuming you could find a gas station to fill your car or an ATM to get you that cash.

The bottom line is that shared infrastructure is everywhere, and we need to plan for what we are going to do when – not if – that shared infrastructure takes a vacation.

Plan now.  The alternative may be to shut the doors until the outage gets fixed and if that takes a while, those doors may be locked forever.

Is Your DR Plan Better Than London Gatwick Airport’s?

Let’s assume that you are a major international airport that moves 45 million passengers and 97,000 tons of cargo a year.

Then let’s say you have some form of IT failure.  How do you communicate with your customers?

At London’s Gatwick airport, apparently your DR plan consists of trotting out a small white board and giving a customer service agent a dry erase marker and a walkie-talkie.

On the bright side, they are using black markers for on time flights and red markers for others.

Gatwick is blaming Vodafone for the outage.  Vodafone does contract with Gatwick for certain IT services.

You would think that an organization as large as Gatwick would have a well planned and tested Disaster Recovery strategy, but it would appear that they don’t.

Things, they say, will get back to normal as soon as possible.

Vodafone is saying:

“We have identified a damaged fibre cable which is used by Gatwick Airport to display flight information. Our engineers are working hard to fix the cable as quickly as possible. This is a top priority for us and we are very sorry for any problems caused by this issue.”

But who is being blasted on social media with complaints of an “absolute shambles”, “utter carnage” and “huge delays”?  Not Vodafone.

Passengers are snapping cell phone pictures and posting to social media with snarky comments.

Are you prepared for an IT outage?

First of all, there are a lot of possible failures that could happen.  In this case, it was a fiber cut that somehow took everything out.  Your mission, should you decide to accept it, is to identify all the possible failures.  Warning: if you do a good job of brainstorming, there will be a LOT.

Next you want to triage those failure modes.  Some of them will have a common root cause or a common possible fix.  For others, you won’t really know what the fix is.

You also want to identify the impact of each failure.  In Gatwick’s case, the failure of all of the sign boards throughout the airport, while extremely embarrassing and sure to generate a lot of ridicule on social media, is probably less critical than a failure of the gate management software, which would basically stop planes from landing because there would be no way to assign them to gates.  A failure of the baggage automation system would stop them from loading and unloading bags, which is a big problem.

Once you have done all that, you can decide which failures you are willing to live with and which ones are a problem.
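
One simple way to organize that decision is to score each failure mode for likelihood and impact and rank them. The sketch below is only an illustration; the failure modes and scores are made up and, in real life, would come out of your own brainstorming:

    # Illustrative failure-mode inventory. Scores run 1 (low) to 5 (high) and are
    # invented for this example; a real list comes from your own brainstorming.
    failure_modes = [
        {"failure": "fiber cut takes out flight information displays", "likelihood": 2, "impact": 3},
        {"failure": "gate management software outage",                 "likelihood": 2, "impact": 5},
        {"failure": "baggage automation system outage",                "likelihood": 3, "impact": 4},
        {"failure": "payroll system outage",                           "likelihood": 2, "impact": 2},
    ]

    # Rank by a simple risk score (likelihood x impact), then work down the list
    # deciding which failures need an automated fix, which get a manual workaround
    # (white boards and walkie-talkies), and which you are willing to accept.
    for mode in sorted(failure_modes, key=lambda m: m["likelihood"] * m["impact"], reverse=True):
        score = mode["likelihood"] * mode["impact"]
        print(f"{score:2d}  {mode['failure']}")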

Then you can brainstorm ways to mitigate the failure.  Apparently, in Gatwick’s case, rounding up a few white boards, felt tip markers and walkie talkies was considered acceptable.

After the beating they took today on social media, they may be reconsidering that decision.

In some cases you may want an automated disaster recovery solution; in other cases a manual one may be acceptable; and in still others, living with the outage until it is fixed may be OK.

Time may also factor into the answer.  For example, if the payroll system goes down but the next payroll isn’t due for a week, it MAY not be a problem at all, but if payroll has to be produced today or tomorrow, it could be a big problem.

All of this will be part of your business continuity and disaster recovery program.

Once you have this disaster recovery and business continuity program written down, you need to create a team to run it, train them and test it.  And test it.  And test it.  When I was a kid there was a big power failure in the northeast.  A large teaching hospital in town lost power, but, unfortunately, no one had trained people on how to start the generators.  That meant that for several hours, until they found the only guy who knew how to start them, nurses were running heart-lung machines and other critical patient equipment by hand.  The hospital fixed that problem immediately after the blackout, so the next time it happened, all people saw was a blink of the lights.  Test.  Test.  Test!

If this seems overwhelming, please contact us and we will be pleased to assist you.

Information for this post came from Sky News.

Lessons From LabCorp

As I wrote last week, LabCorp, the mega medical lab testing company (mega as in revenue of around $10 billion last year), was breached.  They have provided some interesting insights, since they have been forced to detail to the SEC some of what happened last week when they had to shut down large parts of their network unannounced, putting a stop to the testing of lab samples, both in house and on the way.

From what we are gleaning from their filings, they were hit with a ransomware attack, likely a SamSam variant, which seems to have an affinity for the healthcare industry.

They claim that their Security Operations Center was notified, we assume automatically, when the first computer was infected.

That, by itself, is pretty amazing.  I bet less than one percent of U.S. companies could achieve that benchmark.

Then, they say, they were able to contain the malware within 50 minutes of the first alert.  That, too, is pretty amazing.  In order to do that, you have to know what you are dealing with and how it spreads.  Then you have to figure out which “circuit breakers” to trip in order to contain the malware.  The City of Denver was hit with a denial of service attack a couple of years ago and it took them, they say, a couple of hours to figure out how to disconnect from the Internet.  That is more typical than what LabCorp was able to do.

The attack started at around midnight, when, of course, the fewest people were around to deal with it.  If you factor that into the 50-minute containment time, it is pretty impressive.

However, in that very short 50 minute interval, 7,000 systems were infected including 1,900 servers.  Those numbers are not so good.  Of the 1,900 servers, 300 of these were production servers.  That is really not so good.

One of the attack vectors for SamSam is an old Microsoft protocol called Remote Desktop Protocol, or RDP.

RDP should never be publicly accessible (we don’t know whether it was here).  If it is used internally, it should be severely limited and, where it is needed, it should require multifactor authentication.  While we don’t know for sure, it is likely that RDP was the attack vector and that they did not have multifactor authentication turned on.  Hopefully, as part of their lessons learned, they will change that.
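
If you are not sure whether your own RDP is exposed, a quick connectivity test is a reasonable first check.  The sketch below simply tries to open TCP port 3389 on hosts you own (the host names are placeholders); it is not a substitute for a proper external scan, putting RDP behind a VPN, or multifactor authentication:

    import socket

    # Placeholder host names - replace with systems you own and are authorized to test.
    hosts_to_check = ["server1.example.com", "server2.example.com"]
    RDP_PORT = 3389  # Remote Desktop Protocol's default TCP port

    for host in hosts_to_check:
        try:
            # If this connects from outside your network, RDP is publicly exposed.
            with socket.create_connection((host, RDP_PORT), timeout=3):
                print(f"{host}: port {RDP_PORT} is reachable - investigate immediately")
        except OSError:
            print(f"{host}: port {RDP_PORT} not reachable (good)")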

Within a few days they claimed they had 90% of their systems back.  It is not clear whether that is 90% of 7,000, which would be quite impressive or 90% of 300, which would be much less impressive but still good.

So what are the takeaways from this?

These conclusions are based mostly on what we can interpret, since they are not saying much.  That is likely because they are afraid of being sued and of whatever HIPAA sanctions they might face.

  • They seem to have excellent monitoring and alerting since they were able to detect the attack very quickly.
  • They also must have a good security operations center since they were able to identify what they were dealing with and contain it within 50 minutes.
  • On the other end of the spectrum, the malware was able to infect 7,000 machines including some production machines.  They probably need to work on this one.
  • Assuming RDP was the infection vector, that should not have happened at all – they lose points for this one.
  • They were able to restart a significant number of machines pretty quickly so it would appear that they have some degree of disaster recovery.
  • On the other hand, given that they had to shut down their network and stop processing lab work, it says that their business continuity process could use some work.
  • Finally, they claim that they were able to KNOW that none of the data was removed from the network.  I would say that 99% of companies could not do that.

Overall, you can compare how your company stacks up against LabCorp and figure out where you can improve.

Using other companies’ bad luck to learn lessons is probably the least expensive way to improve your security.

I suggest that this is a great breach from which to learn lessons.

Information for this post came from CSO Online.

How to Spend $100 Million Without Even Trying

UPDATE: The Sun, not always the most reliable information source, is saying the outage and its trickle-down effects affected 300,000 passengers and may cost the airline $300+ million.  The CEO, Alex Cruz, allegedly said, when warned earlier about problems with the new system installed last fall, that it was the staff’s fault, not the system’s, that things were not working as desired.  Cruz, trying to rein in the damage, told staff in an email to stop talking about what happened.  Others have said that the people at Tata did not have the skills to start up and run the backup system – certainly not the first time things get bumpy when you replace on-shore resources with much lower paid off-shore resources, resources who have zero history in the care and feeding of that particular, very complex system.  Even if the folks at Tata were experienced at operating some complex computer system, no two systems are the same, and there is so much chewing gum and baling wire holding systems together in the airline industry that, without legacy knowledge of that particular system, likely no one could make it work right.

Of all of the weekends for an airline to have a computer systems meltdown, Memorial Day weekend is probably not the one that you would pick.

Unfortunately for British Airways, they didn’t get to “pick” when the event happened.

Early Saturday British Airways had a systems meltdown.  This really is a meltdown since the web site and mobile apps stopped working, passengers could not check in and employees could not manage flights, among other things.

Passengers at London’s two largest airports – Heathrow and Gatwick – were not getting any information from the staff.  Likely this was due to the fact that the systems that the staff normally used to get information were not working.

Initially, BA cancelled all flights out of London until 6 PM on Saturday, but later cancelled them for the whole day.

Estimates are that 1,000 flights were cancelled.

Given that this was a holiday weekend, likely every flight was full.  If you conservatively assume 100 passengers per flight, cancelling 1,000 flights affected 100,000 passengers.  With flights full, even if BA wanted to rebook people, there probably aren’t available seats over the next couple of days.  That means a lot of these passengers are going to have to cancel their trips.  Given that the airline couldn’t blame the weather or another natural disaster, they will likely have to refund passengers their money.  This doesn’t mean giving people credit towards a future trip, but rather writing them a check.

In Britain, airlines are required to pay penalties of up to 600 Euros per passenger, depending on the length of the delay and the length of the flight.

In addition they are required to pay for food and drinks and pay for accommodations if the delay is overnight – and potentially multiple nights.
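
To get a rough sense of how those obligations add up, here is a back-of-the-envelope sketch.  It uses the estimates from above (roughly 1,000 cancelled flights at about 100 passengers each) and the commonly cited EU261-style compensation bands of roughly 250, 400 and 600 Euros depending on flight distance; the split of passengers across those bands is invented purely for illustration:

    # Back-of-the-envelope estimate of compensation exposure. All inputs are
    # rough assumptions, not figures from BA.
    cancelled_flights = 1_000
    passengers_per_flight = 100
    total_passengers = cancelled_flights * passengers_per_flight

    # Assumed split of passengers across EU261-style compensation bands.
    bands = [
        ("short haul  (~250 EUR)", 250, 0.50),
        ("medium haul (~400 EUR)", 400, 0.30),
        ("long haul   (~600 EUR)", 600, 0.20),
    ]

    exposure = sum(rate * share * total_passengers for _, rate, share in bands)
    print(f"Affected passengers: {total_passengers:,}")
    print(f"Estimated compensation exposure: {exposure:,.0f} EUR")
    # Roughly 36.5 million Euros under these assumptions - before refunds, hotels,
    # meals and lost goodwill, which is how estimates climb toward $100-200 million.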

Of course there are IT people working around the clock trying to apply enough Band-Aids to get traffic moving again.

Estimates are, so far, that this could cost the airline $100 million or more.  Another estimate says close to $200 million.  Hopefully they have insurance for this, but carrying $200 million in business interruption insurance is unlikely and many BI policies have a waiting period – say 12 hours – before the policy kicks in.

But besides this being an interesting story – assuming you were not travelling in, out or through London this weekend – there is another side of the story.

First, one of the unions blamed BA’s decision to outsource IT to a firm in India (Tata).  BA said that was not the problem.  It is true that BA has been trying to reduce costs in order to compete with low cost carriers, so who knows.  In any case, when you outsource, you really do need to make sure that you understand the risks and that doesn’t matter whether the outsourcer is local or across the globe.  We may hear in the future what happened, but, due to lawsuits, we may only hear about what happened inside of a courtroom.

Apparently, the disaster recovery systems didn’t come online after the failure as they should have.  Whether that was due to cost reduction and its associated secondary effects, we may never know.

More importantly, it is certainly clear that British Airways’ disaster recovery and business continuity plan was not prepared for an event like this.

At one point the CEO of BA was forced to say, in the public media, that people should stay away from the airport.  Don’t come.  Stay home.  From a branding standpoint, it doesn’t get much worse than that.  Fly BA – please stay home.

As part of the disaster recovery plan, you need to consider contingencies.  In the case of an airline, that includes how, when you cancel flights, you get bags back to your customers.  Today, two days later, people are saying that they still don’t have their luggage and that they can’t get BA to answer the phones.  BA is now saying that it could be “quite a while” before people get their luggage back, and if they don’t, that is more cost for BA to cover.

One has to assume that the outcome of all of this will be a lot of lawsuits.

From a branding standpoint this has got to be pretty ugly.  You know that there has been a lot of social media chatter full of horror stories.  In one article that I read, a passenger on a trip from London to New York talked about all the money they were going to lose on things they had planned to do once they got to New York.  Whether BA will have to pay for all of that is unclear, but likely at least some of it.

You also have to assume that at least some passengers will book their next flight on “any airline, as long as it is not BA”.

To be fair to BA, there have been other large airline IT system failures in the last year, but this one is a biggie.   Likely these failures are, at least in part, due to the complex web of automation that the airlines have cobbled together after years of cost cutting and mergers.  Many of these systems are so old that the people who wrote them are long dead and the computer languages – notably COBOL – are considered dead languages.

The fact that there were no plans (at least none that worked) for how to deal with this – how to manage tens of thousands of tired, hungry, grumpy passengers – is an indication that they have work to do.

But bringing this home: what would happen to your company if the computers stopped working and it took you a couple of days to recover?  I know that in retail, where all the cash registers are computerized and nothing has a price tag on it anymore, businesses are forced to close the store.    We saw a bigger version of that at the Colorado Mills Mall in Golden earlier this month.  In that case, a number of businesses will likely fail and people will lose their jobs and their livelihoods.

My suggestion is to get people together, think about likely and not so likely events and see how well prepared your company is to deal with each of them.  Food for thought.

Information for this post came from the Guardian (here and here), The Next Web and Reuters.