How a computer problem can shut down an airline like Delta
'We've kind of painted ourselves into a corner where we must rely on computer systems,' prof says
The system outage that left thousands of Delta Air Lines passengers around the world facing flight cancellations and delays on Monday shows how computer-dependent society has become, and airlines have to decide whether their backup technologies are good enough to deal with that reality, a Canadian computer networking expert says.
"We've kind of painted ourselves into a corner where we must rely on computer systems," said Srinivasan Keshav, a professor of computer science at the University of Waterloo.
"[But] we have now been able to build systems which are very tolerant of losses, of parts of the system being taken down."
- 'What a nightmare': Nearly 250 flights cancelled next day as Delta recovers from system outage
- 650 Delta flights cancelled after worldwide outage
The key, Keshav said, is to adopt the model that technology leaders like Google have, known as "system fault tolerance," which assumes any single component in a computer network can fail at any time, but it doesn't matter because there are multiple backup measures in place at every level of the system.
"Failures are not exceptions. Failures are kind of normal," Keshav said, noting that companies like Google or Amazon have dozens of servers "dying every day," but with upward of 100,000 servers on hand, the systems don't crash.
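The idea Keshav describes can be illustrated with a minimal sketch. It is not how Google, Amazon or any airline actually builds its systems; the function names, the simulated servers and the failure rates are all hypothetical. The point is only that a request succeeds as long as at least one redundant server is still up.

```python
import random


class ReplicaDown(Exception):
    """Raised when a simulated server cannot answer a request."""


def make_replica(name, failure_rate, rng):
    """Build a simulated server that fails with the given probability.

    Hypothetical stand-in for a real server; failure_rate is illustrative.
    """
    def handle(request):
        if rng.random() < failure_rate:
            raise ReplicaDown(name)
        return f"{name} served {request}"
    return handle


def fault_tolerant_call(replicas, request):
    """Try each replica in turn; give up only if every one is down."""
    failed = []
    for replica in replicas:
        try:
            return replica(request)
        except ReplicaDown as exc:
            failed.append(str(exc))  # a failure is expected, so move on
    raise RuntimeError(f"all replicas failed: {failed}")


if __name__ == "__main__":
    rng = random.Random()
    # Three simulated servers, each down 30 per cent of the time (assumed).
    servers = [make_replica(f"server-{i}", 0.3, rng) for i in range(3)]
    print(fault_tolerant_call(servers, "seat lookup"))
```

With a single server, any failure is an outage. With several, the caller only sees a problem when all of them fail at once, which is why losing individual machines every day does not bring the whole system down.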
Power outage a 'surprising' cause
Delta Air Lines said the cause of Monday's mess was a power outage at its base in Atlanta, Ga., at around 2:30 a.m. ET. In a statement posted online Monday afternoon, the airline said systems were once again "fully operational" and flights had "resumed hours ago but delays and cancellations remain as recovery efforts continue."
The fact that a power outage was to blame is "surprising," Keshav said, because "it's the one thing you wouldn't expect to have happen because that's easy to get right."
Airline data centres usually have two layers of backup diesel generators and batteries to protect "critical systems," he added.
"When you look at a complex computer system such as the one that Delta runs, there's many layers of the cake, so to speak. At the bottom is power," Keshav said.
Mark Duell, vice-president of operations for the global aviation tracking website FlightAware, said airlines "go to great lengths" to make sure backup systems, including several power sources, are in place in their data centres.
"Everything from bringing in power from the utility on opposite literal sides of the building, just so [a] single backhoe can't take them both out at the same time; having more generators than they need so that they don't need all the generators to be operable; having ... multiple battery backup systems internally to cover everything until the generators come online," Duell said.
"And then down to the point of literally each computer, each server in the data centre is plugged into two different power strips and has two power supplies that are redundant."
Although he doesn't know specifically what happened in Delta's case, Duell said it was likely that the problem extended beyond a basic utility failure, since the batteries and generator backups should have kicked in.
"It was probably more than one failure," he said.
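A rough calculation shows why layered backups make a single-cause outage unlikely. The sketch below assumes each power layer fails independently, and every probability in it is made up for illustration; none of the figures come from Delta or FlightAware.

```python
from math import prod


def outage_probability(layer_failure_probs):
    """Chance that every layer fails at once, ASSUMING independent failures."""
    return prod(layer_failure_probs)


# Hypothetical per-incident failure chances for each layer of backup.
layers = {
    "utility feed": 0.01,
    "battery backup": 0.02,
    "diesel generators": 0.05,
}

p_all_fail = outage_probability(layers.values())
print(f"chance all layers fail together: {p_all_fail:.6%}")
```

Under these assumed numbers, a total loss of power needs three unlikely events to line up. The independence assumption is the weak point: a shared fault, such as a piece of equipment that sits between the backups and the servers, can take out several layers together.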
Safety not at risk
Both Duell and Keshav emphasized that the computer system outage would not have posed a risk to passengers in flight.
"The airplane is entirely independent of the ground in terms of continuing to fly," Duell said.
That's because airlines use "decoupling" in computer system design, Keshav said, meaning systems involved in actually operating the aircraft are independent from other systems like reservations or flight schedules.
The reason a system outage like this one has such an impact, Duell said, is that airlines stop and cancel flights for safety reasons when they can't get access to important computerized information like passenger counts, how much baggage has been checked or fuelling records.
"You run into those sorts of dependencies where they can't move things, but anything already moving is not in any real danger," he said.
'Critically examine' infrastructure
Delta isn't the only airline to have experienced a recent system failure.
Last month, Southwest Airlines cancelled more than 2,000 flights over several days after an outage that it blamed on a faulty network router.
United Airlines has suffered a series of delays since it merged with Continental as the technological systems of the two airlines clashed.
"It's something that happens from time to time," Duell said. "There's no particular airline that is immune to these [problems], and from what we've seen, there's none that are particularly prone to these."
Although Keshav doesn't know what measures specific airlines have already taken, he said such large-scale failures could be prevented if they invest in rigorous systems "that tolerate fault and assume faults are going to happen."
But that would entail expensive and complex engineering, requiring the replacement of legacy systems built years ago, he said.
"Banks, airlines, things like that which have been around for a while... need to at some point critically examine their infrastructure."
With files from The Associated Press