For Networks and In Life: When Things Go Down (Because They Will)
May 7, 2010 at 12:18 pm Geoff Grauer 1 comment
Yesterday, the stock market jumped around more wildly than a bull in a rodeo.
This morning, newspapers are reporting a rumor that suggests a human error caused the market to go down. To quote: “Even though there are legitimate factors to market jitters, such as the Greek debt situation, it’s widely believed that a trader’s error—typing in a “b” instead a “m,” creating a trade for $16 billion instead of $16 million—spurred the sell-off.”
The hyper-connected market responded in milliseconds, causing a loss of billions of dollars in a matter of minutes. The networked age, more than others, can be particularly unforgiving.
Even outside the Internet, we live in a world of networks. We were served a reminder of our connected world during the Northeastern Blackout of 2003, an evening when I couldn’t see the face of the person mugging me in New York City.
The next morning the sun dawned with an illumination for many: even the greatest city in the world is not invulnerable to an outage. They happen. And as any CIO worth her or his salt will tell you, effective disaster management starts with recognizing that disasters happen, and planning for scale and availability.
The Northeastern blackout was in no way an isolated incident. In our networked world, hardware and software failure rates are non-negligible and increasing. They have a severe impact. The Yankee Group estimates that one hour of downtime can cost an organization up to $4.5 million; the figure is far higher for powerhouses like Walmart and Amazon.
With so many points of failure, systems cannot be entirely modeled for reliability analysis. It’s especially difficult to predict failures in advance.
And it’s not just the machines that are to blame. Human error by system operators during system maintenance and upgrades is a major cause of system failures. It’s important that we view these disruptions as inevitable incidents that have to be coped with – rather than problems that can be solved.
Here are some pointers for coping effectively with outages:
- Recognize that they will happen.
- Emphasize recovery from failures rather than failure-avoidance.
- Like hardware, datacenters are single points of failure. When you design systems, do so with this vulnerability in mind.
- Isolation and redundancy are key. A failure in one component should not affect the rest of the system.
- In the event of an outage or failure, you should be able to add new systems and see performance increase at a proportional rate.
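For the technically inclined, here is a minimal sketch of what "recovery over avoidance" and "isolation and redundancy" might look like in code. It is only an illustration: the function and datacenter names below (call_with_failover, primary_datacenter, backup_datacenter) are made up for this example, and a real system would add timeouts, health checks and monitoring on top.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised when every redundant replica has failed."""


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Try each redundant replica in order and return the first success.

    A failure in one replica is contained (isolation) and the next replica
    takes over (redundancy), so the caller recovers instead of going down.
    """
    errors = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # contain the failure to this one replica
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed: {errors!r}")


def primary_datacenter() -> str:
    # Hypothetical primary that is assumed to be down in this example.
    raise ConnectionError("primary datacenter unreachable")


def backup_datacenter() -> str:
    # Hypothetical backup, assumed to live in a separate location.
    return "served from backup"


result = call_with_failover([primary_datacenter, backup_datacenter])
print(result)
```

The point of the design is that the outage in the primary is treated as an expected event: the caller does not try to prevent it, it simply moves on to the next replica and only gives up when nothing is left.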
By taking these simple precautions, companies, cities and entire countries can plan effectively for failure. It’s the very least they can do.
Entry filed under: Online Advertising.
1. Ad Serving Momentum – Also, Lack Thereof; Patents And Privacy; Waiting For Display Platform Version 2 | May 10, 2010 at 12:05 am
[...] The Pontiflex blog takes note of the rollercoaster ride of the stock market last Thursday around 2:45 p.m. and being ready for human-caused outages in business. In order to prepare for these outages, the writer suggests a few ideas: "Recognize that [outages] will happen. Emphasize recovery from failures rather than failure-avoidance…" Read the rest here. [...]