For Networks and In Life: When Things Go Down (Because They Will)

May 7, 2010 at 12:18 pm

Yesterday, the stock market jumped around more wildly than a bull in a rodeo.

This morning, newspapers are reporting a rumor that suggests a human error caused the market to go down. To quote: “Even though there are legitimate factors to market jitters, such as the Greek debt situation, it’s widely believed that a trader’s error—typing in a “b” instead a “m,” creating a trade for $16 billion instead of $16 million—spurred the sell-off.”

The hyper-connected market responded in milliseconds, causing a loss of billions of dollars in a matter of minutes. The networked age, more than others, can be particularly unforgiving.

Even outside the Internet, we live in a world of networks. We were served a reminder of our connected world during the Northeastern Blackout of 2003, an evening when I couldn’t see the face of the person mugging me in New York City.

The next morning the sun dawned with an illumination for many: even the greatest city in the world is not invulnerable to an outage. Outages happen. And as any CIO worth her or his salt will tell you, effective disaster management starts with recognizing that disasters happen and planning for scale and availability.

The Northeastern blackout was in no way an isolated incident. In our networked world, hardware and software failure rates are non-negligible and increasing, and failures have a severe impact. The Yankee Group estimates that one hour of downtime can cost an organization up to $4.5 million, and the figure is far higher for powerhouses like Walmart and Amazon.
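To put that hourly figure in perspective, here is a rough back-of-the-envelope sketch. It assumes downtime is spread evenly over a 365-day year and uses the $4.5 million upper estimate quoted above. The 99.9% availability level and the function names are illustrative assumptions, not figures from the Yankee Group.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability (0.0 to 1.0)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0.0 and 1.0")
    return (1.0 - availability) * HOURS_PER_YEAR


def annual_downtime_cost(availability: float, cost_per_hour: float) -> float:
    """Expected yearly cost of downtime at a flat hourly cost."""
    return annual_downtime_hours(availability) * cost_per_hour


# Hypothetical example: a system that is up 99.9% of the time,
# priced at the $4.5 million per hour upper estimate.
hours = annual_downtime_hours(0.999)
cost = annual_downtime_cost(0.999, 4_500_000)
print(f"{hours:.2f} hours down per year, about ${cost:,.0f}")
```

Even "three nines" of availability leaves nearly nine hours of downtime a year, which is why recovery planning matters as much as prevention.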

With so many points of failure, systems cannot be entirely modeled for reliability analysis. It’s especially difficult to predict failures in advance.

And it’s not just the machines that are to blame. Human error by system operators during system maintenance and upgrades is a major cause of system failures. It’s important that we view these disruptions as inevitable incidents that have to be coped with, rather than problems that can be solved.

Here are some pointers for coping effectively with outages:

  1. Recognize that they will happen.
  2. Emphasize recovery from failures rather than failure avoidance.
  3. Like hardware, datacenters are single points of failure. When you design systems, do so with this vulnerability in mind.
  4. Isolation and redundancy are key. A failure in one component should not affect the rest of the system.
  5. In the event of an outage or failure, you should be able to add new systems and see performance increase at a proportional rate.
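
As a minimal sketch of what isolation and redundancy can look like in code, the snippet below tries a list of redundant replicas in order and contains each failure so that it cannot take down the caller. The names (`call_with_failover`, `primary`, `backup`) are hypothetical, and the "datacenters" are stand-in functions rather than real services.

```python
from typing import Callable, List, Sequence


class AllReplicasFailed(Exception):
    """Raised only when every redundant replica has failed."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in turn and return the first successful response."""
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:
            # Isolation: one replica's failure is recorded, not propagated.
            errors.append(exc)
    raise AllReplicasFailed(errors)


def primary() -> str:
    # Stand-in for a datacenter that is currently down (assumption).
    raise ConnectionError("primary datacenter is down")


def backup() -> str:
    # Stand-in for a healthy redundant datacenter (assumption).
    return "served from backup"


result = call_with_failover([primary, backup])
print(result)
```

The design choice here is recovery over avoidance: the code does not try to prevent the primary from failing, it simply assumes that it will and keeps a second path ready.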

By taking these simple precautions, companies, cities and entire countries can plan effectively for failure. It’s the very least they can do.

Entry filed under: Online Advertising.
