Posted on Tuesday, May 2nd, 2006

The Enemy Within

by Craig Labovitz

Amidst the daily headlines heralding the rise of massive zombie armies and malevolent teenage hackers in far-off countries, it’s easy to lose sight of one of the biggest threats to your network.

And you may be surprised by this oft-overlooked enemy of uptime and your SLAs. (Hint: It isn’t the thousands of compromised PCs waiting to launch DDoS attacks on your customers’ servers.)

And nope, it is not the next generation of zero-day worms.

It isn’t even an external hacker.

It’s you — the well-intentioned network engineer or security admin.

Repeated studies over the last two decades have shown that most network outage hours (measured as mean time to repair) are due to poor change management. Specifically, a lack of a priori testing, missing change back-out support, and incomplete upgrade plans cause far more downtime than any external security threat.

While accurate enterprise and ISP failure statistics are notoriously hard to come by, data from a 1997 study of failures in a regional Internet provider still mirrors the experience of most tier-one and tier-two ISP engineers today: routes go down when you change the hardware (upgrade cards) or change the routing/security policies. By comparison, malicious attacks as a source of outage hours are in the noise.

Results of a one-year study of network failures

You can see similar evidence of this trend in graphs of long-term BGP instability. The most stable time of the year, every year, is during the mandatory ISP network change freeze period around late December to mid-February.

BGP Instability

Now, things aren’t as bad as in the Wild West days of the Internet boom and subsequent bust. I have not heard Bob Metcalfe utter the word Gigalapse in at least five years, and you can actually get both security features and forwarding performance from your router vendor in the same release (well, unless you are running MPLS). But outages due to maintenance are still commonplace.

Some operators have a sense of humor about these outages: one US provider has an old nautical bell of shame that hangs in the cube of the last engineer to break the backbone. I’m told there is even a short ceremony, complete with a solemn ringing, that accompanies the transfer of the bell to the cube of the next hapless guilty engineer.

While ISP outages in the past may have been a source of grim humor, the stakes are changing. In particular, E911, VoIP, and the threat of FCC/EU regulatory reporting requirements change the nature of the game.

As an industry, we know how to achieve reliability in computer networks: never change the network. Demonstrating the success of this strategy, there are Tandem non-stop servers that have been running without incident (and without significant change) for the last fifteen years. The real challenge, of course, is how to balance the need for change with the need for reliability. This change management trade-off echoes a similar balance between security and usability.

In truth, network management and security are two sides of the same coin. Both require preparation and an a priori understanding of policies and network topology/performance. And both require situational awareness when things go wrong.

Despite the close linkage between security and network management, a recent Arbor survey shows most providers continue to maintain two separate groups. Often the divide between these groups is arbitrary, and, as a vendor, we’ve often served as mediator, bridging the chasm of mistrust and suspicion between these two teams in enterprise and ISP accounts.

I’m not trying to minimize the idea of worms, zombies and hackers as real threats; they are. You should be concerned. My point is more that the seemingly mundane terms MTTR, MTTF and change management policy deserve at least as much consideration as Sasser, Zotob and their ilk.
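For readers who want to make those terms concrete, here is a minimal sketch of how mean time to repair, mean time to failure, and the share of outage hours per cause might be computed from an outage log. Everything in it is an assumption for illustration: the `Outage` record, the `summarize` helper, the cause labels, and the sample numbers are made up and do not come from the 1997 study or any other data set.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Outage:
    cause: str    # hypothetical category label, e.g. "maintenance"
    hours: float  # time from failure to restored service


def summarize(outages, period_hours):
    """Return MTTR, MTTF, and each cause's share of total outage hours."""
    total_down = sum(o.hours for o in outages)
    if not outages or total_down <= 0:
        raise ValueError("need at least one outage with nonzero duration")

    # Accumulate outage hours per cause.
    by_cause = defaultdict(float)
    for o in outages:
        by_cause[o.cause] += o.hours

    return {
        # Mean time to repair: average downtime per outage.
        "mttr": total_down / len(outages),
        # Mean time to failure: average uptime between outages in the period.
        "mttf": (period_hours - total_down) / len(outages),
        # Fraction of all outage hours attributed to each cause.
        "share": {c: h / total_down for c, h in by_cause.items()},
    }


# Made-up records for illustration only; not data from any study.
log = [
    Outage("maintenance", 4.0),
    Outage("maintenance", 2.0),
    Outage("attack", 0.5),
    Outage("hardware", 1.5),
]
report = summarize(log, period_hours=720.0)
print(report)
```

Tracking the per-cause share alongside MTTR is a deliberate choice in this sketch: it is the figure that would show whether change-related outages or attacks dominate your own downtime.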
