The Cure for Network Downtime is Not Just Technology

Design and tune your network all you want. But if your company doesn’t also have a culture of high availability, your High Availability and Fast Convergence is not complete.

Note: This blog is a formatting cleanup and update of a blog I originally posted in 2011 on NetworkWorld.

You just finished watching a CiscoLive session from the online CiscoLive On Demand Library and now you want to run and start figuring out the alphabet soup of choices and decisions that is High Availability (HA) and Fast Convergence (FC) – NSR, NSF, GR, BFD, SSO…

It happens all the time, whether it comes from reading, classes, discussions with fellow engineers, or in my backyard in the Cisco Customer Proof of Concept lab (CPOC)… You take the proverbial magnifying glass, pair it up with your newfound knowledge, and proceed to give your network a good looking at while asking the question:

“What can be done with this network so that when a failure occurs the transition from failure to recovery happens as quickly as possible?” 

 

So once you figure that out for your network, and implement changes, you are done.  Right?  My opinion?  No, no, no and … uh… no.  You are only partially done.  Why?  Because getting to recovery is about more than just getting the network ready.   

It is about the engineers that touch the network, the engineers that configure the network, the engineers that support the network. It is about your company and/or department’s management and policies.  These are the “supporting layers” if you will, of your network, and therefore impact the overall health and readiness of your network during a failure.

My honest, strong, and personal view?  You need to take that proverbial magnifying glass to the entire picture above, every layer, and ask yourself:

“What can be done so that when a failure occurs the transition from failure to recovery happens as quickly as possible?” 

Just as examples:

  • Management/Policy: Which is rewarded more in your environment? – Those who put out the fires or those who work day in and day out trying to think about ways to prevent the fires from ever even happening? Translation – are you rewarding reactive behavior or proactive behavior?
  • Documentation of the Network: Is it a high priority for your management and within your network engineering team to keep your network documentation 100% up-to-date and easy to read and understand? Documentation that is easy for everyone to read and understand (including TAC, who has never seen your diagram before) will help you get from failure to recovery faster.  Being a Network Detective is much like being a detective. You need an accurate crime scene map (your diagram) and accurate documentation.
  • Single Points of Failure with the Network Engineers: Is there one person who really understands some part of your network? If he or she were out of pocket tomorrow, and a failure occurred that engineers had to get involved in to troubleshoot, would this single point of failure greatly delay your getting to recovery?

Why does all this matter? Because all of this impacts your company/department’s “culture of availability”. And that culture and view is going to translate into the availability of your network. Let me mention one more thing. Google the following: “network downtime human error”. Then just take it all in.

Now, how many times have you had a network operation failure that was caused by something that a human did? Or by something you didn’t expect to have happen and didn’t account for before? See that “action” box below? That gets bigger and bigger (delaying getting to recovery) whenever a human (or multiple humans) has to get involved and troubleshoot what is going on.

[Figure: the path from failure to recovery, including the “action” box]

The trick is to figure out ways to get this box smaller and smaller and ever smaller.

So your assignment is as follows… Look at all the failures you have had in your network in the past 24 months. How could you have prevented them? What could you have done to be better prepared for them? What could you have repaired faster? Do any similarities or trends come to mind? Could you have made that time between Failure and Recovery smaller with up-to-date and easier to understand diagrams? With better addressing documentation? Better network management? Better change management policies?
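If it helps to get started on that assignment, here is a minimal sketch in Python of one way to look for trends. Everything in it is an assumption for illustration: the incident entries, the cause categories, and the helper names `minutes_to_recover` and `summarize` are hypothetical, and you would swap in whatever your own ticketing or change management records actually contain.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical incident log: (failure time, recovery time, cause category).
# These entries are made up for illustration; replace them with your own records.
INCIDENTS = [
    ("2024-01-10 02:15", "2024-01-10 02:45", "human error"),
    ("2024-03-02 14:00", "2024-03-02 14:20", "hardware"),
    ("2024-06-21 09:30", "2024-06-21 11:30", "human error"),
    ("2024-09-05 23:10", "2024-09-05 23:40", "outdated documentation"),
]

TIME_FORMAT = "%Y-%m-%d %H:%M"


def minutes_to_recover(failed_at, recovered_at):
    """Minutes between Failure and Recovery for one incident."""
    start = datetime.strptime(failed_at, TIME_FORMAT)
    end = datetime.strptime(recovered_at, TIME_FORMAT)
    return (end - start).total_seconds() / 60


def summarize(incidents):
    """Return {cause: (incident_count, mean_minutes_to_recover)}."""
    durations = defaultdict(list)
    for failed_at, recovered_at, cause in incidents:
        durations[cause].append(minutes_to_recover(failed_at, recovered_at))
    return {
        cause: (len(values), sum(values) / len(values))
        for cause, values in durations.items()
    }


if __name__ == "__main__":
    # List the causes with the longest average time to recovery first.
    ranked = sorted(summarize(INCIDENTS).items(), key=lambda item: -item[1][1])
    for cause, (count, mean_minutes) in ranked:
        print(f"{cause}: {count} incident(s), mean {mean_minutes:.0f} min to recover")
```

Grouping by cause is only one possible cut. The same approach works for grouping by site, by device, or by whether the diagram was up to date at the time.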

High Availability and Fast Convergence is more than just looking at the physical network and tuning timers. The network relies on you and your team.
