Friday, December 18, 2015

Zero Touch Provisioning

Take a look at my Juniper Day One book on Zero Touch Provisioning (ZTP), which was just released this December.

http://www.juniper.net/us/en/training/jnbooks/day-one/networking-technologies-series/deploying-zero-touch-provisioning/

Day One books are great reads for the network engineer who is looking to pick up a new technology or learn how to configure something in a short period of time. They are designed to be read in one day (hence the name Day One) and give the reader enough information to take it from there.

http://www.juniper.net/dayone


-Scott

Thursday, December 17, 2015

UPGRADING AT THE PROVIDER EDGE

OVERVIEW

Let’s face it: occasionally my broadband will go down and my wife and daughters will go around making wild statements like “The Internet is down.” I just laugh at the thought, shrug it off, and go check Charter’s DNS. Occasionally the provider will be doing a software upgrade on the broadband modem, or there will be a real outage, in which case I enjoy the time off.

However, when it comes to critical users such as Department of Defense (DoD) Command and Control (C2) systems, or financial systems like the New York Stock Exchange (NYSE), we cannot simply shrug it off. These users, or customers, have to be alerted to possible outages so missions can be shifted and/or alternate means can be arranged to keep the “system” operational. We could spend all day on critical use cases for flying missions, ground missions, or even supporting special operations, but that is not the point of this white paper.

My experience in the DoD and in the Service Provider (SP) environment has made me very cautious when it comes to marketing terms such as Nonstop Software Upgrade (NSSU) or In-Service Software Upgrade (ISSU). This is the kind of marketing that will create a pucker factor of 5 in an SP environment or when supporting the NYSE. (The acceptable pucker scale is 0-5, just in case you were wondering.) The reason for the pucker factor scale is the fact that most Network Engineers (NEs) enjoy their jobs! Everyone has had that one upgrade where something went wrong or did not work as expected. The quiet and calm of the mid-shift suddenly breaks with one phone ringing, then 5, then 20, and before you know it your boss’s boss’s boss is talking to you directly, asking what you are doing to fix the problem that you created. (Sound familiar?)

I don’t know of anyone who wants to take that call from the Director or CEO asking why a router is down.

Now imagine you have no console access to that router/node/device, a truck roll (or even worse, a helicopter ride) is required, and it will take at least 2 hours to recover. In the financial world, millions can be lost in minutes, and in the DoD world you could be responsible for the lives of your neighbor’s sons or daughters, or even your own.


Okay, enough of painting a picture of just how critical it is to have a good game plan before an upgrade.

UPGRADING

In-Service Software Upgrade (ISSU) is marketed as hitless. I do not know of anyone who has ever seen a hitless upgrade. Maybe in a lab with nothing running across the device and no users, but I can’t remember one node upgrade that has ever been hitless, and let me tell you, I have upgraded a lot of nodes in my time! I have queried numerous friends in the network world and asked them the same thing, and the resounding reply is a big NO.

Now, with that being said, have I seen remote devices upgrade and come back on their own after initiating an ISSU? Yes! But it is never hitless; there is always a second or two lost or some packets dropped.
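If you want to put a number on just how “hitless” an ISSU really was, even a simple timestamped ping loop run across the node during the maintenance window will show you the gap. Below is a minimal Python sketch of that idea; the target address is a placeholder, the flags assume a Linux-style ping, and a real test would also watch routing adjacencies and application-level probes.

    #!/usr/bin/env python3
    # Crude "how hitless was it?" probe: one ping per second across the node
    # under upgrade, logging every second that gets no reply.
    # TARGET is a placeholder; -c/-W are Linux ping flags.
    import subprocess
    import time
    from datetime import datetime

    TARGET = "192.0.2.10"   # a host reached through the PE being upgraded

    lost = 0
    while True:
        ok = subprocess.run(["ping", "-c", "1", "-W", "1", TARGET],
                            capture_output=True).returncode == 0
        if not ok:
            lost += 1
            print(f"{datetime.now():%H:%M:%S}  no reply ({lost} lost so far)")
        time.sleep(1)

Leave it running through the window and the log tells you, to the second, how long the “hitless” upgrade actually hit you.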

Let’s take a simple network with single nodes at remote locations supporting single-homed users.


Assume that the PE on the left, which supports Customer Edge (CE) routers, directly attached users, and firewalls, will be upgraded at midnight, and assume that there is no remote access to that PE (really bad planning, by the way, since 4G wireless can make up for not having dial-up or a dedicated out-of-band network). Let’s look at the steps we need to take to upgrade this node with minimal downtime for our customers.

  1. DOCUMENT – exactly what you are going to do from start to finish, and then have all the stakeholders sign off on it to ensure they have read through it. If done properly, obstacles that were once unknown will come to light.
  2. TEST – in the lab, make sure you understand your own processes and procedures.
  3. COORDINATE – make sure that you have provided ample time to your customers to allow them to plan on their end. Often you will have to delay or reschedule because your customer has an important mission, critical backup, or other requirement that will push your date.
  4. COLLECT – all the data you possibly can prior to the upgrade. Capture all of the configurations, system health checks, hardware information, routing tables, interface statistics, etc. (a small scripting sketch for this follows after this list).
  5. PLAN – for the worst-case scenario. IF is the start of all logic, and logic will get you through the upgrade.
      • IF the upgrade goes south THEN
      • IF there is a hardware issue THEN
      • IF there is a power problem THEN
      • IF testing didn’t account for X THEN 
  6. PREPARE – “train like you fight” is my motto. I would run this through the lab (if you are lucky enough to have one) using the production configuration and the new image that will be deployed.
  7. RECOVERY PLANS – are very important. You want to make sure that if things do go wrong, and all the pre-planning and testing has failed, you have a way to back out of the upgrade. (This happens more often than one would think!)
  8. MAKE IT HAPPEN – you have done your due diligence with all the documentation, pre-planning, and testing, and IF you have done your job properly THEN you should be able to proceed with the upgrade.
  9. POST-TESTING – is very important and should be accomplished to verify that every interface that was operational prior to the upgrade is working once the upgrade is completed; the sketch after this list shows one way to automate that comparison. (There will be those customers who downed an interface or were trying to troubleshoot their connections, and those interfaces will need a personal touch to find out why they are down.)
  10. CLOSURE – once you have completed your upgrade and post-testing phases, it is still important to watch the logs for any new messages that may indicate problems. Upgraded operating systems often bring new messages and deeper information on the performance of hardware and connections that older versions did not provide. It’s important to capture and understand these new messages and determine whether there is any impact to the node or network.
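To make steps 4 (COLLECT) and 9 (POST-TESTING) a little more concrete, here is a minimal Python sketch of the snapshot-and-compare I like to script around a maintenance window. It is only a sketch: it assumes SSH key access to a Junos CLI, and the hostname, command list, and file names are placeholders for your own environment.

    #!/usr/bin/env python3
    # Pre/post upgrade snapshot helper (sketch, not a finished tool).
    # Assumes SSH key-based access to a Junos CLI; the hostname and
    # command list are placeholders for your own environment.
    import json
    import subprocess
    import sys

    PE_HOST = "pe1.example.net"            # placeholder hostname
    COMMANDS = [                           # the "COLLECT" list from step 4
        "show version",
        "show chassis hardware",
        "show configuration | display set",
        "show interfaces terse",
        "show route summary",
    ]

    def run(command):
        """Run one CLI command on the PE over SSH and return its output."""
        result = subprocess.run(["ssh", PE_HOST, command],
                                capture_output=True, text=True, timeout=60)
        return result.stdout

    def up_interfaces():
        """Return the set of interfaces whose admin and link status are both up."""
        ifaces = set()
        for line in run("show interfaces terse").splitlines():
            fields = line.split()
            if len(fields) >= 3 and fields[1] == "up" and fields[2] == "up":
                ifaces.add(fields[0])
        return ifaces

    if __name__ == "__main__":
        phase = sys.argv[1] if len(sys.argv) > 1 else "pre"

        # Step 4 (COLLECT): dump every command's output into a text file.
        with open(f"{PE_HOST}.{phase}.txt", "w") as fh:
            for cmd in COMMANDS:
                fh.write(f"### {cmd}\n{run(cmd)}\n")

        if phase == "pre":
            # Save the interface baseline for the post-upgrade comparison.
            with open(f"{PE_HOST}.baseline.json", "w") as fh:
                json.dump(sorted(up_interfaces()), fh)
        else:
            # Step 9 (POST-TESTING): anything up before but not up now
            # goes to the top of the call list.
            with open(f"{PE_HOST}.baseline.json") as fh:
                baseline = set(json.load(fh))
            missing = sorted(baseline - up_interfaces())
            print("Up before but not after:", missing if missing else "none")

Run it with “pre” before the window and again with “post” afterwards; anything that was up before the work but is not up now is exactly the list that needs the personal touch mentioned in step 9.
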
Now let’s discuss how the 10 steps outlined above would work if the users in a C2 or SP environment were dual-homed, or if there were multiple PE devices in a geographically separated location.





You would still want to perform all 10 steps above, but in step 8, “MAKE IT HAPPEN,” you would add a couple of sub-steps.

      8a. Overload OSPF to move transit traffic off the node (a short sketch of scripting this follows below)

      8b. Adjust your routing-protocol costs to force traffic to the sister node while you work on the node you are upgrading

      8c. Utilize any other network methods available to shift your customer/user traffic from the node being worked on to the sister node in that region.
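
For 8a and 8b, here is a minimal sketch of what costing a Junos PE out of the network can look like when scripted. The hostname and the interface/metric values are placeholders, and I am assuming SSH key access to a CLI that will accept commands piped over stdin; the "set protocols ospf overload" statement makes the node advertise its transit links at maximum metric so neighbors route transit traffic around it. Verify the equivalent knobs for your own platform and protocols before the window.

    #!/usr/bin/env python3
    # Sketch of sub-steps 8a/8b: cost the PE out of the network before the upgrade.
    # Assumes SSH key access to a Junos CLI that reads commands piped over stdin;
    # the hostname, interface, and metric value are placeholders.
    import subprocess
    import time

    PE_HOST = "pe1.example.net"   # the node being upgraded (placeholder)

    DRAIN_CONFIG = "\n".join([
        "configure",
        # 8a: advertise transit links at maximum metric so OSPF neighbors
        #     route transit traffic around this node.
        "set protocols ospf overload",
        # 8b (example only): raise the metric on a core-facing link so
        #     remaining traffic prefers the sister PE.
        "set protocols ospf area 0.0.0.0 interface ge-0/0/0.0 metric 65000",
        "commit and-quit",
        "exit",
        "",
    ])

    def send(host, text):
        """Pipe a block of CLI commands to the device over SSH."""
        done = subprocess.run(["ssh", host], input=text,
                              capture_output=True, text=True, timeout=120)
        return done.stdout

    if __name__ == "__main__":
        print(send(PE_HOST, DRAIN_CONFIG))
        # Give the IGP time to reconverge, then confirm the node is no
        # longer carrying transit traffic before touching any software.
        time.sleep(30)
        print(send(PE_HOST, "show ospf overview\nshow ospf neighbor\nexit\n"))

After the upgrade and post-testing, the same approach in reverse (delete the overload statement and restore the original metrics) moves traffic back, and it doubles as part of the recovery plan if the upgrade has to be abandoned.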





These steps will minimize the impact diameter of your upgrade.

You will almost always have that one user who is single-homed to a specific PE in that region, and you will have to make sure that they are a priority when you complete the upgrade.

You still have to perform the 10 steps, but critical customers like the firewall and CE devices in the figure above should not lose any communications during the upgrade of a single PE. A user that is single-homed to your network but also homed to another network would likewise keep communications during the PE’s outage window, provided they are aware of the outage.

SUMMARY

My personal experience in critical networks and in Service Provider environments is that multiple devices are used in a geographically separated region to provide redundancy and survivability of end-user communications. I would never rely on something that has the potential to go really bad, really fast! In all the environments I have worked in (DoD, Service Provider, Cloud Provider, and Financial), my main goal is to reduce that pucker factor down to a comfortable 0!