Take a look at my Juniper Day One book on Zero Touch Provisioning (ZTP) that was just released in Dec.
http://www.juniper.net/us/en/training/jnbooks/day-one/networking-technologies-series/deploying-zero-touch-provisioning/
Day One books are great reads for the network engineer who is looking to pick up a new technology or learn how to configure something in a short period of time. They are designed to be read in one day (hence the name) and give the reader enough information to take it from there.
http://www.juniper.net/dayone
-Scott
Thursday, December 17, 2015
UPGRADING AT THE PROVIDER EDGE
OVERVIEW
Let’s face it: occasionally my broadband will go down and my wife and daughters will go around making wild statements like “The Internet is down.” I just laugh at the thought, shrug it off, and go check Charter’s DNS. Occasionally Charter will be doing a software upgrade on the broadband modem, or there will be a real outage, in which case I enjoy the time off.
However, when it comes to critical users such as Department of Defense (DoD) Command and Control (C2) systems, or financial systems like the New York Stock Exchange (NYSE), we cannot simply shrug it off. These users, or customers, have to be alerted to possible outages so missions can be shifted and/or alternate means can be arranged to keep the “system” operational. We could spend all day on critical use cases for flying missions, ground missions, or even supporting special operations, but that is not the point of this white paper.
My experience in the DoD and in the Service Provider (SP) environment has made me very cautious when it comes to marketing terms such as Nonstop Software Upgrade (NSSU) or In-Service Software Upgrade (ISSU). This is the kind of marketing that will create a pucker factor of 5 in an SP environment or when supporting the NYSE. (The accepted pucker scale runs 0-5, just in case you were wondering.) The reason for the pucker factor scale is the fact that most Network Engineers (NEs) enjoy their jobs! Everyone has had that one upgrade where something went wrong or did not work as expected. The quiet and calm of the mid-shift suddenly breaks with one phone ringing, then 5, then 20, and before you know it your boss's boss's boss is talking to you directly, asking what you are doing to fix the problem that you created. (Sound familiar?)
I don’t know of anyone who wants to take that call from the Director or CEO asking why a router is down.
Now imagine you have no console access to that router/node/device, a truck roll (or, even worse, a helicopter ride) is required, and it will take at least 2 hours to recover. In the financial world millions can be lost in minutes, and in the DoD world you could be responsible for the lives of your neighbor's sons or daughters, or even your own son or daughter.
Okay, enough of painting a picture of just how critical it
is to have a good game plan before an upgrade.
UPGRADING
In-Service Software Upgrade (ISSU) is marketed as hitless. I do not know of anyone who has ever seen a hitless upgrade. Maybe in a lab with nothing running across the device and no users, but I can't remember one node upgrade that has ever been hitless, and let me tell you, I have upgraded a lot of nodes in my time! I have queried numerous friends in the network world and asked them the same thing, and the resounding reply is a big NO.
Now, with that being said, have I seen remote devices get upgraded and come back on their own after initiating an ISSU? YES! But it is never hitless, as there is always a second or two lost or some packets dropped.
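As an illustration only, here is roughly what initiating an ISSU from a management host could look like, assuming Juniper's PyEZ library (junos-eznc); the hostname, credentials, and package path are placeholders, and this is a sketch rather than a procedure.

    # A rough sketch, not a recipe: kick off an ISSU on a remote Junos device
    # using Juniper's PyEZ library (junos-eznc). Host, credentials, and the
    # package path are placeholders, and ISSU itself is only supported on
    # certain platforms with redundant Routing Engines.
    from jnpr.junos import Device
    from jnpr.junos.utils.sw import SW

    dev = Device(host="pe-left.example.net", user="neteng", password="secret")
    dev.open()

    sw = SW(dev)
    # no_copy=True assumes the image is already sitting on the router;
    # issu=True asks for an in-service upgrade instead of a normal install.
    result = sw.install(package="/var/tmp/junos-install-new.tgz",
                        no_copy=True, issu=True)
    print("install result:", result)

    dev.close()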
Let’s take a simple network with single nodes at remote locations supporting single-homed users.
Assume that the PE on the left, supporting Customer Edge (CE) routers, directly attached users, and firewalls, will be upgraded at midnight, and assume that there is no remote access to that PE (really bad planning, BTW, as 4G wireless can make up for not having dial-up or a dedicated out-of-band network). Let’s look at the steps we need to take to upgrade this node with minimal downtime for our customers.
- DOCUMENT – exactly what you are going to do from start to finish, and then have all the stakeholders sign off on it to ensure they have read through it. If done properly, obstacles that were once unknown will come to light.
- TEST – in the lab, make sure you understand your own processes and procedures.
- COORDINATE – make sure that you have provided ample time to your customers to allow them to plan on their end. Often you will have to delay or reschedule because your customer has an important mission, critical backup or other requirement that will push your date.
- COLLECT – all the data you possibly can prior to the upgrade. Capture all of the configurations, system health checks, hardware information, routing tables, interface statistics, etc. (A rough sketch of this kind of collection script follows this list.)
- PLAN – on the worst-case scenario. IF is the start of all logic, and logic will get you through the upgrade:
- IF the upgrade goes south THEN
- IF there is a hardware issue THEN
- IF there is a power problem THEN
- IF testing didn’t account for X THEN
- PREPARE – train like you fight is my motto. I would run the upgrade through the lab (if you are lucky enough to have one) using the production configuration and the new image that will be deployed.
- RECOVERY PLANS – are very important. You want to make sure that if things do go wrong, and all the pre-planning and testing has failed, you have a way to back out of the upgrade. (This happens more often than one would think!)
- MAKE IT HAPPEN – you have done your due diligence with all the documentation, pre-planning, and testing, and IF you have done your job properly THEN you should be able to proceed with the upgrade.
- POST-TESTING – is very important and should be accomplished to verify that every interface that was operational prior to the upgrade is still working once the upgrade is completed. (There will be those customers who downed an interface or were in the middle of troubleshooting their connections, and those interfaces will need a personal touch to find out why they are down. A sketch of this pre/post interface comparison also follows this list.)
- CLOSURE – once you have completed your upgrade and post-testing phases, it is still important to watch the logs for any new messages that may indicate problems. Upgraded operating systems often bring new messages and deeper information on the performance of hardware and connections that older versions did not. It's important to capture and understand these new messages and whether there is any impact to the node or network.
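To make the COLLECT step a little more concrete, here is a minimal sketch of the kind of collection script I mean, assuming Juniper's PyEZ library (junos-eznc); the hostname, credentials, and the short command list are placeholders, not a complete health check.

    # A rough sketch of a pre-upgrade collection script using Juniper's PyEZ
    # library (junos-eznc). The hostname, credentials, and command list below
    # are placeholders, not a complete health check.
    import os
    from jnpr.junos import Device

    COMMANDS = [
        "show configuration | display set",
        "show chassis hardware",
        "show route summary",
        "show interfaces terse",
        "show system alarms",
    ]

    def collect(host, user, password, outdir="pre-upgrade"):
        os.makedirs(outdir, exist_ok=True)
        with Device(host=host, user=user, password=password) as dev:
            for cmd in COMMANDS:
                # Plain CLI text output so a human can read it later during recovery.
                text = dev.cli(cmd, warning=False)
                fname = cmd.replace(" ", "_").replace("|", "") + ".txt"
                with open(os.path.join(outdir, fname), "w") as f:
                    f.write(text)

    if __name__ == "__main__":
        collect("pe-left.example.net", "neteng", "secret")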
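And for the POST-TESTING step, a similarly rough sketch that flags any interface that was up before the upgrade but is down afterward; again, PyEZ is an assumption on my part, and the host and credentials are placeholders.

    # A rough sketch of the post-testing idea: compare which interfaces were up
    # before and after the upgrade and flag anything that went down. Assumes
    # PyEZ (junos-eznc); host and credentials are placeholders.
    from jnpr.junos import Device

    def up_interfaces(host, user, password):
        """Return the set of physical interfaces reporting oper status 'up'."""
        with Device(host=host, user=user, password=password) as dev:
            reply = dev.rpc.get_interface_information(terse=True)
            up = set()
            for phy in reply.findall(".//physical-interface"):
                if phy.findtext("oper-status", "").strip() == "up":
                    up.add(phy.findtext("name", "").strip())
            return up

    before = up_interfaces("pe-left.example.net", "neteng", "secret")
    # ... run the upgrade, wait for the node to come back, then check again ...
    after = up_interfaces("pe-left.example.net", "neteng", "secret")

    for ifname in sorted(before - after):
        # These are the interfaces that need the personal touch mentioned above.
        print("WAS UP, NOW DOWN:", ifname)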
Now let’s discuss how the 10 steps
outlined above would work if the users in a C2 or SP environment were
dual-homed or if there were more than one PE device in a geographically
separated location.
You would still want to perform all 10 steps above, but in step 8, “MAKE IT HAPPEN,” you would add a few sub-steps.
8a. Overload OSPF to move transit traffic from the node.
8b. Cost your routing protocols to force traffic to one node while you work on the node you are upgrading.
8c. Utilize any network methods to force your customer/user traffic from the node being worked on to the sister node in that region.
These steps will minimize the
impact diameter of your upgrade.
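On Junos, one way to implement 8a is the protocols ospf overload statement, which keeps the node in the topology but makes it unattractive for transit traffic. Below is a minimal sketch of pushing that change with PyEZ and a confirmed commit as a safety net; the hostname and credentials are placeholders, and costing interfaces for 8b would follow the same load-and-commit pattern with different set commands.

    # A rough sketch of draining transit traffic before the upgrade by setting
    # OSPF overload, pushed with PyEZ (junos-eznc) and protected by a confirmed
    # commit. Host and credentials are placeholders.
    from jnpr.junos import Device
    from jnpr.junos.utils.config import Config

    with Device(host="pe-left.example.net", user="neteng", password="secret") as dev:
        cu = Config(dev)
        cu.lock()
        cu.load("set protocols ospf overload", format="set")
        # confirm=10: if we lose the box before confirming, the change rolls
        # back on its own in 10 minutes and transit traffic returns.
        cu.commit(comment="pre-upgrade traffic drain", confirm=10)
        cu.unlock()
        # Once routing has reconverged and you are happy, a follow-up commit
        # confirms the change so the drain stays in place during the upgrade.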
You will almost always have that one user that is single-homed to a specific PE in that region, and you will have to make sure that they are a priority when you complete the upgrade.
You still have to perform the 10 steps, but critical customers like the firewall (FW) and CE devices in the figure above should not lose any communications during the upgrade of a single PE. And a user that is single-homed to your network but also homed to another network will, if made aware of the outage, still have communications during the PE's outage window.
SUMMARY
My personal experience in critical networks and in Service Provider environments is that multiple devices are used in a geographically separated region to provide redundancy and survivability of end-user communications. I would never rely on something that has the potential of going really bad, really fast! In all the environments I have worked in (DoD, Service Provider, Cloud Provider, and financial), my main goal is to reduce that pucker factor down to a comfortable 0!