Uptime

by Lee LeClair
06/20/2008
As seen in Inside Tucson Business

As businesses have become dependent on their computers and network systems, “uptime” has become critical. No one can afford to be “down” these days because downtime directly affects customers and operations. So if you need your systems and networks to be up all the time, take the following into consideration: power, simplicity, redundancy, and continuity of operations. In this discussion, we are not assuming you are planning for disaster recovery, just very reliable uptime.

A first consideration is power, as all IT equipment depends on it. Having reliable power means taking the occasional power hiccup into account as well as being able to deal with longer outages. We are fortunate in Arizona to have pretty reliable power and not too many sources of environmental outage (tornado, hurricane, earthquake, flood, etc.). Nevertheless, we are not immune, and some of you may have suffered outages from trees blown over during a storm, vehicle accidents with power poles, and of course surges from lightning storms. Be sure that critical systems are identified and sit on fuse-protected Uninterruptible Power Supplies (UPS). These can be multiple small units or larger units that can handle multiple systems. Note that most small UPS units are only good for 10-15 minutes of load and that their performance degrades over time; their internal battery packs will need to be replaced in 3-5 years. These units need to be sized appropriately to your system loads, and every system that you want to stay up needs to be on one (network components too). While a UPS will deal with short outages, it will not provide any lengthy uptime. For that, you will need a generator of some type. Generators are available that run on diesel, gasoline, and natural gas. At this level, you may want to consult with an electrical power engineer or electrician to ensure your generator is sized correctly and all systems are connected correctly. Test it at least quarterly.
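
As a rough illustration of the sizing point above, here is a minimal Python sketch. Every figure in it (the device wattages, the 1000 VA rating, the 0.6 power factor, the 100 Wh battery, the 90% inverter efficiency, and the 80% headroom rule) is an assumption chosen for the example, not a vendor specification. Real runtime depends on the manufacturer's runtime curves, so treat this as a sanity check only.

```python
def ups_is_adequate(ups_va, power_factor, load_watts, headroom=0.8):
    """Return True if the load fits within the UPS's usable watt capacity.

    headroom is an assumed safety margin: only load the unit to 80%.
    """
    usable_watts = ups_va * power_factor * headroom
    return load_watts <= usable_watts


def ups_runtime_minutes(battery_wh, load_watts,
                        inverter_efficiency=0.9, battery_health=1.0):
    """Crude runtime estimate from battery energy and load.

    battery_health models aging: 1.0 is a new pack, 0.7 a worn one.
    """
    if load_watts <= 0:
        raise ValueError("load_watts must be positive")
    usable_wh = battery_wh * inverter_efficiency * battery_health
    return usable_wh / load_watts * 60


# Hypothetical equipment list, in watts. Network gear is included
# because it has to stay up too.
loads = {"server": 300, "switch": 60, "firewall": 40}
total_watts = sum(loads.values())

print("Total load (W):", total_watts)
print("Fits on 1000 VA unit:", ups_is_adequate(1000, 0.6, total_watts))
print("Runtime, new battery (min):", round(ups_runtime_minutes(100, total_watts), 1))
print("Runtime, aged battery (min):",
      round(ups_runtime_minutes(100, total_watts, battery_health=0.7), 1))
```

Running the estimate again with a lower battery_health value shows why aging battery packs matter: the same load gets noticeably fewer minutes.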

The next consideration is simplicity. Ensure your network and systems are designed to operate in a simple and understandable way, with clear and up-to-date documentation. This will help immensely with keeping the network and systems operational. When problems do occur, they will be much easier to troubleshoot if you stay away from overly complex designs. Also, if you have IT staff turnover, a simple design and good documentation make the system easier for newbies to understand during the transition. For campus area networks, go with a very simple layer 3 core and layer 2 only at the edges.

Now consider where you need redundancy. This will vary based on the size and criticality of various points in your network and systems. Adding redundancy can be easy (e.g., dual power supplies in various equipment), but it can complicate things as well (redundant network components in a load-balancing or failover configuration). Keep the “simple” rule in mind and add redundancy where it makes the most sense. Where you put redundant network systems in place, investigate the redundancy protocols being implemented (GLBP, HSRP, VRRP, etc.) and tune the timers to keep uptime as high as possible.
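
To show why timer tuning matters, the following Python sketch estimates how long a VRRP backup router waits before taking over. It assumes the VRRP version 2 formula from RFC 3768, where the master-down interval is three advertisement intervals plus a skew time of (256 - priority) / 256 seconds. The function name and the example values are illustrative only; GLBP and HSRP use different timers, and your equipment's documentation is the authority on what it supports.

```python
def vrrp_master_down_interval(advert_interval_s=1.0, priority=100):
    """Seconds a VRRPv2 backup waits without advertisements before
    declaring the master down (formula per RFC 3768).

    priority is the backup router's priority, 1-254.
    """
    if not 1 <= priority <= 254:
        raise ValueError("backup priority must be between 1 and 254")
    if advert_interval_s <= 0:
        raise ValueError("advert_interval_s must be positive")
    skew_time = (256 - priority) / 256
    return 3 * advert_interval_s + skew_time


# Compare a few hypothetical settings for the backup router.
for interval, priority in [(3, 100), (1, 100), (1, 200)]:
    wait = vrrp_master_down_interval(interval, priority)
    print(f"advert={interval}s priority={priority}: failover after about {wait:.2f}s")
```

Shorter advertisement intervals cut the failover delay, but they also make the pair more sensitive to brief congestion, so in keeping with the “simple” rule it is worth changing one timer at a time and testing the result.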

Finally, plan continuity of operations scenarios. Think through the most likely scenarios, prioritize what will need to be done, and write down the information you and your staff will need (power company phone number, hardware manufacturers, warranty info, etc.) as well as the procedures (e.g., system recovery procedures). Then schedule practice times and tests of these scenarios. You would not want to have to try to swim after having only read about it, right? In real downtime situations, stress will be high, so practiced checklist responses are the best way to get your systems back to where they need to be.

Lee LeClair is the CTO at Ephibian. His Tech Talk column appears the third week of each month in Inside Tucson Business.