Network Management
What it is and what it isn't.
By Douglas W. Stevenson
dstevenson@intracom.com Apr 1995
Table of Contents
- Introduction
- Functional Architecture
- Defining the Pieces
- Managed Objects
- Element Management Systems (EMS's)
- Manager of Managers Systems (MoM's)
- User Interface
- Management Functional Areas (MFAs)
- Fault Management
- Configuration Management
- Accounting
- Performance Management
- Security
- Common Implementations
- Management Focus
- The Right Implementation
- Business Case Requirements
- Definition
- System Focus
- Reporting of Trend Analysis
- Alarm Correlation
- Trouble Ticket Integration
- What Happens Now that I've Received an Alarm?
- Systems Automation
- Enabling Communications
- Building the Perfect Beast
- Management Functional Domains (MFD's)
- Building Requirements
- Questions to Ask
- Conclusion
Network Management as a term has many definitions dependent on whose operational
function is in question. It is the goal of this paper to illustrate and discuss
today's most common implementations of network management systems as they apply
to actual MIS form and function, to illustrate a "What's wrong with this
picture?" type of scenario, and then to discuss what the ideal system will look
like.
Network management systems have been in operation many years especially in
their own proprietary worlds such as Netview, AT&T Accumaster and Digital
Equipment Corporation's DMA. With the implementation of SNMP, local area and
wide area network components could be monitored and "managed". With
the vast amount of raw data available, most MIS Managers have no idea what they
really want because, in part, they don't know what's available. Additionally,
how does the data get into a format that actually means something? Other
communications systems are considered non-manageable because they are only
accessible by an RS-232 port and not by Netview or SNMP. Others tend to believe
that Network Management means nothing but the monitoring and management of
network architectural hardware such as Routers, bridges and concentrators --
nothing above the network layer of the OSI model is considered manageable.
What's alarming is that most Senior Network Engineers tend to be resigned to
spending thousands of dollars on hardware and software BEFORE the real requirements
are gathered and defined. Consequently, MIS departments either spend very little
on network management or they "go for broke" with the huge hardware
platforms and expensive artificial intelligence engines driving network
management for the company.
In today's environment of cost cutting and productivity enhancements, most
common network management implementations increase the number of people required
to support the MIS functions and these new people are senior level engineering
and support types; very expensive in most cases. Typical costs extend into the
hundreds of thousands of dollars purchasing hardware and software not to mention
the additional personnel.
Network management systems have to be geared toward the work flow of the
organization in which they will be utilized. As each MIS implementation is
geared toward the business requirements, so should the network management
system. If the management functionality does not directly or indirectly solve a
business problem, it is totally useless to the overall MIS department and to the
company.
Network management doesn't mean one application with a database with some
huge chunk of iron running the show. It is really an integrated
conglomeration of functions that may be on one machine but may span thousands of
miles, different support organizations and many machines and databases. It is
these functions that must be directly driven by the business case for each.
Network management systems have four basic levels of functionality. Each level
has a set of tasks defined to provide, format, or collect data necessary to
manage the objects. Figure 1 illustrates these four levels of functionality.

Figure 1
Managed Objects are the devices, systems and/or anything else requiring some
form of monitoring and management. Most implementations leave out the
"anything else" clause because they usually don't have the business
case requirements before the design, therefore they design as they go.
Some examples of managed objects include routers, concentrators, hosts,
servers and applications like Oracle, Microsoft SMS, Lotus Notes, and MS Mail.
The managed object does not have to be a piece of hardware but should rather be
depicted as a function provided on the network.
An EMS manages a specific portion of the network. For example, SunNet Manager, an
SNMP management application, is used to manage SNMP manageable elements. Element
Managers may manage async lines, multiplexers, PABX's, proprietary systems or an
application.
MoM systems integrate the information associated with several element
management systems, usually performing alarm correlation between EMS's. There
are several different products that fall into this category, including Boole
& Babbage's CommandPost, NyNEX AllLink, International Telematics MAXM, OSI
NetExpert and others.
The actual data to be collected comes from the managed object, in most cases.
This data is collected by the EMS systems, which in turn consolidate the data in
a database for processing and retrieval.
The user interface to the information, whether real time alarms and alerts or
trend analysis graphs and reports, is the principal piece to deploying a
successful system. If the information gathered cannot be distributed to the
whole MIS organization to keep people informed and to enable team
communications, the real purpose of a Network Management system is lost in the
implementation. Data doesn't mean anything if it is not used to make informed
decisions about the optimization of systems and functions.
These systems components are, in turn, mapped back to what are called
Management Functional Areas (MFAs). These MFAs are the wish list of areas on
which the management applications, as a system, focus their attention.
The most common framework depicted in Network management designs is centered
around the Open Systems Interconnect (OSI) "FCAPS" model of MFAs.
However most network management implementations do not really cover all of these
areas. Other areas that may be important to the MIS function and to specific
business units within the company may not be addressed at all.
FCAPS is an acronym explained as follows:
Fault Management
Configuration Management
Accounting
Performance Management
Security Management
Some of the other areas covered under Management Functional Areas include:
Chargeback
Systems Management
Cost Management
Fault management is the detection of a problem, fault isolation and correction
to normal operation. Most systems poll the managed objects, search for error
conditions and illustrate the problem in either a graphic format or a textual
message. Most of these types of messages are set up by the person configuring the
polling on the Element Management System. Some Element Management Systems
collect data directly from a log printer type output receiving the alarm as it
occurs.
Fault management deals most commonly with events and traps as they occur on
the network. Keep in mind, though, that using data reporting mechanisms to report
alarms or alerts is the best way to accomplish health checks of a specific managed
object's performance without having to double the amount of polling being
done.
Configuration management is probably the most important part of network
management in that you cannot accurately manage a network unless you can manage
the configuration of the network. Changes, additions and deletions from the
network need to be coordinated with the network management systems personnel.
Dynamic updating of the configuration needs to be accomplished periodically to
ensure the configuration is known.
The accounting function is usually left out of most implementations in that LAN
based systems are said to not promote accounting type functions until one gets
into the Hosts such as IBM Mainframe or Digital VAX's. Others rationalize that
accounting is a server specific function and should be managed by the System
administrators.
Performance is a key concern to most MIS support people. Although it is high on
the list, it is considered difficult to be factual about some LAN performance
issues unless employing RMON technology. (This is one of those examples of
throwing money at a problem.) Although RMON Pods are very useful, one should
carefully weigh what's pertinent to what can be accomplished in other ways
without having to spend a bundle.
Performance of Wide Area Network (WAN) links, telephone trunk utilization,
etc., are areas that must be revisited on a continuing basis as these are some
of the areas easiest to optimize and realize savings.
Systems or applications performance is another area in which optimization can
be accomplished but most network management applications don't address this in a
functional manner.
Most network management applications only address security applicable to network
hardware such as someone logging into a router or bridge. Some network
management systems have alarm detection and reporting capabilities as part of
physical security (contact closure, fire alarm interface, etc.). None really deal
with system security as this is a function of System administration (or so you
thought!).
- Chargeback
- Chargeback has been done for years in the large mainframe environments and
will continue to be accomplished as it is a way to charge the end user for
only the specific portion of the service that he or she uses. Chargeback on
Local Area Networks presents new challenges in that so many services are
provided. In many implementations, chargeback is accomplished on the
individual Server providing the service. While chargeback is very difficult
on broadcast based networks such as Ethernet, it is realizable on networks
that dynamically allocate bandwidth as the end users' needs dictate (ATM).
As technology associated with monitoring LAN and WAN networks evolves,
chargeback will be integrated into more and more systems.
- Systems Management
- Systems Management is the management and administration of services
provided on the network. A lot of implementations leave out this very
crucial part in that this is one of the areas in which Network Management
systems can show significant capabilities, streamline business processes,
and save the customer money with just a little work. There are many good
COTS products available to automate system administration functions and
these products can be integrated into the overall Network Management
system very easily.
- Cost Management
- Cost management is an avenue in which the reliability, operability and
maintainability of managed objects are addressed. This one function is an
enabler to upgrade equipment, delete unused services and tune the
functionality of the Servers to the services provided. By continuously
addressing the cost of maintenance, Mean Time Between Failure (MTBF), and
Mean Time To Repair (MTTR) statistics, costs associated with maintaining the
network as a system can be tuned. This area is an MFA that is driven by I/T
management to address getting the most performance from the money allocated.
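The MTBF and MTTR statistics mentioned under Cost Management can be derived from a simple outage log. The following is a minimal sketch, assuming MTBF is taken as total up time divided by the number of failures and MTTR as total down time divided by the number of failures; the `Outage` record and the sample figures are hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """One outage of a managed object, in hours since the start of the period."""
    start_hour: float
    end_hour: float


def mtbf_mttr(outages, period_hours):
    """Return (MTBF, MTTR) in hours for one managed object over the period."""
    if not outages:
        # No failures observed: the whole period counts as time between failures.
        return period_hours, 0.0
    downtime = sum(o.end_hour - o.start_hour for o in outages)
    uptime = period_hours - downtime
    return uptime / len(outages), downtime / len(outages)


# Hypothetical log for one server over a 30 day (720 hour) month.
log = [Outage(100.0, 102.0), Outage(400.0, 401.0), Outage(650.0, 653.0)]
mtbf, mttr = mtbf_mttr(log, period_hours=720.0)
print(f"MTBF {mtbf:.1f} h, MTTR {mttr:.1f} h")
```

Tracking these two numbers per managed object over several months is one way to see which equipment is becoming more expensive to keep than to replace.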
Most implementations of medium and large network management systems center
around a Network Management Center of some sort. From this location, all data is
sent and processed. While several EMS's are used to manage their specific areas,
all of the data comes back to the Manager of Managers application. Most fault
detection, isolation and troubleshooting is accomplished in the Network
Management Center and technicians dispatched when the problem has been analyzed
as far as possible. Several company locations may be involved in the overall
network spanning thousands of miles and around the globe. 
Figure 2
The management focus for this scenario is on the Network Management Center
driving the total operation. Detection, troubleshooting and dispatching are
accomplished from the NMC. This operational focus is a carry over from the old
Netview days in that the center of the picture was a huge IBM Mainframe that did
all of the work. If you don't have a Network Management Center today, consider
what it will cost not only for the hardware and software, but the people to
accomplish this and their level of expertise.
If you, as an MIS Manager, are looking at the benefits of network management to
reduce downtime and overall cost to your program, make sure that the business
case requirements drive the implementation and not the implementation drive the
business cases.
As a systems integrator, make sure the requirements are accomplished before
any implementation. When the requirements are put in place, it is your job as an
Engineer to make sure management is informed as to what each implementation
segment will cost along with what that capability brings to the overall MIS
function.
In today's world, any implementation must follow the business case associated
with what will be implemented. The implementation must solve a business problem
or increase efficiency of the current methods of accomplishing work while
reducing overall costs. If the solution doesn't save money while providing a
better service, it probably isn't worth accomplishing.
The hardest part of building a business case is the gathering of the
information. You must define the problem at hand in a general sense so that you
can look for specific problems network management can address in that area.
The developer of the business case must look at the current way each section
accomplishes its day to day work. The case for network management can be
made definite by documenting current work processes that may be automated by the
system as a whole. Each of the work processes to be automated needs to be
documented and addressed in the system design and implementation.
Look for ways to save the organization money. Keep working to make the MIS
organization, and the services it provides, more efficient.
Levels of Activity
There are four levels of activity that one must understand before applying
management to a specific service or device. These four levels of activity are as
follows:
- Inactive
- This is the case when no monitoring is being done and if you did receive
an alarm in this area, you would ignore it.
- Reactive
- This is where you react to a problem after it has occurred yet no
monitoring has been applied.
- Interactive
- This is where you are monitoring components but must interactively
troubleshoot to eliminate the side effect alarms and isolate to a root
cause.
- Proactive
- This is where you are monitoring components and the system provides a
root cause alarm for the problem at hand and automatic restoral processes
are in place where possible to minimize downtime.
These four levels of activity outline exactly how your support organization is
dealing with problems today and where you, as an MIS manager, want them to be in
terms of goals. Within the support organization are teams with different goals
and focuses (Unix support, desktop support, network support, etc.). Keep in
mind that while a specific alarm may warrant an inactive approach by one team,
to another team it may demand a proactive approach. Keep these goals in mind
when gathering requirements for network management.
Today's Implementations
Of the network management implementations done today, very few really address
the needs of the business. Most are implemented with good intentions but are
focused away from increasing efficiency.
In a multiple site network, there are technicians, engineers and support
personnel at each major location as required. No one knows those local
environments better than the people having to do the work. No one knows the
people of the organization better than the Help Desk staff as they are the first
line of communication between the people and the MIS support organization.
Network management elements are considered, among other things, tools in
which troubleshooting can be accomplished. The local support staff could benefit
greatly from the use of these systems as a tool. Even so, most implementations
give only read-only access to these systems. The ability to focus these tools at a
local level is paramount to increasing the effectiveness to the local support
staff. In some implementations, where read/write access is provided, it is
accomplished through X-Windows which doesn't work very well across low speed
links.
Most implementations focus these tools at a global level in that they are
located in the Network Command Center. When a trouble ticket is generated from
the NCC, it reflects a problem or symptom generated by the network management
elements and/or the Manager of Managers. Sometimes, the local technician cannot
relate to this symptom because he or she doesn't understand where this message
came from or why. Without access to the management element and familiarity with
the product, they usually start off problem isolation in a "cloud"
looking for the problem.
When a global problem occurs, in these scenarios, the information is
concentrated and orchestrated by the Network Command Center. Additionally an
outage can black out management of a geographic location by centralizing the
management resources. Figure 3 illustrates how this occurs.

Figure 3
As far as the Network Management Center is concerned, all of
the devices beyond the point of breakage are down. In fact, without alarm
correlation, all of the devices will be depicted as bad. Even with alarm
correlation, it can only be accomplished on one side of the link. No network
management capabilities exist at the remote site to help troubleshoot the
problem.
The ideal network management system should be designed and implemented around
the real work processes. It should focus the tools toward those staff members
supporting the managed area in a manner which makes their job easier and faster.
Information associated with a problem or symptom should mean something to the
support personnel. If they see the problem at a glance, they should know which
specific area that problem belongs to and what to do to get started in the trouble
isolation process. Other personnel in the organization should know that a
specific technician is looking into the problem as the problem may be affecting
other areas.
Help Desk personnel should know what is happening and who is working on what
at a glance. If they are not familiar with the system in question, they should
have adequate information at their fingertips to guide them in what to do, who
to call, and what steps to take, even what questions to ask.
Additionally, the problems that affect other sites should be available to
those personnel at a glance. The information must be at the fingertips of the
other sites' Help Desk personnel so that they know, in near real time, what is
going on.
Note how the information should be focused: local when it is a local problem
and global when it is a global problem. Also, the tools associated are more
focused on the local situation and not the global picture.
Figure 4 depicts a more distributed system providing global information with
local focus. In this system, alarms can be passed from site to site and even
around a problem with simple client-server database techniques.

Figure 4
In the scenario in figure 4, if a link breaks, local tools and
alarms are still available. Alarms concerning the overall health of other links
and connectivity can be passed to other sites, even around a problem. A SLIP or
PPP dial-up link between management elements can be used to pass
critical data about a link outage in near real time.
Network management across low speed wide area links doesn't really make
sense. Bandwidth of this type is costly compared to LAN bandwidth in that there
are the monthly charges for the links. Consider also that most WAN links are
interconnected by bridges or routers. On the back side of these devices are
networks capable of 10 Mbps, 16 Mbps or even 100 Mbps. On the link side you see
1.544 Mbps, 512 kbps or even 19.2 kbps links. Actual polling of network management
elements (SNMP) could consume these links, drastically reducing the operational
capabilities of the link. The question to ask is: do you want to increase the
bandwidth across these links just for network management, or do you want to
distribute the management polling to local area concentrations and just pass the
real alarm information?
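A rough back-of-the-envelope calculation makes the bandwidth question concrete. The sketch below is only an estimate under stated assumptions: the device count, the number of variables polled, the polling interval and the 200 bytes per polled variable (request plus response, including protocol overhead) are all hypothetical figures chosen for illustration, not measurements.

```python
def polling_load_bps(devices, variables_per_device, bytes_per_variable, interval_s):
    """Average bits per second generated by polling across the link."""
    total_bytes = devices * variables_per_device * bytes_per_variable
    return total_bytes * 8 / interval_s


def link_share(load_bps, link_bps):
    """Fraction of the link consumed by the polling load (1.0 means 100%)."""
    return load_bps / link_bps


# Hypothetical remote site: 50 devices, 20 variables each, polled every minute.
load = polling_load_bps(devices=50, variables_per_device=20,
                        bytes_per_variable=200, interval_s=60)

for name, rate in [("T1", 1_544_000), ("512 kbps", 512_000), ("19.2 kbps", 19_200)]:
    print(f"{name}: polling uses {link_share(load, rate):.1%} of the link")
```

With these assumed figures the polling is a small fraction of a T1 but more than a 19.2 kbps link can carry at all, which is the case for polling locally and passing only the real alarm information across the WAN.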
Trend analysis is usually a local function as one is looking for growth rates on
local hardware, applications and systems. Only when the Wide Area Network is
trended does the information require analysis between multiple sites. Even then,
local or remote changes can affect each other's environment.
The personnel that should be accomplishing the trending are the people
actually accomplishing the work; again no one knows the environment better than
those personnel. Reporting needs to be accomplished on an as needed basis
because each report needs to be in a format the local support personnel can
understand. Therefore, calculations must be available to simplify data in the
reports including averages, percentages and comparisons. Each type of report
needs to be customizable and easy to change.
Specific areas of reporting are very useful in looking at the overall
implementation. Network availability is an excellent method of looking at
specific areas when implemented at a low level, i.e., by object. There are
several methods in which this can be accomplished in ways that allow the IS
staff to effectively manage the assets.
Most availability reports concentrate only on seeing if the box is there for
a specific time period and then calculating the time not available back to the
total number of time units per the month. Sometimes averages of a few objects
are lumped together to produce a usable sum. The truth is, most of these types
of availability reports don't do anything constructive but pacify upper
management. If the data for availability focused instead on a weighted metric
depicting importance of the service provided and what was actually happening
during downtime, such as scheduled maintenance, unscheduled maintenance, lost
connectivity due to something else failing, definitive actions could be taken to
circumvent some of the problems. Effectively, network availability is an
excellent tool to "raise the flag" when a specific service is becoming
unreliable.
Most implementations use a network availability formula similar to the
formula shown in figure 5. This formula is usually geared toward specific
devices on the network or the availability of a trunk. Notice that the more
devices are added into the overall calculation, the more obscured the result
becomes: all devices are treated as being on the same level, and the more
devices are added into the overall average, the more hidden each one
becomes.
This is accomplished for each device, then averaged as a group.

Typical Method of Calculating Availability
Figure 5
Consider a Server that is plagued by problems and achieves an
actual availability of 20%. If 99 other devices are added into the calculation
with each of those achieving 100% availability, the real problem area is
obscured. The availability of a device or service is used to identify problem
areas so that they can be corrected. It is not meant to pacify management with
good, high numbers when the actual service that has been a problem is considered
100% available!
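The effect of averaging can be checked with a few lines of arithmetic. This is a minimal sketch, assuming per-device availability is up time divided by total time in the month, expressed as a percentage, and that the group figure is a plain average; the function names and the 720 hour month are illustrative assumptions.

```python
def availability(up_hours, total_hours):
    """Availability of one device as a percentage of the reporting period."""
    return 100.0 * up_hours / total_hours


def group_average(values):
    """Plain (unweighted) average, as used in the typical group report."""
    return sum(values) / len(values)


HOURS_IN_MONTH = 720.0

problem_server = availability(144.0, HOURS_IN_MONTH)            # up 20% of the month
healthy_devices = [availability(HOURS_IN_MONTH, HOURS_IN_MONTH)] * 99

group = group_average([problem_server] + healthy_devices)
print(f"problem server: {problem_server:.1f}%  group of 100: {group:.1f}%")
```

The group figure stays above 99% even though one server was down for four fifths of the month, which is why reporting by object, or by service, is more useful than a single averaged number.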
Another method of accomplishing availability is to gather a list of services,
provided on the network, by priority. Report on the availability of each of the
services on a monthly basis. Use a modifier or weighting on those services that
are considered more important to the organization. Telling management the truth
about the availability of services provides an avenue to correct those things
that are having problems and provide better services to the end user community.
In the formula in figure 6, one can see how specific services can be weighted
according to importance to the business units.

Example Method reported by Service
Figure 6
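One way to sketch reporting by service is a weighted average, where each service carries a weight reflecting its importance to the business units. The service names, availability figures and weights below are hypothetical, and the weighted average is an assumed form used for illustration rather than the exact formula of figure 6.

```python
# (service, availability in percent, weight); all values are illustrative.
SERVICES = [
    ("order entry database", 92.0, 5),
    ("electronic mail", 99.0, 3),
    ("print server", 100.0, 1),
]


def weighted_availability(rows):
    """Average availability where important services count for more."""
    total_weight = sum(weight for _, _, weight in rows)
    return sum(avail * weight for _, avail, weight in rows) / total_weight


def plain_average(rows):
    """Unweighted average, shown only for comparison."""
    return sum(avail for _, avail, _ in rows) / len(rows)


for name, avail, weight in SERVICES:
    print(f"{name:22s} {avail:6.1f}%  weight {weight}")
print(f"weighted: {weighted_availability(SERVICES):.1f}%  "
      f"plain: {plain_average(SERVICES):.1f}%")
```

Because the most heavily weighted service is also the least available one in this example, the weighted figure comes out lower than the plain average and points attention at the service that matters most.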
Response Times Reporting
The response time associated with specific network services is really important
to the level of service the end user receives. Response time across the network
also affects how well certain protocols and interfaces perform such as NFS,
X-Windows and Client/Server implementations using RPC mechanisms.
LAN/WAN
One of the big misconceptions about Routers is that if you have a T1 link (1.544
Mbps) attached to an interface, you can actually sustain the full link rate in data
throughput. Routers never really utilize a link to 100%; 70 to 80% is
a better figure. When the load offered to the link goes up, actual utilization
does not. The response time does, however, along with buffer utilization. By
monitoring the actual utilization and correlating this data back to buffer
utilizations and the response times across the interface, one can derive a much
more informed picture of the actual link utilization.
Another misconception in measuring response time is the use of ICMP ping
statistics. Because ICMP echo requests and responses are probably dead last on
the priority in which protocols are serviced on most boxes, the data collected
through pings may or may not be accurate dependent upon how busy the device was
at that particular instant in time. A much more accurate method of collecting
valid response time data is using SunNet Manager's proxy MIB "ippath"
or using traceroute which is available in the public domain.
Conversely, one can monitor ICMP Source Quenches to see if the interface is
being flooded or the system cannot respond quickly enough for the data coming
in. This specific problem is common to Unix Servers that do not have enough swap
space or are sized too small for the application services they provide.
Some RMON devices can provide statistics on the interpacket delay between two
nodes on the network. This is especially handy when monitoring protocols other
than IP such as Novell's IPX/SPX.
Routers are an excellent source of echo response data provided one can script
through the process with either a console port attachment or via Telnet. For
example, Cisco routers can ping a device using the Appletalk protocol.
SNA/Netview
Response time measurements have been an important feature in monitoring the
health of SNA networks for years. Not only could terminal to host response times
be monitored; application response times, DASD (disk drive) response times and
host to host response times could be monitored and reported as well.
Electronic Mail
Electronic mail typically uses a store and forward methodology to exchange data
across the network. Additionally, many implementations use gateways between
disparate mail systems so that end users may exchange mail across computing
environments. The ability to measure the time taken to send a message across a
system or gateway is very important to measuring the health and status of the
electronic mail as a total system. There are third party systems being marketed
today that accomplish just this task, like Baranoff Mailcheck.
Applications
Some applications have audit trails associated with them to allow someone to
monitor performance and response time. These applications, like Oracle, Sybase,
Informix, keep transaction tables that can be parsed and used to measure
performance.
There are applications available today that will monitor application
performance on the Server. These applications typically provide an avenue to
monitor an application's performance on a server and report problems.
Additionally, they organize the available data associated with the actual
resource utilizations so that systems personnel can keep the service at an
optimum performance level.
Network Utilization Reporting
What about network utilization reports? Most network management systems,
especially SNMP managers, take one MIB variable and plot the delta. Who ever
thought of comparing an overall link utilization with the types of protocols and
errors occurring over the same link? Network utilization reports let the local
personnel plan for capacity of systems, links and segments. Networks can be
optimized readily from the data provided in utilization type reports. All the
data in the world isn't any good unless you can compare it to other elements as
required. Furthermore, these reports need to be accomplished on a local level so
that "what if" type scenarios can be run for best results.
Network utilization can be measured from SNMP based managed objects using the
MIB-II interface counters (ifInOctets and ifOutOctets in the interfaces table)
of a router, bridge or concentrator. These
types of interfaces are usually considered promiscuous in that they listen for
all packets regardless of destination.
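Turning two counter samples into a utilization percentage is a small calculation. The sketch below assumes the MIB-II octet counters (ifInOctets or ifOutOctets, which are 32-bit counters) sampled at a known interval, and the interface speed as reported by ifSpeed; the function names and the sample values are hypothetical, and the wrap handling assumes the counter wrapped at most once between samples.

```python
COUNTER32_MODULUS = 2 ** 32


def counter_delta(previous, current):
    """Difference between two Counter32 samples, allowing for a single wrap."""
    return (current - previous) % COUNTER32_MODULUS


def utilization_percent(prev_octets, curr_octets, interval_s, if_speed_bps):
    """Utilization of one direction of an interface over the sample interval."""
    bits = counter_delta(prev_octets, curr_octets) * 8
    return 100.0 * bits / (interval_s * if_speed_bps)


# Hypothetical samples taken 60 seconds apart on a T1 interface.
T1_BPS = 1_544_000
before, after = 1_000, 1_000 + 5_790_000
print(f"inbound utilization: "
      f"{utilization_percent(before, after, 60, T1_BPS):.1f}%")
```

Computing the inbound and outbound directions separately, and storing the results next to error and protocol counts for the same interval, gives the kind of side-by-side comparison that a single plotted delta does not.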
Using RMON Pods, one can get excellent information concerning the utilization
of the network they are attached to. Remember though, that any device that
performs bridging or routing will effectively block utilization measurements
without deploying a Pod on that specific segment. Statistics such as traffic by
protocol, by node address and connection lists enable analysis of the traffic on
the segment in a very detailed fashion.
While implementing a response time measurement on a LAN or WAN, it is very
smart to check the accuracy of the information you are gathering. Use a good
protocol analyzer such as a Network General Expert Sniffer or H-P LAN Probe.
On Wide Area Networks, some utilizations can be accomplished on some devices,
usually only for devices that dynamically allocate bandwidth as required. Some
high end multiplexers can provide this data. ATM Switches and Hubs definitely
can provide this data usually through the ATM MIB or through an Enterprise MIB
associated with the device itself.
Telephone trunk utilizations are available through most Switch and PABX
vendors although not usually using SNMP. Most have a terminal interface from
which the data can be polled. Some implementations use a Call Accounting system
to record detailed utilizations of the telephone trunks and stations.
Alarms and Alerts
What about the reporting of real time alarms and alerts? These need to be
processed on a near real time basis. The data needs to be disseminated as fast
as possible to the concerned parties in a meaningful manner. The Help Desk is
usually the best place to send these alerts but the problem is that the
"Some variable = 0" type message doesn't mean anything to that Help
Desk person -- unless you are using experts on your Help Desk! The cryptic data
needs to be converted to a format Help Desk personnel can understand. Second,
what does the Help Desk person do once a message is received? The Help Desk
person may not know about Unix or Windows NT or a specific network component.
The network management application must place, at their fingertips, a list of
processes to be accomplished once an alarm has been displayed. Information such
as who to call, procedures to accomplish, who to page, needs to be available at
their desktop to effectively track a problem through. Remember, if a Help Desk
person doesn't know what to do, they could spend the next few critical minutes
trying to find out where to start. This time is dead or non-productive time and
should be eliminated if at all possible. If a Help Desk person receives a
symptom via the telephone and has to return the call, it costs the company
10-20 minutes on every occurrence.
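The translation from a cryptic message to something the Help Desk can act on can be as simple as a lookup table. The following is a minimal sketch of that idea; the device name, the raw message format, the contacts and the steps are all hypothetical, and a real system would keep these entries in a shared database rather than in code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeEntry:
    meaning: str        # plain-language description for the Help Desk
    first_steps: tuple  # procedures to accomplish, in order
    call: str           # who to call
    page: str           # who to page during off hours


# Hypothetical entries keyed by (device, raw message); illustrative only.
KNOWLEDGE_BASE = {
    ("hq-router", "ifOperStatus.3 = 2"): KnowledgeEntry(
        meaning="The WAN interface on the headquarters router is down.",
        first_steps=("Check whether other sites report the same link.",
                     "Open a trouble ticket against the WAN circuit."),
        call="Network support",
        page="On-call network engineer",
    ),
}


def help_desk_text(device, raw_message):
    """Turn a raw alarm into text a non-expert can act on."""
    entry = KNOWLEDGE_BASE.get((device, raw_message))
    if entry is None:
        return (f"{device}: {raw_message}\n"
                "No knowledge base entry; escalate to the on-call engineer.")
    steps = "\n".join(f"  {n}. {step}"
                      for n, step in enumerate(entry.first_steps, start=1))
    return (f"{device}: {entry.meaning}\n"
            f"Steps:\n{steps}\n"
            f"Call: {entry.call}\n"
            f"Page: {entry.page}")


print(help_desk_text("hq-router", "ifOperStatus.3 = 2"))
```

Because the output is plain text, the same message can go to a desktop display, a VT-100 session or a display pager without change.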
It is through this "Knowledge Base" that Mean Time To Repair (MTTR)
cycles get more efficient. Think about it; a problem is detected faster, a Help
Desk person sees the alarm and starts the diagnostic process, then dispatches
the technician with enough information to know the most probable cause (what
parts to take!) of the problem.
The actual alarm display needs to be simple and informative. By focusing
these messages away from graphical depiction, distribution of the information is
made much simpler -- and faster. Textual messages can even be displayed easily
on a VT-100 terminal dialed into a terminal server. Another example is to pass
critical alarms to a display pager, especially during off hours or weekends.
Alarm correlation is the process by which several alarms are narrowed from a
mass of problems to a root cause and side effects. Most software vendors for
network management systems sell artificial intelligence based inference engines
to correlate the alarms to a most probable cause -- some even produce a
percentage of probability on which device is causing the problem! Is this really
necessary? The data associated with these inference engines are based on the
relationships between components as illustrated in figure 4. When you analyze
what the inference engine is doing, you quickly realize that maybe all the
artificial intelligence really isn't necessary. Figure 5 illustrates how to
accomplish the same task using simple database relationships -- minus the
percentages calculation on which device is causing the problem and minus the
serious horsepower associated with deriving this calculation! That is something
the on-site engineer has an idea of already -- once he's pointed in the right
direction.
Alarm correlation is good in that it narrows the possibilities to a common
denominator. Once alarm correlation is accomplished, other tasks can take
place automatically such as auto-generation of a Trouble Ticket or technician
paging. Even auto healing mechanisms can be initiated once alarm correlation has
occurred; for example, a redundant circuit could be brought on line while the
defective link is placed in standby.

Figure 7
In figure 7, if the T1 link goes down, all systems behind it
are considered down. When the element managers for each of the devices report
alarms, alarm correlation analyzes the relationship between all of the alarms
and deduces a most probable cause. This is based on, most likely, a rules based
inference engine, analyzing the relationships between the alarmed entities.
If true artificial intelligence is to be applied, most implementations leave
out significant information pertinent to proper correlation. Most artificial
intelligence applications deal specifically with two types of data: rules based
information and heuristic information. Rules based information depicts entity
relationships and how those entities interact with each other. Most rules
tables are static in nature, in that someone must enter the information
associated with the relationships. The second type, heuristic information, is
the dynamic information derived from previous conditions that have occurred.
This same relationship can be represented in a database much more simply than
in the artificial intelligence based solution. The artificial intelligence based
solution will provide a method of calculating, on a percentage basis, the most
probable cause of the root alarm. Root alarms are those raised by a managed
object that actually has something wrong. A side effect alarm is one where the
alarm is caused by a failure external to the managed object. In figure 5, a
failure on the T1 link actually reports alarms as follows:
T1 Link - Root Cause
Router - Side Effect
Video Codec - Side Effect
PBX - Side Effect
The database table could be set up in the following manner:
Parent          Sibling     Managed Object   Address   Location   etc.
T1 Link                     Multiplexer 1    0         XYZ
T1 Link                     Multiplexer 2    0         ABC
Multiplexer 1   Serial1     Router 1         1.1.1.1   XYZ
Multiplexer 1   Port5       VC 1             1.1.1.2   XYZ        Video Codec
Multiplexer 1   card 25-1   PBX 1            1.1.1.3   XYZ        ACME PBX
By searching through a configuration table such as the one above, you can see
how easy alarm correlation really is. By building these relationships and
relating a table of active alarms back to the relationships between managed
objects, it is relatively easy to narrow down to a common denominator. Simply
parsing through the table looking for the highest point in the parent - child
relationship yields the same result as the AI inference engine, in a lot
less time but minus the probability of failure calculation.
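The table search described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a product implementation: the parent relationships are copied from the configuration table, while the function names and the in-memory dictionary (standing in for a database table) are hypothetical.

```python
# Parent of each managed object, copied from the configuration table above.
PARENT = {
    "Multiplexer 1": "T1 Link",
    "Multiplexer 2": "T1 Link",
    "Router 1": "Multiplexer 1",
    "VC 1": "Multiplexer 1",
    "PBX 1": "Multiplexer 1",
}


def has_alarmed_ancestor(obj, alarmed):
    """Walk up the parent chain; True if any ancestor is also in alarm."""
    seen = set()                      # guards against a loop in bad table data
    parent = PARENT.get(obj)
    while parent is not None and parent not in seen:
        if parent in alarmed:
            return True
        seen.add(parent)
        parent = PARENT.get(parent)
    return False


def correlate(active_alarms):
    """Split active alarms into (root causes, side effects).

    A root cause is an alarmed object with no alarmed ancestor, i.e. the
    highest alarmed point in the parent - child relationship.
    """
    alarmed = set(active_alarms)
    roots = sorted(o for o in alarmed if not has_alarmed_ancestor(o, alarmed))
    side_effects = sorted(alarmed - set(roots))
    return roots, side_effects
```

Given alarms on the T1 Link, Router 1, VC 1 and PBX 1, `correlate` reports the T1 Link as the root cause and the other three as side effects, even though Multiplexer 1 itself raised no alarm.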
Heuristic information can also be derived, provided the system has at least
some access to alarm or symptom histories.
Help Desk Integration
The Help desk is the key to any service based organization. They are the direct
line to users having problems, tracking problems through to completion and
coordinating activities with the user community. As such, the information
associated with network alarms and alerts needs to be distributed to them in a
language they can understand. Translation of cryptic messages such as "link
operationalStatus = 0" to "interface X on device Y went down" is
mandatory. They, above all other sections of an MIS organization,
need real time, pertinent information concerning problems, alerts and alarms.
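One way to picture this translation step is a lookup from (variable, value) pairs to plain-language templates. The sketch below is only that: the variable name and the meaning of its values follow the example in the text, the table contents are hypothetical, and a real MIB may encode status differently.

```python
# Hypothetical mapping from (variable, value) to a plain-language template.
TEMPLATES = {
    ("operationalStatus", 0): "interface {interface} on device {device} went down",
    ("operationalStatus", 1): "interface {interface} on device {device} came back up",
}


def translate(device, interface, variable, value):
    """Turn a raw variable reading into a message the Help Desk can act on."""
    template = TEMPLATES.get((variable, value))
    if template is None:
        # Unknown readings are passed through rather than silently dropped.
        return f"{device} {interface}: {variable} = {value} (no translation on file)"
    return template.format(device=device, interface=interface)
```

Readings with no template are still shown, flagged as untranslated, so that gaps in the table become visible instead of hiding alarms.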
Many network management systems in operation today do nothing to pass
information to the Help Desk - unless Engineering types are manning the Help
Desk. This is where these applications really miss the boat in that they have
been written by programmers and engineers without looking at the business case.
Some of the programs were even written by programmers that have never had to
support a network or so it seems. The real business case is that you want the
Help Desk personnel to be well informed and have helpful information at their
fingertips. When the actual work process flow is documented, one easily sees
that key processes are handled by the Help Desk. The more informed they are, the
less time it takes to get a problem on its way to resolution. If they have to
find out what's going on and call the user back, the time from detection of a
problem to dispatch of a technician increases dramatically.
The overall key to success in the operation of an MIS department is not to
hire expensive high level engineers to accomplish the work. People are more
motivated when they are hired and trained within the organization. This is also
the most cost effective approach if the expertise of the organization is
distributed to those lacking specific knowledge in those areas. Building a knowledge base of
symptoms and the tasks associated with finding and correcting those problems
just makes good common sense.
The knowledge base places clear, definitive tasks at the fingertips of the
Help Desk person to get the ball rolling: check certain things, call this
technician, page that person, or ask questions to gather information.
By the process of elimination, a list of probable causes can be narrowed to a
single probable cause just by looking at a couple of things and asking the right
questions.
Building this knowledge base and deploying it throughout the organization
enables new personnel to be productive from day one. Furthermore, it takes the
knowledge of all (i.e. Desktop support, Server Support, Database Support,
Network Support, Unix Systems Support, etc.), collects that information in a
process flow format, and distributes it to all concerned.
Once a problem has been detected and the ball is rolling on getting the problem
owned by a Help Desk technician, a trouble ticket needs to be initiated. This is
vital in that it allows MIS organizations to monitor the type of work being
accomplished and by whom. It is also a key function in gathering the necessary
information to calculate the cost of maintenance. By knowing your costs, you can
work to get the costs down.
Data such as the number of specific models of hard drives or video cards that
have been repaired or replaced over the last month, quarter or year allows the
MIS Manager to weed out those devices that cost too much to repair.
Analyses of this sort typically drive the cost of maintenance down by more than
20%. Because of the rollover of technology, these things need to be monitored:
it may be more economically feasible to replace a whole desktop computer
than to have a hard drive controller replaced. Best of all, the end user feels
as if they are being taken care of. Consider this: the customer is happy because
the service is focused toward them, and money is saved because it costs less to
replace that aging old box that kept breaking.
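As a rough illustration of this kind of analysis, the sketch below counts repairs and replacements per model from closed trouble tickets. The ticket fields ("model", "action") and the cutoff are assumptions made for the example; a real trouble ticket system will have its own schema.

```python
from collections import Counter


def repairs_by_model(tickets):
    """Count repaired or replaced items per model over the tickets given."""
    return Counter(
        t["model"] for t in tickets if t["action"] in ("repaired", "replaced")
    )


def costly_models(tickets, limit):
    """Models with more than `limit` repairs or replacements in the period."""
    counts = repairs_by_model(tickets)
    return sorted(model for model, n in counts.items() if n > limit)
```

Feeding it one month, quarter or year of tickets at a time gives the per-period counts mentioned above.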
The ability to track the workload by department is an excellent tool for
management to analyze the number of personnel by skill and match the
technicians to the work at hand. The Trouble Ticket application, if integrated
with network management, provides an easy flow of work and information in
tracking problems from start to analysis after the fact. The trouble ticket must
integrate well into the way the people accomplish work. Focus on the business
case and the work flow process.
Some trouble ticketing systems allow the technician to check inventory for a
specific part while on line, generate an overnight shipping label or
automatically flag an item that is low in inventory.
Trouble ticketing systems must have the ability to track Warranty and
maintenance administration information in an easy to use method. So many
organizations buy new equipment but do not track the Warranty information until
someone raises the flag that a maintenance contract is needed on the specific
type of device. If maintenance contracts do not start when warranty ends,
additional charges can be expected. All of these additional costs, lost time in
getting a part plus the additional 10 to 20% for maintenance contract penalties,
add up to money thrown away.
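A simple check for this gap might look like the following sketch. The record fields ("name", "warranty_end", "contract_start") and the 60 day lead time are assumptions chosen for illustration.

```python
from datetime import date


def needs_contract(devices, today, lead_days=60):
    """Return names of devices whose warranty ends within lead_days (or has
    already ended) and that have no maintenance contract starting by then."""
    flagged = []
    for device in devices:
        days_left = (device["warranty_end"] - today).days
        start = device["contract_start"]   # a date, or None when no contract exists
        uncovered = start is None or start > device["warranty_end"]
        if days_left <= lead_days and uncovered:
            flagged.append(device["name"])
    return flagged
```

Run on a schedule, a report like this raises the flag before the warranty lapses rather than after.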
Once an alarm has been received, there are several steps required to correct the
problem associated with the alarm or symptom. Each alarm received should look
like a real symptom that makes sense to the user community... not just "something
is down because some variable equals 0". Figure 8 depicts a common process flow
diagram for receiving and correcting problems.

Figure 8
Processes that take an inordinate amount of time to accomplish need to be
analyzed for automation and fitted into the overall application. Tasks
where support personnel check to see if an event happened need to be looked at
very closely to see if this event can be flagged and sent as an alert to the
overall application. In this manner, dead time, such as time spent just
seeing if something has happened or if something is still working, can be
eliminated. The Network Management System as a whole must address these types
of needs: it must be easy to add new types of element management
functions quickly without having to rebuild the whole system every time.
One example is an MIS department that had one person spending around five
hours a day checking electronic mail connectivity across Microsoft Mail and
various gateways to other types of mail systems, such as SMTP, X.400, Profs,
All-in-1, and CC:Mail. Wouldn't this type of work flow problem be solved easily
by building an Electronic Mail poller that sent messages to echo type mailboxes
across the various systems? By polling across the systems, response time and
connectivity could be checked in an automated fashion. If the data associated
with this system were forwarded and parsed into the Network Management
application, the Electronic Mail Support person could be freed up to accomplish
other tasks associated with his or her department. Only when a problem was
found would that person need to get involved.
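The control loop of such a poller is small. The sketch below deliberately leaves out the mail transport: each probe is a hypothetical callable that sends a test message to an echo mailbox and returns True when the echo comes back. Only the timing and alerting logic is shown.

```python
import time


def poll_gateways(probes, max_seconds, clock=time.monotonic):
    """Run each probe once and return alert strings for problems only.

    probes: dict mapping a gateway name to a callable (hypothetical) that
    sends a test message and returns True when the echo is received.
    """
    alerts = []
    for name, probe in probes.items():
        started = clock()
        try:
            echoed = probe()
        except Exception as exc:          # a failing probe must not stop the loop
            alerts.append(f"{name}: probe failed ({exc})")
            continue
        elapsed = clock() - started
        if not echoed:
            alerts.append(f"{name}: no echo received")
        elif elapsed > max_seconds:
            alerts.append(f"{name}: echo took {elapsed:.1f}s (limit {max_seconds}s)")
    return alerts
```

An empty result means every gateway answered in time, so nobody needs to look. Anything else is forwarded to the Network Management application as an alert.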
In general, though, these requirements need to be driven by the actual work
flow processes currently in place, with the aim of saving time and money by
shortening these processes.
When a system is deployed across multiple sites and multiple organizations,
communications between the various workgroups enables planning, maintenance and,
best of all, knowledge, to be shared across the organization. Tools that enable
people to express ideas, work out solutions as a group, or just ask questions
from users' desktops are badly needed. These types of tools, commonly
referred to as Groupware, enable people to promote team building skills... no
matter where they are located physically. It is a known fact that people work
better when they feel as though they belong to a team.
Groupware tools such as Group Sketch or Whiteboarding, Group chat,
Brainstorming, Group postit notes, group editing and the like really add to
the ways people can interact. The exchange of ideas and information across
departments, sites and countries tends to get the whole organization working
together.
Now that we've been over some of the business cases on how an ideal network
management application should be implemented, let's put the pieces together.

Figure 9

User Interface
Figure 10
Management Functional Domains (MFD's) are the segmentation of the Enterprise
Network Management System into localized functional domains. The grouping of
functions within specific domains allows alarm messages to be routed around
problems or faults especially when multiple paths exist. Furthermore, automated
SLIP or PPP sessions will enable alarm passing through dialup lines.
Not just alarm messages need to be passed to other affected MFD's. Alarm
correlation information and automatic diagnostics are examples of other
information relative to a fault that provide a better picture of what's really
happening on the other end.

Figure 11

Figure 12

Figure 13
In the above three examples, each of the sites or MFD's sees
an alarm on the link and several alarms on the other side of the link. This is
because the link fault is the root cause and all the rest of the alarms are side
effects. By being able to validate the alarms across a broken link, one can
quickly and efficiently determine the root cause. CPU utilization associated
with correlating the alarms is very low compared to the AI Inference engine
based Alarm correlation. One simply looks for alarms that are common to both
sides.
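Looking for alarms common to both sides amounts to a set intersection. The sketch below assumes each MFD can hand over its list of alarmed object names and that both sides name the shared link identically; the object names in the usage note are made up.

```python
def common_alarms(site_a_alarms, site_b_alarms):
    """Objects alarmed on both sides of the link: the root cause candidates."""
    return sorted(set(site_a_alarms) & set(site_b_alarms))


def one_sided_alarms(site_a_alarms, site_b_alarms):
    """Alarms seen by only one side: the likely side effects."""
    return sorted(set(site_a_alarms) ^ set(site_b_alarms))
```

If site A reports the T1 Link and Router B, while site B reports the T1 Link, Router A and PBX A, the only alarm common to both is the T1 Link.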

Figure 14
Following is a list of steps to take to develop a requirements matrix
for the management of network components and functions.
- Develop a list of information attainable from each managed object.
  Describe in detail each piece of information, such as what the data element
  is, average versus actual, counter, raw integer or a text message.
- Take the list to the Support organization responsible for that device
  function and have them decide what's pertinent to their way of doing
  business. Focus on information that will make their job easier to
  accomplish.
- Formulate the reporting strategy for the device.
  - What elements of information are pertinent to alarm reporting? (Realtime)
    - Establish thresholds, e.g. three counts in a one hour time period.
    - Establish the priority of the alarm and any thresholds associated
      with priority escalation of the alarm.
    - Establish any diagnostic processes that could be run automatically
      or that the Help Desk could perform that would make their job easier.
    - Establish acceptable polling intervals (every five minutes, ten
      minutes, one hour, etc.)
  - What elements of information are pertinent to monthly reporting?
    - Availability of devices and services.
    - Usage and load.
  - What elements of information are pertinent to trending and performance
    tuning of network components and functions?
    - Look at ways to combine data elements or perform calculations on
      the data to make it more useful to the support organization.
- Interview Management to ensure the Network Management System is managing
  all areas pertinent to the business unit.
  - Explain the role and objectives of the Network Management System.
    - Increase productivity throughout the support organizations.
    - Reduce the Mean Time to Repair on the correction of problems.
    - Provide a proactive approach to the detection and isolation of
      problems.
    - Enable collaboration and the flow of information across support
      departments and sites.
  - Gather the requirements for the management of any function important
    to the business unit.
    - Don't limit these functions to only SNMP manageable devices.
    - If the devices associated with a function have no intelligence
      whatsoever, go back to management later with a proposal to upgrade
      the devices.
- Go implement the requirements. Focus each implementation toward each
  requirement while integrating the total system.
- After implementation of each piece, notify the support organization
  associated with the managed object or system that monitoring has started.
- At the first reporting period, go back and revisit the requirements with
  each support organization and management.
  - Reestablish requirements if necessary.
  - Be advised that the reports and types of data will change as each
    support organization becomes better informed.
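The threshold step in the list above ("three counts in a one hour time period") can be sketched as a sliding window. The class name and defaults are illustrative, and the sketch assumes event timestamps arrive in non-decreasing order, in seconds.

```python
from collections import deque


class ThresholdAlarm:
    """Raise an alarm when `count` events fall inside `window_seconds`."""

    def __init__(self, count=3, window_seconds=3600):
        self.count = count
        self.window = window_seconds
        self.events = deque()          # timestamps still inside the window

    def record(self, timestamp):
        """Record one event; return True when the threshold is reached."""
        self.events.append(timestamp)
        # Drop events that are a full window (or more) older than this one.
        while self.events and timestamp - self.events[0] >= self.window:
            self.events.popleft()
        return len(self.events) >= self.count
```

Priority escalation can reuse the same class with a second, stricter instance, for example a higher count over the same hour.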
During implementation, focus the alarm messages toward the Help Desk. They are
the front line of any MIS organization. Keeping them well informed of problems
is paramount to the successful deployment of the Network Management System.
Perform "Dry Runs" of alarms and the diagnostic steps associated
with getting the problem on the road to resolution in a quick and efficient
manner. Have the appropriate support organizations participate so that all
diagnostic steps can be identified and included. Don't leave out any management
notifications that may be necessary.
Train the Help Desk to input troubleshooting procedures pertinent to their
function into the diagnostics table. This can include anything from a user
calling in with a problem with an application (e.g. MS Word), to filling out
forms for a specific service to be provided to an end user.
The skills associated with the support organizations in one MFD may be
different from another MFD. The gathering of diagnostic procedures allows a
"sharing of the wealth" of knowledge across the enterprise. The
diagnostic procedures form a knowledge base, organized by symptom, of
problems, taskings, and what needs to be accomplished to correct each problem.
Having the skills of Desktop Support, Unix System Support, Network Support,
etc., at the fingertips of Help Desk personnel increases their ability to
react logically to problems as they occur.
The Network Management System, as a total integrated system, must be modular and
easy to expand and contract as the needs of the business change.
Element Management Systems, whether they are third party products such as
SunNet Manager, HP Openview, Netview 6000, Netview, NetMaster, 3M TOPAZ,
Larscom's Integra-T, or in-house developed pollers, need to be easy to
integrate into the whole system. Recognize that in the architecture, no EMS is
really aware of another. Awareness across EMS's needs to be accomplished at a
higher layer so that the EMS's can focus on their area of management within
their MFD.
Functions such as Alarm Correlation, Diagnostics across EMS's, etc., can be
accomplished using artificial intelligence principles within a relational
database. Almost all Manager of Managers products employ an AI Inference engine
to calculate how much more probable it is that one component has failed than
another. The inclusion of the AI Inference Engine drives up the
cost because of the engine AND the iron to run these types of calculations.
These types of decisions need to be made by the support
organizations within the MFD because these folks know the local environment
better than any machine or personnel at another site. Doesn't the overall
application serve its purpose better if it is more tightly integrated into the
business units?
AI still needs to be applied, but at a much different
level. Network General Distributed Sniffer Servers are an excellent application
of AI technology. By analyzing the relationships of protocols, traffic,
connections and LAN control mechanisms, the DSS uses AI to sort out problems at
a very low level before they become user identifiable problems and cause
degradation or downtime.
Additionally, artificial intelligence can be used to capture the heuristics
of network behavior and help with the diagnostics. Information from past
alarms for similar problems, together with what was done to isolate and
correct them, needs to be incorporated into the overall
system.
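A very simple form of this heuristic needs no inference engine at all: look up what fixed the same symptom in the past and suggest the most frequent answer first. The sketch assumes history is available as (symptom, fix) pairs taken from closed trouble tickets; the names are hypothetical.

```python
from collections import Counter


def most_common_fix(history, symptom):
    """Return (fix, share of past cases) for a symptom, or None if unseen.

    history: iterable of (symptom, fix) pairs from closed trouble tickets.
    """
    fixes = Counter(fix for past_symptom, fix in history if past_symptom == symptom)
    if not fixes:
        return None
    fix, count = fixes.most_common(1)[0]
    return fix, count / sum(fixes.values())
```

The share is a plain frequency over past cases, not a calculated probability of failure. It is meant only to point the technician in the right direction.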
As an MIS Manager, when you are approached by staff or vendors concerning
Network Management, there are a few key questions to ask.
How much will the system cost?
A lot of systems implemented today were specified to the MIS Manager by a
Salesman. Salesmen typically push huge amounts of hardware and
software at the problems at hand. Some vendors will even tell you that cost is
not important; it's the capability that counts.
Additionally, because a network management system must be customized to the
local environment, there are a lot of hidden costs beyond the hardware and
software.
Will the proposed system integrate into and enhance my current MIS support
capabilities?
A lot of MIS Managers really miss the boat by not demanding that the overall
system be tightly integrated into the business units. If the system serves no
business purpose, you are buying technology for technology's sake... the system is
doomed to failure.
Is the proposed system modular in design?
If everything in a Network Management System is loaded on one box, you're
setting yourself up for inefficient use of computing resources. If the system
contracts, the one box will be underutilized; if it expands, you'll be trading
that box in for a bigger one... losing money every time.
Is the product proposed just an Element Management System or is it an
Integrator of Element Management Systems?
Too many times, MIS Managers are sold a product like HP Openview or IBM Netview
6000 as a Manager of Managers System. Although some integration functions are
possible in these systems, using them that way takes away from their ability
to perform real work... like polling and gathering information.
What does the system monitor?
Match the capabilities of the proposed Network Management System to the key I/T
services provided. If it is not a good match now, it won't be later.
Does the proposed system enhance the capabilities of the current support
staff or does it add more support staff?
Be especially careful, in that some systems will do nothing to enhance your
current support staff capabilities and will add five or ten more personnel to
your staff and to your budget. Not to mention, these people are usually highly
skilled specialists in Network Management... who don't come cheap.
Look at the total picture of the entire enterprise and match what is proposed
to what's currently operational. Ask the same questions for each site.
There are a lot of excellent products available today that provide capabilities
to manage not just hardware, but services and applications. The way that these
systems are implemented is also critical in that each management capability
installed must match a business need for such a system. Additionally, these
diverse systems must be integrated together and into the support organizations
to achieve maximum effectiveness.
Author: Douglas W. Stevenson
HTML Conversion: Jeff Murphy jcmurphy@acsu.buffalo.edu