
 Building a Network Management System
 by Mark Cooper, in Editorials - Sun, Mar 13th 2005 00:00 UTC

This article looks at current NMS offerings and considers what would make a "real" NMS and how one might be built.


Copyright notice: All reader-contributed material on freshmeat.net is the property and responsibility of its author; for reprint rights, please contact the author directly.

What is an NMS?

The normal definition of NMS is "Network Management System". This is nice and easy to say, but very hard to pin down to an exact specification. What constitutes a well-rounded NMS?

I believe it to consist of at least:

  1. Up/Downtime Monitoring
  2. Reporting
  3. Configuration Change Management
  4. IP/Asset Management
  5. Security
  6. Event Correlation/Root Cause
  7. Alerting

There are a large number of Free/Open Source Software and commercial systems that claim to be NMSes, but none come close to covering all this functionality. Typically, systems fall into either a Network Monitoring (Up/Down) or Network Reporting role, not both.

NMS Generations

The types of systems available can be crudely categorized into three distinct generations:

  1. Pure Up/Down Monitoring.
    Typically with just ICMP, but some with applications (DNS, HTTP, etc.).
  2. Event correlation.
    Polling using SNMP, ICMP, and applications. Alerting on SNMP traps and syslog.
  3. Root Cause Analysis.
    Advanced event correlation to keep false positive alerts to a minimum.

Event Correlation/RCA

Event correlation is the core functionality of an NMS. Without it, too many false positive alerts are generated, which makes the system ineffective.

Root Cause Analysis takes event correlation a step further. Rather than just dampening alerts from nodes downstream of an existing problem, it only alerts on the real cause of a problem, to significantly reduce the time needed for a fix.
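
As a rough illustration of the idea, here is a minimal sketch that assumes the NMS already knows each node's upstream parent. The `PARENT` map and the node names are hypothetical, and a real RCA engine would need to handle redundant paths and stale topology data as well.

```python
# Hypothetical topology: each node maps to its upstream parent
# (None means the node sits directly next to the NMS).
PARENT = {
    "core-rtr": None,
    "dist-sw1": "core-rtr",
    "access-sw1": "dist-sw1",
    "web01": "access-sw1",
    "web02": "access-sw1",
}


def root_causes(down, parent=PARENT):
    """Return the down nodes whose upstream parent is not itself down.

    Nodes behind a failed parent are unreachable rather than
    necessarily broken, so they are suppressed instead of alerted on.
    """
    down = set(down)
    return sorted(node for node in down if parent.get(node) not in down)


if __name__ == "__main__":
    # access-sw1 fails, so both web servers behind it also look down.
    print(root_causes(["web01", "web02", "access-sw1"]))
```

With the three nodes above reported down, only the switch is returned, so the operator sees one alert instead of three.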

Efficient/Intelligent Polling

Currently, a typical NMS platform will consist of two main systems, with one solution doing the Up/Down monitoring, the other the reporting. This leads to extremely inefficient double polling of devices. Why ping a host to see if it's up when you've just gathered interface stats from it? Some systems can be integrated to help reduce this double polling, but only a single NMS solution will truly provide the most efficient use of the network.
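
One way to picture the saving is a sketch in which the reporting side records every successful poll and the up/down side only pings hosts it has not heard from recently. The class name, the 60-second default, and the injectable clock are all assumptions made for illustration.

```python
import time


class LivenessCache:
    """Remember when each host last answered any poll (SNMP, HTTP, ...)."""

    def __init__(self, max_age=60.0, clock=time.monotonic):
        self.max_age = max_age  # seconds a previous answer stays valid
        self.clock = clock      # injectable so the logic can be tested
        self.last_seen = {}

    def record_response(self, host):
        """Call this whenever the reporting side gets data from a host."""
        self.last_seen[host] = self.clock()

    def needs_ping(self, host):
        """True only if nothing has been heard from the host recently."""
        seen = self.last_seen.get(host)
        return seen is None or self.clock() - seen > self.max_age
```

A poller would call `record_response()` after each successful interface-stats collection and consult `needs_ping()` before sending an ICMP echo.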

To map, or not to map?

The traditional NMS provides a network map for operators to be able to point and click through to any problems. Some systems have dropped this functionality, claiming that operators only really need to be told what the real problems are. These are typically Second Generation event correlation engines that just provide a list of problems for the operator.

However, no matter how advanced the logic is in an NMS, it cannot cover all problems, and providing a visual representation for operators to work with can provide major gains. The human brain works best with visual images rather than the written word. NMSes need a map!

It's all about the Man-Machine Interface!

Aside from alerting (via email, SMS, etc.), how should an NMS interface with the operators? There are two distinct camps: dedicated GUI and HTTP. A growing number of HTTP interfaces (typically with some Java thrown in) are being used.

While this type of interface may have its uses, it is not the best medium in an operational environment. A dedicated GUI is the only way to provide a fast, efficient, reliable mechanism for operators to interact with an NMS.

Putting the M in NMS

The M stands for Management, but what's being managed, exactly? Network problems, mainly. A single generic management interface is somewhat of a holy grail that some people have been chasing. Is it achievable?

How far should management be taken? Many vendors have proprietary management software for their systems to provide an alternative to the command line. Should an NMS allow full management of a device without having to resort to a CLI? Some things can be done easily by SNMP, but what interaction should an NMS have with a device's CLI? RANCID provides an easy change management system for routers, but also shows the possibilities of integrating functions into an NMS that are typically done at the CLI level.

Think of being able to do mass changes (for example, SNMP community changes) via a few clicks on a GUI, rather than manually having to log in to thousands of devices.

Current Solutions

F/OSS

I'll mention the commercial solutions as well, as they typically have far better Man-Machine Interfaces. This is a typical problem with F/OSS, as programmers don't usually make good UI engineers.

Commercial

  • HP OpenView
  • SMARTS
  • Aprisma
  • Netcool
  • Concord
  • Proviso
  • InfoVista

Recreation or integration?

The beauty of F/OSS is that we have a huge, growing repository of code. So, do we start coding the "perfect" NMS from scratch, or use the tools already provided and just integrate the functionality we require?

Some commercial vendors make big claims about how their code is multi-threaded and "industrial strength". Producing good, clean, efficient code that will run on many platforms and is part of a large system is hard to do! Such a large, complicated system can also be extremely hard for new coders to get into. Keeping the functionality compartmentalized into small programs can ease these problems. This ties in well with using existing toolsets and just concentrating on an integration issue.

OSSIM is taking the integration approach, and it is well worth watching how well this works. Obviously, the double polling issue rears its head here, and would be a serious limiting factor in any large implementation. Although OSSIM comes from a security requirements background, it offers an example for the creation of a "proper" F/OSS NMS. How much work is involved in integrating Nagios with RRDtool? Could the cheops-ng GUI be used as the frontend for Nagios?

How do our original NMS requirements map to existing F/OSS projects?

Up/Down Monitoring:               Nagios, Big Brother
Reporting:                        MRTG, RRDtool
Configuration Change Management:  RANCID
IP/Asset Management:              NorthStar
Security:                         Snort, Tripwire
Alerting:                         Sendmail, etc.

Most of the functionality is covered across a number of projects. This only leaves Root Cause Analysis. Unfortunately, this is probably one of the hardest things to do.

To Poll, or not to Poll?

First generation NMSes like Nagios and Big Brother rely on polling, via ICMP or an application-specific method (HTTP, FTP, etc.), to do their up/down monitoring. Unfortunately, this really isn't network management. It's just node polling, and has major disadvantages.

To poll means there is a polling interval. What is the state of your network during these intervals? Actively polling the network is also a major scalability problem. The larger your network, the more polling required. Active polling systems are fine for monitoring a handful of systems, but to manage a network, you have to look at other mechanisms.
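
To put rough numbers on both problems, here is a back-of-the-envelope sketch. It assumes failures happen at a uniformly random moment within the polling interval and that the poll itself takes no time; the host counts in the example are made up.

```python
def polling_cost(hosts, checks_per_host, interval_s):
    """Estimate the load and detection delay of a fixed-interval poller."""
    return {
        # Every check on every host must run once per interval.
        "polls_per_second": hosts * checks_per_host / interval_s,
        # A failure lands, on average, halfway through an interval.
        "mean_delay_s": interval_s / 2,
        # Worst case: the failure happens just after a successful poll.
        "worst_delay_s": interval_s,
    }


if __name__ == "__main__":
    # 5000 hosts, 3 checks each, polled every 5 minutes.
    print(polling_cost(5000, 3, 300))
```

Shortening the interval improves the detection delay but raises the polling rate in direct proportion, which is the scalability trade-off described above.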

This is where systems such as OpenNMS and JFFNMS come in. These are realtime event-driven systems. Events are typically from SNMP traps, but can come from other sources such as syslog. There is no polling interval as such in these systems. If a node goes down, an SNMP trap is generated by the switch immediately. You now have true realtime network monitoring.

Of course, SNMP traps are typically not generated on application failures. Most NMSes will resort back to polling to monitor applications.

The Next Generation F/OSS NMS?

It would be nice to see better support for enterprise/carrier-grade functionality in F/OSS NMSes, such as support for bulkstats, NetFlow, and RCA.

However, there is something I have not seen either F/OSS or commercial systems using: host/network sniffing. Having a local host-based sniffer or a dedicated sniffer on a mirrored switch port could bring enormous gains for NMSes:

  • Network efficiency: No polling! No extra traffic is generated, as it relies on seeing exactly what's happening on the network.
  • Spotting problems immediately: It sees TCP RSTs, switch ports losing carrier signal, etc.
  • Real graphing: Not from graphing host to destination, but actual "real world" traffic.
  • The ability to track full user QoS: Tied into the network authentication platform (RADIUS, et al.), it can give real world user QoS reporting.
  • Extra functionality: Massive potential for per-IP-block monitoring/reporting, etc.

It's fast, flexible, distributed, and scalable!

Developing an NMS-centric pcap-based sniffer seems like the way forward. It could be easily integrated with current systems by being developed separately, and just generating SNMP traps when required.
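
As a small taste of what such a sniffer would do per packet, the sketch below checks a captured frame for a TCP reset. It assumes untagged Ethernet II frames carrying IPv4 (no VLAN tags, no IPv6, no fragment handling); the capture itself (for example via libpcap) and the trap generation are left out, and the function names are hypothetical.

```python
import struct

ETHERTYPE_IPV4 = 0x0800
IPPROTO_TCP = 6
TCP_FLAG_RST = 0x04
ETH_HEADER_LEN = 14


def is_tcp_rst(frame: bytes) -> bool:
    """Return True if an Ethernet II frame holds an IPv4 TCP segment with RST set."""
    if len(frame) < ETH_HEADER_LEN + 20:
        return False
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETHERTYPE_IPV4:
        return False
    # Low nibble of the first IPv4 byte is the header length in 32-bit words.
    ihl = (frame[ETH_HEADER_LEN] & 0x0F) * 4
    if ihl < 20 or frame[ETH_HEADER_LEN + 9] != IPPROTO_TCP:
        return False
    tcp_start = ETH_HEADER_LEN + ihl
    if len(frame) < tcp_start + 14:
        return False
    # Byte 13 of the TCP header carries the flag bits.
    return bool(frame[tcp_start + 13] & TCP_FLAG_RST)


def count_resets(frames):
    """Count the frames in an iterable that carry a TCP RST."""
    return sum(1 for frame in frames if is_tcp_rst(frame))
```

A monitor built on this could raise an event once `count_resets()` for a given server exceeds some threshold within a time window.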


Author's bio:

Mark Cooper can be reached at mark@mcooper.demon.co.uk.



 Referenced categories

Topic :: System :: Monitoring
Topic :: System :: Networking :: Monitoring

 Referenced projects

Big Brother - A highly efficient network monitor.
cheops-ng - A network management tool.
Just For Fun Network Management System - A PHP-based network management system.
mrtg - The Multi Router Traffic Grapher.
Nagios - A powerful network and system monitor.
Network Weathermap Creator - A network weather map creator.
NorthStar - An IP address tracking system.
OpenNMS - An enterprise-grade network management platform.
RANCID - A Cisco, Juniper, Foundry, and Redback configuration archiver.
RRDtool - Time-series data storage and graphing software.
Snort - A libpcap packet sniffer, logger, and lightweight IDS.
Tripwire - An intrusion detection system.

 Comments

[»] GroundWork Monitor wasn't included here
by Amy Abascal - Oct 10th 2007 15:03:42

GroundWork Monitor Open Source is a great option: http://www.groundworkopensource.com/



[»] A Real NMS
by MadEyeMoody - Sep 6th 2006 09:26:44

Most developers out there today are all jacked up about SOA. It's easy to program, uses SSL/HTTPS for security, and it's becoming very widespread. When you throw in J2EE and JMS, you now have all your dev guys drooling.

Some of this stuff just doesn't work in certain cases. For example, let's say you have a process that collects performance data on a device in clumps, like NetFlow data. Data sets are huge, and you want your data keyed correctly so that it is usable and functional. So you end up encoding NetFlow data into XML records. This becomes a huge behemoth across the wire, as not only is each record delimited, it is also escaped. For example, say you use a field called ACMEVALUE. In XML speak, that's:
<ACMEVALUE>1234567890
</ACMEVALUE>
So now you've added a lot more data to the dataset for the sake of flexibility. And this really adds up across the wire!
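
To get a feel for the size of that overhead, here is a small sketch that compares the raw value with its tagged form. It ignores whitespace between tags and any escaping of special characters, both of which would only make the wrapped form larger; the function name is made up for illustration.

```python
def xml_overhead(tag, value):
    """Return (raw bytes, wrapped bytes, ratio) for one XML-encoded field."""
    raw = len(value.encode("utf-8"))
    wrapped = len(f"<{tag}>{value}</{tag}>".encode("utf-8"))
    return raw, wrapped, wrapped / raw


if __name__ == "__main__":
    raw, wrapped, ratio = xml_overhead("ACMEVALUE", "1234567890")
    print(f"{raw} bytes of data become {wrapped} bytes on the wire ({ratio:.1f}x)")
```

For the ACMEVALUE field above, 10 bytes of data become 33 bytes once wrapped, and that factor applies to every field of every record.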

The second thing is that it takes a long time to process. A SOAP transaction cannot be completed until all of the data is encapsulated in the SOAP envelope. This may take an inordinate amount of time and blocks vital resources during the process.

In SOA, when you start using stuff like publish and subscribe in near real time, it ends up blocking during the I/O phases, which slows everything down and makes it non-scalable in large environments.

Additionally, everyone is wrapped up in the CMDB concept as introduced by ITIL. Not to say a CMDB won't work... In some cases, a CMDB is used; it's only localized. Think about Windows registries and you get the gist. Stuff happens too fast on some levels of the data to keep this data in a centralized spot. You end up having to mix data elements and locations depending upon the volatility and usefulness of the data elements themselves.

SNMP, when you look at it, is a schema for a highly distributed database where the data access mechanism is SNMP rather than something like SQL*Net or ODBC and SQL.

The thought of a Real NMS is evolving very quickly, almost haphazardly. Yet the technology being used to build the next generation of NMSes lacks stability and may not be very scalable. It has been said that Corporate America is spending a huge amount of money to convert all its applications to SOA and Java, only to lose functionality, stability, and scalability. And worse yet, they are offshoring the coding in many cases, making it impossible to support in the future, with support offshored as well!



[»] OpenNMS should be added...
by MadEyeMoody - Sep 6th 2006 07:47:01

OpenNMS is doing very well these days. It should be part of the list.



[»] Solar Winds
by Terrance - Dec 26th 2005 10:13:48

Would you add Solarwinds to this list?



[»] RE: Building a Network Management System
by MadEyeMoody - Apr 4th 2005 12:12:21

I wrote a white paper in 1994 called "Network Management: What It Is and What It Isn't" that is still somewhat pertinent even after all these years.

You're correct. Rarely do all of the functions of the FCAPS model find their way into valid implementations... However, I think that if you work on the things that you can get the most value, you can achieve some level of success without doing the know all - end all - be all NMS implementation.

RCA - (Another overused term if ever there was one!) For my own purposes, I define 6 levels of correlation within Management implementations. I do this so that I can explain a specific function to a management person without having to deal with FUD presented by Vendors.

Event correlation
Device Correlation
Alarm Correlation
System Correlation
Business Correlation
Performance Correlation

RCA is difficult. In fact, I see that even the commercial products that tout it can be misleading in certain situations. For example, if you base your Root Cause Analysis on a topology and you do not programmatically reverify the topology during an outage, your analysis may be based on facts that are not true.

For certain polling situations such as polling a service via a spoofed transaction - This can be VERY Dangerous! I have witnessed some app developers "tighten up" their transactions for the spoofed transactions. This can skew the results or even mask issues.

Passive monitoring holds great promise because you can gain the perspective of the ACTUAL END USER... Not a spoofed user. So if you have a Catalog / Shopping Cart application, you can see what customers are ACTUALLY doing versus sending in a secret shopper.



[»] Intermapper
by Mark - Mar 13th 2005 23:52:41

Intermapper is another interesting commercial package with an emphasis on SNMP monitoring and automated mapping. They used to provide free demo downloads directly from their website, and I remember pleasant afternoons mapping out huge networks with the package.



[»] SmokePing
by X-Nc - Mar 13th 2005 19:29:51

I would add SmokePing to the list of tools.

It has come in very handy many times on the rather large network where I previously worked (i.e. the biggest intranet on the planet). It also comes from Tobi, who made MRTG and RRDtool.

--
If I actually _could_ spell I'd have spelled it right in the first place.



    [»] Re: SmokePing
    by gollum - Mar 14th 2005 10:51:18


    > I would add SmokePing to the list of
    > tools.

    Yup, smokeping comes under the general RRDTOOL banner, RRDTOOL being the data storage method and smokeping et al being the collection/reporting method. Loadsa RRDTOOL based tools at http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/rrdworld/index.html



[»] Project Maintenance
by imipak - Mar 13th 2005 17:17:00

It is no good, writing the perfect program, if you then neglect it. Systems evolve, so the monitoring software must do likewise. Big Brother is a classic example of "bit rot", for example. There have been no updates to Big Brother for many years. In that interval, Perl has been largely resculpted, virtual systems have largely supplanted physical servers and many organizations are either using, or have adopted, SANs and other "special interest" network technologies.

In the tradition of Unix tools, there won't be a "one size fits all" solution. Rather, there will be a large number of specialist tools that can be integrated. That is inevitable, as that is the only solution that has proved workable in the long-term.

However, NMS systems don't play well together. Typically, you would need to use several solutions (Smokeping, MRTG, Big Brother/Big Sister, pchar, Ganglia, etc) and hope that you've covered the bases. You'd be very lucky if you did. More likely, you'll have an uneven mix of data that overlaps, possibly conflicts and likely confuses more than helps. There is no easy way, for example, to tell MRTG that if Big Brother detects a host as down, it needn't bother querying it for SNMP data. There is no easy way to tell Big Brother that, if pchar detects severe latency, it needs to extend its timeouts accordingly.

As far as I know, none of these programs play nice with ECN, so if the network detects overload, none of these programs can be instructed to throttle back. MRTG uses SNMP but I can see no obvious way to take advantage of SNMPv3 over the earlier variants. MRTG also supports IPv6, but IPv6 supports mobility and MRTG uses static hostnames. None of them work with multicasting, to the best of my knowledge. Nor do they support RSVP to reserve bandwidth for communication.

Frankly, I'm not impressed with the state of NMS at the moment. Too much overlap, too little useful information, poor integration, poor maintenance and poor designs. Large infrastructures are hard to maintain, because there is really nothing to maintain them with. That is not a good situation to be in.



    [»] Re: Project Maintenance
    by Simon Clift - Mar 14th 2005 09:05:05


    > Big Brother is a classic
    > example of "bit rot", for example.

    I've deployed a program, BigSister, that uses the BigBrother protocol (which is lightweight and simple to deploy across diverse systems, in my case Unix, Windows and VMS). The Perl structure of BigSister is, at least at first, only translucent, but I was able to make the extensions I required. My only complaint was speed of update; it is a web-based system so the browser refresh rate was a problem. The advantage of that is, however, no client to deploy.

    Big Sister on SF

    In my experience, setting up a monitoring system is an easily underestimated task.



    [»] Re: Project Maintenance
    by gollum - Mar 14th 2005 11:02:37


    > It is no good, writing the perfect
    > program, if you then neglect it. Systems
    > evolve, so the monitoring software must
    > do likewise. Big Brother is a classic
    > example of "bit rot", for example. There
    > have been no updates to Big Brother for
    > many years. In that interval, Perl has
    > been largely resculpted, virtual
    > systems have largely supplanted physical
    > servers and many organizations are
    > either using, or have adopted, SANs and
    > other "special interest" network
    > technologies.

    I totally agree!

    One of the major problems with NMSes is how to keep them 'in sync' with the network.

    Without the NMS being aware, to a greater or lesser extent, of the protocols (layer 3/4) and the network (layer 1/2), it's hard to do decent Root Cause Analysis.



    [»] Re: Project Maintenance
    by Katja - Sep 4th 2006 17:25:16


    > In the tradition of Unix tools, there
    > won't be a "one size fits all" solution.
    > Rather, there will be a large number of
    > specialist tools that can be integrated.
    > That is inevitable, as that is the only
    > solution that has proved workable in the
    > long-term.


    I agree with you. The range of options offered by the suggested NMSes varies very widely.
    It takes a lot of time to install, administer, and monitor different programs.

    Katja



[»] Zabbix should have been mentioned!
by welshpjw - Mar 13th 2005 17:10:47

http://www.zabbix.com/ is VERY good and VERY configurable! I recently found out about it and dropped looking at other options.



    [»] Re: Zabbix should have been mentioned!
    by Michael Shigorin - Mar 18th 2005 09:59:21

    +1

    --
    Michael Shigorin mike SOMEWHERE AT altlinux PLUS DOT org



[»] Diverse environments cannot support one-does-all
by Gustaf Gunnarsson - Mar 13th 2005 15:55:45

I think that one has to accept that the word NMS may mean different things depending on where in the chain of EM's (element managers) you are.

For instance the top-layer NMS handling alarms, which in a large network handles well above 100k alarms on average per day is complicated enough as it is.

Once you get this big, the configuration management system or the reporting system will be a system in itself, even if labeled as the same product, simply because of the number of different kinds of equipment you will have.

The key is interoperability between systems, and this is what should be achieved. Trying to make a system which can do it all will give no benefit, only headaches once you need to extend it to support a new kind of element.

What I am saying is that, in my view, the OSS systems which exist today are fine, and if anybody wants to take this further, it should be done by making, for instance, a reporting system which can easily be configured to get data from different sources like RRD/SQL databases and then present the information to the user in a structured and generic way.

The same would apply to the alarm handling system. The focus should be on handling alarms in a generic way. Don't try to interface with all the different kinds of equipment and verify that they work; there are plenty of systems already doing this excellently. Just try to receive alarms which somebody sends, interpret the language and the message, and present this in a generic fashion to the operator.

--
failure is not an option (f) 2008 bus[iy]ness as usual team



[»] Big Brother
by pitchpoledave - Mar 13th 2005 07:39:58

Hi,
I don't think that you have fully explored the potential for this article. If you did a full feature comparison, it might shed more light on the topic.

For example, Big Brother CAN do "real time" monitoring via SNMP traps and reporting via LARRD or Butter. It can also do SLA reporting, which I don't think any of the other OSS products can do.

Also, one HUGE advantage that Big Brother has is that it has agents for Windows and various *nix systems. This way you can do NETWORK and SERVER monitoring from the same console. Just checking ports up/down doesn't cut it any more.



[»] Monitor depth
by Daniel Feenberg - Mar 13th 2005 07:10:52

One thing I rarely see discussed about monitoring is how thoroughly each server is tested. For instance, for SMTP, testing could stop at any of the following points:

1) TCP connection
2) SMTP banner received
3) test message accepted
4) test message delivered

For DHCP, testing could stop at:

1) a reply from the server (DHCP runs over UDP, not TCP)
2) get an address from a pool
3) check that address works (in some sense)

In our experience, lots of broken servers
will complete step 1.
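
To make the SMTP case concrete, here is a minimal sketch of a check that reports how deep it got. It only covers the first two stages (TCP connection and banner); accepting and delivering a test message are left out. The function name and return codes are made up for illustration.

```python
import socket


def smtp_check_depth(host, port=25, timeout=5.0):
    """Return how far an SMTP check got.

    0 = no TCP connection, 1 = TCP connection only,
    2 = a 220 greeting banner was received.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return 0
    with sock:
        try:
            banner = sock.recv(512)
        except OSError:  # covers timeouts and connection resets
            return 1
    return 2 if banner.startswith(b"220") else 1
```

A monitor that only performs stage 1 would report a server as healthy even if the mail daemon behind the listening socket has hung, which is exactly the gap described above.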



    [»] Re: Monitor depth
    by Bill Carlson - Mar 15th 2005 08:00:13


    > One thing I rarely see discussed about
    > monitoring, is how thoroughly each server
    > is tested. For instance, for SMTP,

    Many of the systems discussed, at the very least Nagios, base their monitoring around an open plugin system. One can certainly write a plugin to comprehensively test a service, it's just a matter of doing it.

    Some plugins just require a little thought to implement. For instance, on a host with several virtual websites, it's not enough to get to the IP and speak HTTP, you need to know that a particular site is being accessed. An easy setup is to use check_http to check for a specific page on each site and do a critical error on a 404 (page not found). Simple and works.



[»] Event Correlation
by Brian E. Seppanen - Mar 13th 2005 06:05:24

The topic is overly broad and the article doesn't give many details.

It mentions Event Correlation, but doesn't mention the open source option sec (Simple Event Correlator) at http://kodu.neti.ee/~risto/sec/

Where I work, Nagios, snmptrapd, sec, Perl, MySQL, Sendmail, procmail, paging, and all those other open source tools have managed to create a wonderful environment which we're going to scale via management tools.

Our parent company constantly talks of 6-18 months for implementing anything, which when implemented still doesn't cover simple polling or service verification, and is almost all trap-based event correlation. That is good for some things, but mail servers and proxy servers don't send out traps when they are misconfigured and the customer is impacted.

I would just like to say that people shouldn't assume that a commercial product is going to do any better than OSS in the right hands. That isn't to say that a commercial product in the right hands wouldn't be able to do it all, but you'd have to hope there is a budget for it.

--
Area 54: The Secret Government Disco Labs in Provo Utah



    [»] Re: Event Correlation
    by gollum - Mar 14th 2005 11:18:47


    > It mentions Event Correlation, but
    > doesn't mention open source option - sec
    > (Simple Event Correlator) at
    > http://kodu.neti.ee/~risto/sec/

    Event Correlation can be just event dampening with a bit more logic, which is still a long way from the Root Cause Analysis that is the real requirement.

    RCA is *hard* to do :/



      [»] Re: Event Correlation
      by Michael Shigorin - Dec 15th 2005 08:41:23


      > Event Correlation can be just event
      > dampening with a bit more logic which
      > is still a long way from the Root Cause
      > Analysis that is the real requirement.

      Just for the record (as this editorial is bookmarked for NMS stuff): it seems like inspecting AutoNOC might be useful while implementing RCA in any of the available free NMSes. (linked here)

      --
      Michael Shigorin mike SOMEWHERE AT altlinux PLUS DOT org



[»] very interesting...
by Jean-Luc Fontaine - Mar 13th 2005 05:00:55

...and nicely summarized. I have to read it again more thoroughly.
Have you taken a look at moodss (http://moodss.sourceforge.net) which I hope has a good enough GUI for a free piece?
Thanks for this nice article. Jean-Luc



[»] Polling is very important
by thaig - Mar 13th 2005 04:51:53

Polling is important because it is positive proof that a service is working for end users and meeting its Service Level Agreements.
If you don't simulate the end user's access method, then you aren't testing the service's availability. For this reason, polling will never go away.
Some quite large networks are polled for HTTP and ICMP by various commercial solutions, and it works well (thousands of polls every 10 minutes).



    [»] Re: Polling is very important
    by gollum - Mar 14th 2005 11:12:25


    > Polling is important because it is
    > positive proof that a service is working
    > for end-users and meeting its Service
    > Level Agreements.

    I disagree. As I stated in the article, a pcap-based system would actually *see* the end users' traffic and therefore provide a *real* SLA.

    > If you don't simulate the end user's
    > access method then you aren't testing
    > the service's availability. For this
    > reason polling will never go away.

    Polling is not real world. Polling from one GigE-connected server to another GigE-connected server is not simulating the end user's access method.

    > Some quite large networks are polled for
    > HTTP and ICMP by various commercial
    > solutions and it works well (1000's of
    > polls every 10 minutes).

    Hmmm, define 'works well'? :) I would argue not. Ten minutes before I know my main authentication platform is down? No, that's not 'works well' for me ;)



      [»] Re: Polling is very important
      by Bill Carlson - Mar 14th 2005 14:35:57


      > I disagree. As I stated in the article,
      > a pcap based system would actually *see*
      > the end users traffic and therefore
      > provide a *real* SLA.

      Without some kind of polling, you can't say whether a service is actually working. Depending strictly on generated traps/events is not good enough; that mechanism can fail and would then be silently dead. Polling is a good cross check.

      > Polling is not real world. Polling from
      > one GigE connected server to another
      > GigE connected server is not simulating
      > the end users access method

      You're picking on certain cases. The argument still stands: polling is useful.

      > Hmmm, define 'works well' ? :) I would
      > argue not. Ten minutes before I know my
      > main authentication platform is down?
      > No, thats not 'works well' for me ;)

      Again, a specific case fails to address the argument.




        [»] Re: Polling is very important
        by gollum - Mar 14th 2005 22:51:41


        > Without some kind of polling, you can't
        > say whether a service is actually
        > working. Depending strictly on generated
        > traps/events is not good enough, that
        > mechanism can fail and would then be
        > silently dead. Polling is a good cross
        > check.

        Nope. With a sniffer on the host or a mirrored port you would see any problem immediately, rather than waiting for a polling period.

        It would not be 'silently dead'. You would see TCP RSTs or ICMP errors; the list goes on. Polling would miss IP problems that would only show up as, say, increased latency. Packet loss, retransmission, TCP window size changes, etc. are all picked up by a pcap-based monitor.

        I'm not ruling out polling completely, just saying it should not be the core way of finding problems on the network.



          [»] Re: Polling is very important
          by Bill Carlson - Mar 15th 2005 07:52:56


          > Nope. With a sniffer on the host or a
          > mirrored port you would see any problem
          > immediately, rather than waiting for a
          > polling period.

          AGAIN, if the host can't communicate back to the NMS, you won't know it's dead. You need some kind of polling going, at the very least a check that traps or whatever mechanism gets back to the NMS still works.

          > I'm not ruling out polling completely,
          > just saying it should not be the core
          > way of finding problems on the
          > network

          No, your original point was that polling was worthless. You are now agreeing with my point: you need to have a polling mechanism somewhere, if for no other reason than to cross check that certain services are still active (SNMP traps are being sent and received, event messages are being sent, received and processed, etc). I'm not talking about the service itself; you're correct that an active monitor (I don't like the term 'real time') would be a plus for some services. But you need something to make sure your reporting infrastructure is working, and that means periodic "I'm alive" messages, i.e. polling.



            [»] Re: Polling is very important
            by gollum - Mar 15th 2005 11:16:07


            > AGAIN, if the host can't communicate

            > back to the NMS, you won't know it's

            > dead. You need some kind of polling

            > going, at the very least a check that

            > traps or whatever mechanism gets back to

            > the NMS still works.


            If the host cannot communicate back to the NMS, you would know instantly (rather than waiting for a polling period) as the connection would fail. By this I mean that the pcap monitoring would be constantly sending info back to the NMS via either an always-on TCP connection or a stream of UDP.

            Why did the host suddenly lose connection? The NMS should already know the answer, as it should be able to do root cause analysis and see a switch failure or some other fault in the network path.

            If you are receiving SNMP traps or syslog from a router, and these suddenly stop, there is obviously a problem.

            The more you integrate the NMS into the network, the easier the RCA becomes. If your entire infrastructure is set up to send syslog and SNMP traps back to the NMS, it should already know why the host cannot communicate back, because (for example) it has just tracked a network admin logging in to a router and deleting a static route by mistake.



            > No, your original point was that polling

            > was worthless. You are now agreeing with

            > my point, you need to have a polling

            > mechanism somewhere, if for no other

            > reason than to cross check that certain

            > services are still active (SNMP traps

            > are being sent and received, event

            > messages are being sent, received and

            > processed, etc). I'm not talking about

            > the service itself, you're correct that

            > an active monitor (I don't like the term

            > 'real time') would be a plus for some

            > services. But you need something to make

            > sure your reporting infrastructure is

            > working and that means periodic "I'm

            > alive" messages, ie polling.


            I'm proposing you sniff the *actual* traffic. You see *ALL* the traffic. Why poll a device when you can see traffic going to and from it? You can see people connecting to port 80, GETting a URL, etc. This establishes that both the host and the service are up and running, as well as the routing between the host and the person connecting.

            If the webserver dies, you would instantly see TCP RSTs. No waiting for a poll: instant, 'real-time' :)
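A rough sketch of the passive inference argued for above, assuming the capture layer (pcap or similar, not shown) has already reduced each TCP segment to a sender, a receiver, and a set of flag names. The Segment type and service_state function are hypothetical. Treating any RST from the service as "refusing" is a simplification, since a server can also reset an individual connection while still healthy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    src: str            # "ip:port" of the sender
    dst: str            # "ip:port" of the receiver
    flags: frozenset    # TCP flag names, e.g. frozenset({"SYN", "ACK"})


def service_state(segments, service):
    """Return 'up', 'refusing', or 'unknown' from the latest evidence seen.

    Only segments sent BY the service count as evidence: a SYN-ACK means it
    accepted a connection, an RST means it reset or refused one.
    """
    state = "unknown"
    for seg in segments:
        if seg.src != service:
            continue
        if {"SYN", "ACK"} <= seg.flags:
            state = "up"
        elif "RST" in seg.flags:
            state = "refusing"
    return state
```

With no segments from the service at all, the result stays "unknown": passive observation alone cannot tell an idle service from a dead one.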



              [»] Re: Polling is very important
              by Jason Martin - Apr 25th 2005 14:42:51

              Most NMSes don't persist a connection to the central server, so you can't check for a 'failed connection'. Additionally, most event managers don't have a comprehensive list of all hosts sending them events, so they don't know to look for non-events.
              The traffic behavior of your application helps decide whether polling or sniffing is appropriate. Nobody wants to get up at 2:00am to respond to an alarm that there is no traffic to a host, only to find that the reason is that it is Christmas Eve and nobody happens to want to visit the site. Ideally the lack of traffic would kick off a poll to perform an independent check.
              Also, there is a difference between 'Event Management' and 'Network Monitoring'. EM is passive (someone sends in alerts), while NM actively checks and verifies that things are working.
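One way to sketch the "lack of traffic kicks off a poll" idea above, assuming the sniffer supplies a timestamp of the last traffic seen for the host and that probe is whatever active check suits the service (an HTTP GET, a ping, an SNMP get). The function name, parameters, and return strings are all hypothetical.

```python
def check_host(last_traffic_ts, now, quiet_after_s, probe):
    """Decide host status, polling only when passive data has gone quiet.

    last_traffic_ts: time the sniffer last saw traffic for the host
    now:             current time, same units as last_traffic_ts
    quiet_after_s:   how long without traffic before we stop trusting silence
    probe:           zero-argument callable returning True if the host answers
    """
    if now - last_traffic_ts <= quiet_after_s:
        return "up (passive)"       # recent real traffic, no poll needed
    # Quiet period: run an independent active check before raising an alarm.
    return "up (polled)" if probe() else "down"
```

On a quiet Christmas Eve the probe succeeds and nobody is paged; an alarm is only raised when both the passive and the active evidence agree that the host is gone.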



              [»] Re: Polling is very important
              by Todd - Aug 2nd 2005 12:50:32



              > I'm proposing you sniff the *actual*

              > traffic. You see *ALL* the traffic. Why

              > poll a device when you can see traffic

              > going to and from it? You can see people

              > connecting to port 80, GETing a url, etc

              > etc. This establishes both the host and

              > service are up and running aswell as the

              > routing between the host and the person

              > connecting.

              >

              > If the the webserver dies, you would

              > instantly see TCP RST's, no waiting for

              > a poll, instant, 'real-time' :)

              >

              And what happens if nobody is accessing the web server or SMTP server in question for 5 or 10 minutes?



    [»] Re: Polling is very important
    by MadEyeMoody - Sep 6th 2006 07:43:00


    > Polling is important because it is

    > positive proof that a service is working

    > for end-users and meeting it's Service

    > Level Agreements.

    > If you don't simulate the end user's

    > access method then you aren't testing

    > the service's availability. For this

    > reason polling will never go away.

    > Some quite large networks are polled for

    > HTTP and ICMP by various commercial

    > solutions and it works well (1000's of

    > polls every 10 minutes).

    >

    Polling of services can be significantly enhanced through passive monitoring techniques. If you're watching end-user sessions and these are working correctly, why poll? However, when you no longer see the traffic or sessions, or you see session issues, you should poll to verify.

    Intelligent polling in SNMP is vital. First of all, an ICMP ping may not be a reliable mechanism in your environment. I have seen environments where pings are rate-limited or blocked. (I even saw one environment where they attempted to block all of ICMP. Doh!)

    I use a technique in Open Service NerveCenter that I call implicit status determination. In this technique, I use the validity of other SNMP polls to imply a status for a higher-order object. For example, I treat good status polls of interfaces via ifEntry as valid good status for my node status. When this happens, I use the finite state machine function of NerveCenter to hold off actual polling until a given interval has passed, thereby creating a sliding window for status and status polling.

    When I first deployed this technique, I was managing a series of Centillion switches employing LANE. I was able to maintain a 20-second status interval while actual polling was reduced to 1 in 8 on average.

    Using this same technique on a 2-minute interval, I was able to benchmark against HP OpenView polling on a 5-minute status interval, using less than 25% of the bandwidth that OpenView NNM used.
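The sliding-window idea above can be sketched roughly as follows. This is not NerveCenter's API or its state machine syntax; the ImplicitStatus class and its method names are hypothetical, and it assumes any successful SNMP poll of a node's interfaces counts as proof that the node itself is up.

```python
class ImplicitStatus:
    """Defers explicit node status polls while other polls keep succeeding."""

    def __init__(self, interval_s):
        self.interval_s = interval_s
        self.last_good = {}     # node -> time of last successful poll of any kind

    def note_success(self, node, now):
        """Record that some poll of the node (e.g. an ifEntry read) succeeded."""
        self.last_good[node] = now

    def needs_status_poll(self, node, now):
        """True only if nothing has vouched for the node within the interval."""
        last = self.last_good.get(node)
        return last is None or now - last >= self.interval_s
```

Each successful interface poll slides the window forward, so a dedicated status poll is only sent for nodes that have been quiet for a full interval.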





© Copyright 2008 SourceForge, Inc., All Rights Reserved.