Thursday, November 7, 2013

SCOM 2012 R2 Network Monitoring: Where Are The Alerts?!

First some background information
Okay, SCOM 2007 had some serious issues with network monitoring. So in SCOM 2012 this component got a complete overhaul and was rewritten from the ground up. And indeed, network monitoring in SCOM 2012 has improved compared to SCOM 2007. But to say it has really become top notch is a bit too much.

No, SCOM 2012 won’t replace the purebred network monitoring tools. But guess what? Those tools will never replace SCOM 2012 either. Ever. No matter what the marketing departments of those very same vendors want you to believe.

But when the network monitoring part of SCOM 2012 is put into perspective (SCOM 2012 monitors tons of workloads, whether on-premises, cloud based or on mobile units, and from different angles, inside and outside) it’s okay. It has become an integrated part of the famous 360 degree monitoring. And for once I agree with the marketing team of Microsoft, because on this topic they tell the truth without any overestimation.

And now what?!
However, some things don’t seem to change and can still cause some strange issues. Suppose you have a brand new SCOM 2012 R2 RTM environment in place and everything is by the book. Many servers (Windows & Unix) are monitored, and many different kinds of workloads run on those very same servers. And yes, many important network devices are being monitored as well.

And now one of those important monitored network devices goes down. In this case there were other monitoring solutions in place as well and they triggered the alarms. However, SCOM, which was monitoring that network device as well, stayed quiet. And not for a few minutes but for a long, long time. And it reported the network device to be HEALTHY!

Time to investigate
This really puzzled me, so it was time for a deep dive into the way SCOM monitors network devices and alerts on them. I agree, noise is bad, but not alerting when something is really amiss is even worse!

In Health Explorer of any given monitored network device you’ll find these two Unit Monitors:

  1. ICMP Ping
  2. SNMP Ping

These two Unit Monitors roll up to the Dependency Monitor Network Device Responsiveness, as seen in this screen dump:
image

So far so good. Both Unit Monitors are targeted against the Class Node, which is basically any monitored network device. However, per Unit Monitor there is an override in place which disables it.

The ICMP Ping Unit Monitor is disabled when the network device is covered by SNMP only, and the SNMP Ping Unit Monitor is disabled when the network device is covered by ICMP only. And this makes perfect sense.
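The effect of those two overrides can be sketched in a few lines of Python. This is only an illustration of the logic described above, not how SCOM implements it: the `AccessMode` enum and the `active_ping_monitors` function are made-up names, and the third case (a device covered by both ICMP and SNMP, where neither override applies) is my assumption.

```python
from enum import Enum


class AccessMode(Enum):
    # Hypothetical labels for how a network device is covered.
    ICMP_ONLY = "ICMP only"
    SNMP_ONLY = "SNMP only"
    ICMP_AND_SNMP = "ICMP and SNMP"  # assumption: neither override applies


def active_ping_monitors(mode: AccessMode) -> set:
    """Return the Unit Monitors left enabled for a device with this coverage."""
    monitors = {"ICMP Ping", "SNMP Ping"}
    if mode is AccessMode.SNMP_ONLY:
        # Override: ICMP Ping is disabled for SNMP-only devices.
        monitors.discard("ICMP Ping")
    elif mode is AccessMode.ICMP_ONLY:
        # Override: SNMP Ping is disabled for ICMP-only devices.
        monitors.discard("SNMP Ping")
    return monitors


for mode in AccessMode:
    print(mode.value, "->", sorted(active_ping_monitors(mode)))
```

Whatever is left enabled is what rolls up to Network Device Responsiveness, so the settings of that remaining Unit Monitor decide how fast the state changes.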

But the configuration of those Unit Monitors really puzzled me.

Unit Monitor SNMP Ping
This Unit Monitor has some settings which I don’t fully understand. Let’s take a look at the Knowledge which describes this Unit Monitor in Health Explorer:
image

The options Interval and Number of Samples are most important here. First of all, the Interval on this Unit Monitor isn’t 240 seconds in SCOM 2012 R2 but 300 seconds, which is 5 minutes. The Number of Samples is indeed set to three. Basically this means any given monitored network device can be down for 15 minutes (3 samples x 300 seconds) before SCOM 2012 R2 triggers an Alert!
image

Another thing which I am not happy with is the Health State when the network device doesn’t respond. It’s not set to Critical but to a Warning status:
image

When a network device goes down, I want it to be a Critical Alert, not a Warning. However, since this Unit Monitor (and the ICMP Ping Unit Monitor) rolls up to a Dependency Monitor, and that Dependency Monitor is what triggers the Alert, this kind of modification shouldn’t be done on the Unit Monitor level.

So for the Unit Monitor SNMP Ping I set these two overrides:

  1. Interval: from 300 seconds to 30 seconds;
  2. Number of Samples: from 3 to 2.

So now this Unit Monitor will change State after a minute when a monitored Network Device is down:
image

Time to take a look at the second Unit Monitor, ICMP Ping.

Unit Monitor ICMP Ping
This Monitor is configured a bit differently compared to the SNMP Ping Unit Monitor. But still it needs some serious attention. This is what Health Explorer tells us:
image

So this Unit Monitor changes State after 6 minutes (an Interval of 120 seconds x 3 Samples), which is still too long. Also, a Warning State is generated, not a Critical condition…
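The arithmetic behind these delays is simply Interval x Number of Samples. Here is a minimal Python sketch of it, using only the values mentioned in this post. The function name is made up, and the model assumes every sample in the window has to fail before the State changes, so the real delay may differ a bit.

```python
def seconds_until_state_change(interval_seconds: int, samples: int) -> int:
    """Rough estimate: all samples in the window must fail before the State changes."""
    return interval_seconds * samples


# (Interval in seconds, Number of Samples), as described in this post.
configs = {
    "SNMP Ping (default)": (300, 3),
    "SNMP Ping (override)": (30, 2),
    "ICMP Ping (default)": (120, 3),
}

for name, (interval, samples) in configs.items():
    delay = seconds_until_state_change(interval, samples)
    print(f"{name}: {delay} seconds ({delay / 60:g} minutes)")
```

This gives the 15 minutes and 6 minutes for the defaults, and the one minute for the SNMP Ping override.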

Time for some Overrides here as well. So now this Unit Monitor will change State after a minute when a monitored Network Device is down:
image

Time to move on to the Dependency Monitor, Network Device Responsiveness, since I want a Critical Alert with Priority High (for the Notification subscription, which only sends out New Alerts that are Critical and have Priority High).

These are the Overrides I set:
image
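Why both the Severity and the Priority matter becomes clear when the Notification criteria are written out. The sketch below is just an illustration in Python: the `Alert` class, its field names and the alert name are made up and are not the SCOM SDK. Only the three criteria (New, Critical, Priority High) come from my setup.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    # Hypothetical, simplified view of an alert; not the SCOM object model.
    name: str
    severity: str          # "Critical", "Warning" or "Information"
    priority: str          # "High", "Medium" or "Low"
    resolution_state: str  # "New", "Closed", ...


def should_notify(alert: Alert) -> bool:
    """Criteria used here: only New Alerts that are Critical with Priority High."""
    return (
        alert.resolution_state == "New"
        and alert.severity == "Critical"
        and alert.priority == "High"
    )


before = Alert("Network device not responding", "Warning", "Medium", "New")
after = Alert("Network device not responding", "Critical", "High", "New")

print(should_notify(before))  # a Warning never reaches the Notification
print(should_notify(after))   # Critical with Priority High does
```

So without the Overrides on the Dependency Monitor, even a fast State change would never have resulted in a Notification.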

Time to test it
Next, a new network device was added to SCOM to be monitored. This was a test network device. Once SCOM was monitoring it, the network cable was unplugged.

And YES! After a minute SCOM raised a Critical Alert with priority High. This Alert was neatly pushed out by the Notification Model as well. Awesome!

Recap
When you’re running SCOM 2012 R2 and are monitoring network devices, check the settings of the Monitors and verify that they match the requirements of your organization. Chances are you’ll have to make some modifications.

2 comments:

Anonymous said...

Still think they should introduce an extra monitoring interval into the schema. One interval for when the device is operating normally, and one for when the device is in a bad state. Have the bad state interval much more aggressive, so that it detects faster when the device is back up and running. Would save on those 'going looking for a problem that's already fixed itself' moments.

And are we ever going to get dependencies built in? When Microsoft first announced they had licensed the EMC Smarts technology back in 2007 they were going on about being able to use EMC Smarts' built in root cause analysis abilities to pinpoint the cause of the problem amongst all of the clutter of red herring server and other infrastructure alerts. But here in 2013 in 2012 R2 land we still get bombarded with dozens of alerts every time we lose an internet connection to a remote location.

deroum said...

Thanks for doing this work for us all. Saved us some time and trouble.