Monday, March 4, 2013

Be Careful When Configuring Agent Heartbeat Interval. Otherwise Tons Of EventID 20022 & 20021…

Bumped into this situation at a customer's location. Every hour the OpsMgr event logs of the RMS and MS servers were flooded with EventID 20022, source OpsMgr Connector:

‘…The health service {<GUID>} running on host <FQDN NAME AGENT> and serving management group <MG NAME> with id {<GUID>} is not heartbeating…’

And seconds later the same logs were flooded again with EventID 20021, source OpsMgr Connector:

‘…The health service {<GUID>} running on host <FQDN NAME AGENT> and serving management group <MG NAME> with id {<GUID>} is available through the server <FQDN MS SERVER>…’

This went on and on for months. The customer had no clue what caused it. The network had already been scrutinized for issues, but nothing was found.

So it was time for an investigation.

Cause
This MG is monitoring many Windows servers and the MG itself resides on a separate VLAN. First I suspected the MG was overloaded, monitoring too many Windows Servers with too few MS servers. But that wasn’t the cause. The MS servers were pulling their weight but were operating within their limits. The RMS was in good shape as well.

The SQL servers were also in good shape, so no issues there either. Then it was time to look at the connections, especially from the VLAN where the SCOM MG resides to all monitored Windows Servers. But the network specialists assured me all was well on that part. No connectivity issues at all.

So back to SCOM it was. Time to look at the heartbeat interval settings for the Agent (Administration > Settings > Agent > Heartbeat). And this surprised me: the interval had been lowered from the default of 60 seconds to 20 seconds. This setting is enforced for all monitored Windows Servers…

On the server side the heartbeat setting (Administration > Settings > Server > Heartbeat) was still at the default of 3 missed heartbeats. But combined (3 x 20 seconds), that leaves a window of only 60 seconds in which an Agent is allowed not to communicate with the MS servers before an Alert (EventID 20022) is raised.
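To make the arithmetic explicit, here is a minimal sketch in Python. The function name is made up for illustration and is not part of SCOM. It simply multiplies the Agent heartbeat interval by the number of missed heartbeats the server tolerates:

```python
def heartbeat_tolerance_seconds(interval_seconds: int, missed_heartbeats: int) -> int:
    """Seconds an Agent may stay silent before it is flagged as not heartbeating.

    Hypothetical helper: interval (Agent setting) x missed heartbeats (Server setting).
    """
    return interval_seconds * missed_heartbeats


# Default settings: 60 second interval, 3 missed heartbeats.
print(heartbeat_tolerance_seconds(60, 3))  # 180

# The customer's settings: 20 second interval, 3 missed heartbeats.
print(heartbeat_tolerance_seconds(20, 3))  # 60
```

With the defaults an Agent gets 3 minutes of slack; with the lowered interval only 1 minute is left.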

So whenever there is a small hiccup on the network, chances are the event logs of the RMS and MS servers will be flooded. First by EventID 20022, telling you there is no communication, and a second later by tons of EventID 20021, telling you all is OK again.

Why
Of course there is always a reason for it, so I asked why this setting was modified. The customer told me they had some issues with a certain set of servers.

These servers might reboot and come back online very fast since they’re VMs on very good virtualization hosts. Yet the customer told me they needed to know whether a server had rebooted, so they lowered the heartbeat interval from 60 seconds to 20 seconds. Simply because the default heartbeat interval (60 seconds), combined with the server setting (3 missed heartbeats), gave a window that was too long for these VMs. They simply rebooted and were fully functional again within that time range of 3 minutes!

Solution
Now the customer realized this wasn’t the way to go since it caused other unmentioned side issues as well. So they asked me what to do instead.

This is what I advised in order to solve the flooding of the event logs:

  1. Set the heartbeat interval back to the default setting, which is 60 seconds;
  2. Monitor the OpsMgr event logs of the RMS and MS servers for the rest of the day in order to verify that no more flooding takes place.

This part will take care of the flooding issue. And yes, after this modification the flooding didn’t happen anymore. Which is much better.

Part two of my advice, in order to be alerted about (and even report on!) the problematic set of servers which reboot too often:

  1. Create a Group containing the set of servers for which they want to know about reboots;
  2. Write a Monitor or Rule (depending on what functionality they really want) in order to catch the reboot of that set of Servers, targeted against that Group (create the Rule/Monitor, targeted against a general Group like Windows Servers for instance, disable it and enable it through an override targeted against the Group created in Step 1).

This way SCOM is used for what it’s meant for, and when using a Rule the data can be piped into the Data Warehouse and used for a customized Report, telling the customer which servers rebooted and when during a certain time frame.

EventID’s you can track are (all to be found in the System Log):

  • EventID 6009 (<WINDOWS VERSION> Multiprocessor Free);
  • EventID 6005 (The Event log service was started).
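
The filtering idea behind such a Rule can be sketched outside SCOM in a few lines of Python. This is only an illustration, not how SCOM implements Rules or Groups: the record type, the function name, the host names and the sample data are all made up, and in practice the Rule/Monitor plus the override on the Group do this work for you.

```python
from dataclasses import dataclass

# Event IDs from the System log that indicate a server has started up again.
REBOOT_EVENT_IDS = {6005, 6009}


@dataclass(frozen=True)
class SystemLogEvent:
    """Hypothetical, simplified view of one System log record."""
    host: str
    event_id: int
    logged_at: str  # timestamp as text, kept simple for this sketch


def reboot_events(events, group_members):
    """Return the reboot-related events logged by servers in the group."""
    members = {name.lower() for name in group_members}
    return [
        e for e in events
        if e.event_id in REBOOT_EVENT_IDS and e.host.lower() in members
    ]


# Made-up sample data: only vm01 is a member of the watched group.
sample = [
    SystemLogEvent("vm01.example.local", 6005, "2013-03-04 10:15:02"),
    SystemLogEvent("vm01.example.local", 7036, "2013-03-04 10:15:09"),
    SystemLogEvent("web07.example.local", 6009, "2013-03-04 11:02:41"),
]
for event in reboot_events(sample, ["VM01.example.local"]):
    print(event.host, event.event_id, event.logged_at)
```

Only the 6005 event from vm01 is reported: the 7036 event is not a reboot indicator, and web07 is not in the group.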

Recap
Whenever there are some Windows Servers which require special attention because they reboot too many times / too fast, don’t use the Agent Heartbeat Interval option to identify those servers.

This setting will affect ALL monitored Windows Servers and is most likely to result in unwanted side effects. It is better to create a Rule/Monitor aimed at catching specific EventIDs telling you the server rebooted.

If you do want to modify the Agent Heartbeat Interval setting, please do so on a per Windows Server basis (Administration > Device Management > Agent Managed > select the Agent you want to modify, double click it > first tab Heartbeat > select option Override global agent settings > now you can modify Heartbeat Interval (seconds)).

Another piece of advice:
It’s Best Practice not to lower the Agent Heartbeat interval since 60 seconds is already tight enough. Many times it’s set a bit higher (in increments of 10 seconds) on a PER AGENT basis when those Agents reside in a (remote) part of the network which has some latency issues.

Hopefully this post helps you prevent the flooding of the OpsMgr event logs with EventIDs 20022 and 20021.

1 comment:

Wilson328 said...

You might want to configure some additional logic to distinguish between user-initiated reboots and reboots caused by the OS crashing.

For example, I have a rule that I created that looks for:
eventID=1074
source=user

...which is a user-initiated reboot. I send that to a command channel where I have a powershell script configured that places that system into maintenance mode for 15 minutes.