The Key Parameters Explained
Why do we monitor so many key parameters? Below is an explanation of each of the most commonly monitored parameters.
If free space on a disk drops below an agreed threshold, we can either clean up the disk or increase its size before the shortage causes a problem. Once free space is back above the threshold, we continue to monitor the disk for 5 working days before closing the ticket.
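The disk check and the 5-working-day follow-up described above can be sketched in a few lines of Python. This is a minimal illustration, not the monitoring platform's actual logic: the 10% threshold, the function names, and the decision to count only Monday to Friday (ignoring public holidays) are all assumptions made for the example.

```python
import shutil
from datetime import date, timedelta


def free_fraction(path):
    """Return the fraction of the disk holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def below_threshold(free, total, threshold=0.10):
    """True if free space is under the agreed threshold (assumed 10%)."""
    return free / total < threshold


def add_working_days(start, days):
    """Return the date `days` working days after `start`.

    Working days are assumed to be Monday to Friday; public holidays
    are not taken into account in this sketch.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            remaining -= 1
    return current


if __name__ == "__main__":
    print(f"Free space here: {free_fraction('.'):.1%}")
    # Free space recovered on Friday 5 January 2024, so the ticket
    # would be eligible to close on Friday 12 January 2024.
    print(add_working_days(date(2024, 1, 5), 5))
```

In practice the threshold would be agreed per disk or per client, so it is passed in as a parameter here and not hard-coded.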
If a server goes offline or reports an unexpected shutdown event, this is treated as critical and dealt with immediately.
We are alerted if a server (or other device such as a SAN, DAS or NAS) reports a failure or predicted failure of a hardware component, e.g. a hard disk, power supply or fan, or reports that the ambient temperature is too high.
We are alerted if the UPS (Uninterruptible Power Supply) detects that the incoming power supply has failed, if connectivity to the software agent on the servers is lost, or if the battery needs replacing.
Where there are multiple internet connections with automated failover between them managed by the firewall, or multiple firewalls set up in High Availability mode, we receive alerts if an internet line or device goes down.
Antivirus software can be set to deploy automatically to any new machine that joins the network. We receive notifications in a central console if any endpoint reports an infection or application error.
Servers are patched monthly with all critical updates, with automated reboots taking place overnight on agreed dates and a prescribed schedule to avoid disruption.
Windows Event Logs
Understanding what does and does not indicate a genuine problem in the Windows Event Logs has a reputation for being something of a black art.
Over the last decade we have built up a library of the specific combinations of events and scenarios that actually require action. Our monitoring systems therefore flag only events that indicate failures or potential failures, and present them colour-coded by severity so that our helpdesk can prioritise and take appropriate action.
Our helpdesk and monitoring teams are closely integrated, so that any new events of interest found as new operating systems, applications and features are released can be added to the standard template of alerts applied to all our managed servers.
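As a rough illustration of the idea, the sketch below filters a stream of events against a small rule table and attaches a colour to each match. Everything here is hypothetical: the `Event` class, the `RULES` and `COLOURS` tables, and the severity assigned to each entry are assumptions for the example, not our actual library. The three entries are well-known Windows events chosen for illustration, and the sketch matches single events only, whereas real rules can depend on combinations of events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str
    event_id: int
    message: str


# Hypothetical rule table: (event source, event ID) -> severity.
RULES = {
    ("EventLog", 6008): "critical",                       # unexpected shutdown
    ("Microsoft-Windows-Kernel-Power", 41): "critical",   # unclean reboot
    ("disk", 7): "warning",                               # bad block on disk
}

# Hypothetical colour scheme for the helpdesk view.
COLOURS = {"critical": "red", "warning": "amber"}
PRIORITY = {"critical": 0, "warning": 1}


def triage(events):
    """Keep only events matching a rule; return them most severe first."""
    flagged = []
    for event in events:
        severity = RULES.get((event.source, event.event_id))
        if severity is None:
            continue  # no rule matches: treated as noise and not flagged
        flagged.append((COLOURS[severity], severity, event))
    flagged.sort(key=lambda item: PRIORITY[item[1]])
    return flagged


if __name__ == "__main__":
    sample = [
        Event("ExampleApp", 1000, "Routine informational message"),
        Event("disk", 7, "The device has a bad block"),
        Event("EventLog", 6008, "The previous system shutdown was unexpected"),
    ]
    for colour, severity, event in triage(sample):
        print(colour, severity, event.source, event.event_id)
```

Keeping the rules in a plain table means a newly identified event of interest can be added as a single entry without changing the filtering logic.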
We are alerted to any mail queues over a certain threshold and to delivery issues at the smarthost level.