Configuring auto-baselining

Auto-baselining provides an easier configuration of alerting thresholds. You can configure the system to automatically determine threshold values based on historical values for selected metrics.

– Example: It will no longer be necessary to create separate Alert Profiles to avoid generating alerts for devices that exhibit non-default metric values.

– Example: Devices that always have high CPU utilization will not generate alerts.

– Example: An interface that is normally down will generate an alert if it is now up.

Auto-baselining features

Default Thresholds can be configured to use a specific algorithm for determining violations.

– Static-only: Metric values are compared to the explicitly specified threshold values.

– Dynamic-only: Threshold values are automatically determined based on past values for the metric.

– Static or Dynamic: If either the static or the dynamic threshold is violated, an alert is issued.

– Static and Dynamic: If both the static and the dynamic thresholds are violated, an alert is issued.
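The four behaviors listed above can be read as a small decision rule over two flags: whether the static threshold was violated and whether the dynamic threshold was violated. The following is a minimal sketch of that reading, not NetIM code. The `Algorithm` enum and the `should_alert` function are hypothetical names, and the label "Dynamic-only" is assumed by analogy with "Static-only".

```python
from enum import Enum


class Algorithm(Enum):
    # "Dynamic-only" is an assumed label, chosen by analogy with "Static-only".
    STATIC_ONLY = "Static-only"
    DYNAMIC_ONLY = "Dynamic-only"
    STATIC_OR_DYNAMIC = "Static or Dynamic"
    STATIC_AND_DYNAMIC = "Static and Dynamic"


def should_alert(algorithm: Algorithm, static_violated: bool, dynamic_violated: bool) -> bool:
    """Combine the two violation flags according to the selected algorithm."""
    if algorithm is Algorithm.STATIC_ONLY:
        return static_violated
    if algorithm is Algorithm.DYNAMIC_ONLY:
        return dynamic_violated
    if algorithm is Algorithm.STATIC_OR_DYNAMIC:
        return static_violated or dynamic_violated
    # Remaining case: Static and Dynamic, where both must be violated.
    return static_violated and dynamic_violated
```

For instance, a metric that violates only its static threshold would alert under "Static or Dynamic" but not under "Static and Dynamic".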

Individual thresholds in the non-default profiles can also be configured to use a specific algorithm.

Detailed information about the reasons for an alert has been added to the Alerts page.

Significant changes have been made to the default thresholding and alert profile editing to support auto-baselining and both editors will use the same configuration rules.

While editing a threshold, the list of allowed operators changes depending on which algorithm is selected. The editors for the explicit threshold values are shown only if the selected algorithm requires them. The default algorithm is Static-only.

– Static-only: Metric values will be compared to the values you specified for minor, major, and critical using the specified operator.

– For thresholding of enumerated type metrics, only the == or != operators should be used. Other operators are included only for backward compatibility.

– Dynamic-only: Metric values will be compared to the automatically computed thresholds using the specified operator.

The “!=” operator is a special case that will generate alerts if the metric value is above or below the expected value.

The severity of the alerts will be based on how far the metric value is from the expected value.

– For thresholding of enumerated type metrics, only the != operator is allowed and only “minor” alerts will be generated.

– When the algorithm is set to “Static or Dynamic” or “Static and Dynamic,” the specified operator will be used for both types of comparisons:

Static: The specified operator will be used for comparison to the thresholds you specified for minor, major, and critical.

Dynamic: The specified operator will be used for comparison to the automatically computed thresholds.

– For “Static or Dynamic” and “Static and Dynamic” thresholding of enumerated type metrics:

Dynamic: The != operator will always be used regardless of which operator is specified.

Only the == or != operators should be specified. Other operators are included only for backward compatibility.
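The operator rules above can be summarized as a lookup from (algorithm, specified operator, metric type) to the operator used for each comparison. The sketch below is one possible reading of those rules, not the product's implementation. The function name `effective_operators`, the warning text, and the "Dynamic-only" label are assumptions made for illustration.

```python
ALGORITHMS = ("Static-only", "Dynamic-only", "Static or Dynamic", "Static and Dynamic")
ENUM_OPERATORS = ("==", "!=")


def effective_operators(algorithm: str, operator: str, enumerated: bool):
    """Return (static_operator, dynamic_operator, warnings).

    An operator of None means that type of comparison is not performed.
    """
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unknown algorithm: {algorithm}")

    uses_static = algorithm != "Dynamic-only"
    uses_dynamic = algorithm != "Static-only"
    static_op = operator if uses_static else None
    dynamic_op = operator if uses_dynamic else None
    warnings = []

    if enumerated:
        if algorithm == "Dynamic-only" and operator != "!=":
            # Dynamic thresholding of enumerated metrics allows only "!=".
            raise ValueError("only != is allowed for enumerated metrics")
        if uses_dynamic:
            # In the combined algorithms, the dynamic side always uses "!=".
            dynamic_op = "!="
        if uses_static and operator not in ENUM_OPERATORS:
            # Other operators are accepted only for backward compatibility.
            warnings.append("use == or != for enumerated metrics")

    return static_op, dynamic_op, warnings
```

Under this reading, `effective_operators("Static and Dynamic", "==", True)` yields `==` for the static comparison and `!=` for the dynamic comparison.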

The threshold configuration settings allow you to implement various strategies for generating alerts. Each algorithm has specific types of conditions that it can detect, and you can select different algorithms on a metric-by-metric basis to best meet your goals. Here are some general guidelines:

– Static-only: Suitable for alerting on tangible resource depletion and persistent conditions.

– Example: Detecting when disk utilization increases to a high level or stays at a high level.

– Dynamic-only: Suitable for alerting on any anomaly while disregarding persistent conditions.

– Example: Detecting interface status changes while ignoring interfaces that are always down.

– Static or Dynamic: Suitable for alerting on even minor anomalies while also alerting on persistent conditions.

– Example: Detecting moderate changes in interface utilization and detecting interface utilization that is currently high.

– Static and Dynamic: Suitable for alerting on major anomalies while ignoring persistent conditions.

– Example: Detecting bursts in CPU utilization to a high level but ignoring CPU utilization if it is always high.

In general, the “Static and Dynamic” algorithm will lead to the fewest alerts since both the static and dynamic thresholds need to be exceeded for an alert to be issued. A list of suggested settings for each metric is given in the next section.
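To illustrate why the combined "and" algorithm tends to produce the fewest alerts, the sketch below counts alerts for a few made-up CPU readings under each algorithm. Everything in it is hypothetical: the sample values, the static threshold of 80, and the rule that a dynamic violation means exceeding the expected value by more than a fixed margin. NetIM's actual dynamic computation is not described here.

```python
STATIC_THRESHOLD = 80  # hypothetical static threshold (percent)
DYNAMIC_MARGIN = 15    # hypothetical allowed deviation above the expected value

# (description, current value, expected value from history); all values invented
SAMPLES = [
    ("always-busy CPU", 95, 94),
    ("burst on a quiet CPU", 90, 20),
    ("moderate rise on a quiet CPU", 50, 20),
    ("normal quiet CPU", 22, 20),
]


def alert_counts(samples):
    """Count how many samples would alert under each algorithm."""
    counts = {
        "Static-only": 0,
        "Dynamic-only": 0,
        "Static or Dynamic": 0,
        "Static and Dynamic": 0,
    }
    for _description, value, expected in samples:
        static = value > STATIC_THRESHOLD
        dynamic = value > expected + DYNAMIC_MARGIN
        counts["Static-only"] += int(static)
        counts["Dynamic-only"] += int(dynamic)
        counts["Static or Dynamic"] += int(static or dynamic)
        counts["Static and Dynamic"] += int(static and dynamic)
    return counts


print(alert_counts(SAMPLES))
```

With these invented samples, only the burst on a normally quiet CPU violates both thresholds, so "Static and Dynamic" raises one alert while "Static or Dynamic" raises three.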

To gain experience with how auto-baselining works in your environment, we recommend using a trial configuration first. After verifying that the new configuration works as expected, you can put the trial configuration into full operation.

2. Name it appropriately and make it Active.

3. Change the algorithm selection as desired to use auto-baselining. The following figure shows the recommended algorithms, operators, and thresholds for each metric.

4. Add all the devices and groups to the new profile.

5. Optionally, disable the notifications for the auto-baselining profile.

6. Save the changes to the auto-baselining profile.

7. Compare alerts generated by the auto-baselining profile with the alerts generated by the default alert profile over the next few days.

– There will be fewer alerts for the auto-baselining profile since it does not generate alerts for persistent conditions, such as a device or interface that is always down.

– To filter the list of Active Alerts so that it displays only the alerts generated by the auto-baselining profile, click the Total Alerts value for the auto-baselining profile (58 in the screenshot above).

– Examine the Reason field of the Active Alerts table to see explanations of why alerts were issued.

– Clicking the Metric value of any row in the table will show values of the metric at the time that the threshold violation was detected. In the example below, a device that was normally down is now up and an alert was issued.

– The Historical Alerts Viewer can also be used for verifying the alerts generated by the new profile.

8. Because enabling auto-baselining puts added load on the system, verify that the system is still functioning properly. The simplest way to check whether auto-baselining is overloading the system is to view the Kafka lag for the “poller-process-data-new-1” topic. This can be found on the NetIM Infrastructure page. Verify that the lag values are not increasing over time.

9. After verifying that the system is performing well with the auto-baselining profile, configure it to be the active profile:

– Configure the auto-baselining profile as the health profile for the system.

10. Verify that the system is still performing well.
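The lag check in step 8 amounts to asking whether a series of lag readings trends upward. The sketch below assumes you have noted several lag values by hand from the NetIM Infrastructure page at regular intervals; it does not call any NetIM or Kafka API. The function names and the `tolerance` parameter are hypothetical.

```python
from typing import Sequence


def lag_slope(samples: Sequence[float]) -> float:
    """Least-squares slope of lag readings taken at evenly spaced intervals."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two lag readings")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def lag_is_growing(samples: Sequence[float], tolerance: float = 0.0) -> bool:
    """True if the lag grows by more than `tolerance` per interval on average."""
    return lag_slope(samples) > tolerance
```

A series such as `[10, 12, 9, 11, 10]` fluctuates without growing, whereas `[10, 20, 35, 50, 80]` shows the steadily increasing lag that would suggest the system is falling behind.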

If this is a brand-new installation, there will not be historical data to use for calculating thresholds dynamically. However, the system will get better at determining these thresholds as time goes by and more historical data becomes available.
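The dependence on history can be pictured with a toy baseline. The sketch below is purely illustrative and is not NetIM's algorithm, which is not described here. It assumes the expected value is the mean of past readings, that distance is measured in standard deviations, and that severity bands start at 2, 3, and 4 deviations. `MIN_SAMPLES`, the band limits, and the function name are all hypothetical.

```python
import statistics
from typing import Optional, Sequence

MIN_SAMPLES = 5  # hypothetical minimum history before a baseline is trusted
SEVERITY_BANDS = (("critical", 4.0), ("major", 3.0), ("minor", 2.0))  # assumed limits


def dynamic_severity(history: Sequence[float], value: float) -> Optional[str]:
    """Return an assumed severity for `value`, or None if no alert applies."""
    if len(history) < MIN_SAMPLES:
        return None  # too little history to compute a baseline
    expected = statistics.fmean(history)
    # Floor the spread so a perfectly flat history does not divide by zero.
    spread = max(statistics.pstdev(history), 1e-9)
    # Absolute distance: readings above or below the expected value both count.
    distance = abs(value - expected) / spread
    for severity, limit in SEVERITY_BANDS:
        if distance >= limit:
            return severity
    return None
```

In this toy model, a history of only two readings yields no alert at all, while a longer history lets the same reading be graded by how far it sits from the expected value.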

NetIM has a mechanism to quickly disable all dynamic thresholding. You can use this setting if:

• Many alerts are being generated by the dynamic thresholds and you want to remove them quickly.

• The system appears to be overloaded by the additional processing that dynamic thresholding requires.

To access the setting, go to CONFIGURE > All Settings > General Settings and clear the Dynamic Thresholding check box.