Checking system status
NetIM provides a System Status page and a NetIM Infrastructure page that report detailed status for the NetIM deployment on Docker swarm.
System Status page
To work with the System Status page
1. Log in to the UI.
2. Do one of the following:
Click the LED indicator in the upper left of the title bar.
Choose Help > System Status. The System Status screen appears.
3. From this page you can check the status of Core Services and Swarm Services, download logs, download incident logs if they exist, and drill down on Swarm Services of interest for additional information.
Users who do not have admin privileges can only view status; other operations like downloading logs are grayed out.
Each service, whether core or swarm, is shown with an LED indicating its current health status. Mouse over the LED for additional status. Besides reporting service health, the System Status page also lets you aggregate and download logs.
Core service logs and core incident directories can be harvested and downloaded directly from the WebUI. For core service logs, there are two options:
Basic: just the core service logs
Include supporting files: includes the core service logs and everything under the log directory, except the incident directories
For Core Incident Directories, you choose which incident directories (if any exist) to download from a list of the 10 most recent detected core incidents.
When you click Submit, the logs are streamed to your browser, and a progress indicator shows the amount of data downloaded so far.
Swarm Service logs can be aggregated and downloaded together, or the logs of a specific service can be harvested and downloaded. In most cases you will use per-service log download, because an issue is almost always confined to a single microservice.
When downloading a swarm service log, you are asked for the time frame from which to harvest the logs (defaults to the current time) and the number of lines to download (defaults to the last 100 lines). This lets you pinpoint a moment in the past and download the logs from that point, rather than downloading all logs.
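The two download parameters described above can be pictured with a short sketch. This is only an illustration of one plausible reading (take the last N lines at or before the chosen time); it is not NetIM's implementation, and the function name `select_log_lines` and the tuple layout of `entries` are hypothetical.

```python
from datetime import datetime


def select_log_lines(entries, end_time=None, max_lines=100):
    """Return the last `max_lines` entries stamped at or before `end_time`.

    `entries` is a list of (timestamp, text) tuples sorted oldest first.
    The defaults mirror the ones described in the text: the current time
    and the last 100 lines. The selection rule itself is an assumption.
    """
    if max_lines <= 0:
        return []
    if end_time is None:
        end_time = datetime.now()
    # Keep only lines logged up to the requested point in time.
    eligible = [entry for entry in entries if entry[0] <= end_time]
    # Then keep the tail of that list.
    return eligible[-max_lines:]
```

With the defaults, the call returns the most recent 100 lines; passing an earlier `end_time` moves the window back to that moment.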
In addition to downloading and viewing swarm logs, you can also drill down to Service Metrics directly from the service as well as view service and task details. Service and Task Details are generally used only by advanced Support personnel.
NetIM Infrastructure page
The NetIM Infrastructure page allows the admin user to monitor and troubleshoot the NetIM Docker swarm or virtual deployment, and incorporates the System Status page in the Services tab.
The NetIM Infrastructure page provides visibility into the internals of NetIM and the platform on which you are hosting NetIM, with curated views into metrics for both. By default, NetIM stores 15 days of 30-second metrics for viewing current and past metric state. You control the time range as you would on any other NetIM page, using the time selector in the upper-right side of the page.
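To get a feel for what the default retention means, the arithmetic can be worked through as below. The function name `samples_retained` is hypothetical, and the result is a count of data points per metric series, not a statement about disk usage.

```python
def samples_retained(days=15, interval_seconds=30):
    """Number of data points kept per metric series at the given retention."""
    seconds_per_day = 24 * 60 * 60
    samples_per_day = seconds_per_day // interval_seconds  # 2880 at 30 s
    return days * samples_per_day


print(samples_retained())  # 15 days * 2880 samples/day = 43200
```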
The NetIM Infrastructure page and related tabs provide a wealth of information for monitoring your NetIM deployment. The tabs are arranged in order of increasing detail and complexity. In general, these pages assist with monitoring the health of the underlying infrastructure supporting your NetIM implementation and provide a high-level view into some of the key internals of the NetIM application.
You can view a concise table of the NetIM swarm nodes, including their names, roles, IP addresses, availability, and configuration. You can also view the uptime for each of the nodes in your NetIM deployment.
Some rules of thumb to consider when viewing NetIM Infrastructure tabs include:
Node Summary:
NetIM Manager should consistently be the most heavily used component of your NetIM implementation. High utilization of memory and CPU on the Manager is expected. If your internal policies require lower utilization of memory and CPU, consider increasing the CPU and memory available to NetIM Manager.
Node Details:
“Holes” or gaps in system-level metric data in the Node Details tab likely indicate snapshots and/or insufficient resources available to harvest metrics. Using the metric graphs, you can pinpoint when the gap occurred and raise it with your VM team.
“Docker Daemon Container Actions” should generally be a flat-line unless you are actively starting/stopping services.
“System Load” should ideally trend close to the number of cores/CPUs allocated to the VM. Because NetIM Manager is expected to be the most heavily loaded node, you may want to provide more CPU resources to NetIM Manager.
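Two of the Node Details rules of thumb above lend themselves to a quick check outside the UI. The sketch below assumes you have exported sample timestamps (in seconds) for a metric and know the node's load average and core count; the function names and the 1.5x tolerance are hypothetical choices, not NetIM settings.

```python
def find_gaps(timestamps, interval=30, tolerance=1.5):
    """Return (previous, current) timestamp pairs that are too far apart.

    `timestamps` are sample times in seconds, oldest first. A pair counts
    as a gap when the spacing exceeds `tolerance` times the expected
    collection interval (30 seconds by default, matching the text).
    """
    gaps = []
    for previous, current in zip(timestamps, timestamps[1:]):
        if current - previous > interval * tolerance:
            gaps.append((previous, current))
    return gaps


def load_per_core(load_average, cores):
    """Ratio of system load to allocated cores; about 1.0 matches the guidance."""
    return load_average / cores
```

For example, `find_gaps([0, 30, 60, 180, 210])` reports the hole between 60 and 180 seconds. On a Unix-like host, `os.getloadavg()` and `os.cpu_count()` from the Python standard library could supply the inputs for `load_per_core`.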
Key Metrics:
Kafka “Messages In” and “Messages Consumed” should generally be equal, and their graphs should look roughly identical.
Kafka Lag may go up and down and even report as negative at times. Lag should never increase steadily over a sustained period; it should trend toward zero.
“Tasks dropped” should generally be zero. A significant number of dropped tasks indicates a need to scale the pollers by increasing workers.
The Log Events (Error/Warning Count) metric can indicate where to look and when to review logs.
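The Kafka rules of thumb above can be expressed as two small checks on exported metric samples. This is a minimal sketch: the function names, the window of 10 consecutive increases, and the idea of comparing totals are assumptions made for illustration, not thresholds defined by NetIM.

```python
def lag_keeps_growing(lag_samples, min_run=10):
    """True if each of the last `min_run` steps is a strict increase.

    Lag that rises and falls (or dips negative) returns False; only a
    sustained climb, the pattern the text warns about, returns True.
    """
    if len(lag_samples) < min_run + 1:
        return False
    recent = lag_samples[-(min_run + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


def in_out_mismatch(messages_in, messages_consumed):
    """Relative difference between total messages in and total consumed."""
    total_in = sum(messages_in)
    if total_in == 0:
        return 0.0
    return abs(total_in - sum(messages_consumed)) / total_in
```

A mismatch near zero corresponds to the two graphs looking roughly identical; a value that keeps growing alongside a rising lag suggests consumers are falling behind.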
To work with the NetIM Infrastructure page
1. Log in to the UI as admin.
2. Choose Configure > All Settings > Administer > NetIM Infrastructure. The NetIM Infrastructure page appears.
3. From the NetIM Infrastructure page you can view the following information for every NetIM infrastructure node:
Node Summary tab (default)
The Node Summary tab provides a high-level semi-circle chart display of the following, with drill-down to the Node Details view for a specific node:
CPU Busy
RAM Used
SWAP Used
Data File System Used
Root File System Used
Node Details tab
The Node Details tab displays the same semi-circle charts as the Node Summary tab, along with a Node Details link for each node in the swarm. Clicking the link opens more detailed line charts for that node, with mouse-over information, filters in the footers, and show/hide options in the action menus. These charts provide system-level summary and detailed metrics per node in your NetIM implementation:
CPU
Memory
Network Traffic
Disk Space Used
CPU Usage by Container
Memory Usage by Container
Received Network Traffic by Container
Transmitted Network Traffic by Container
File System Reads by Container
File System Writes by Container
Disk IOs
Docker Daemon Container Actions
Processes Status
System Load
You can also display the Service Tasks by clicking the Show/Hide Service Task link in the upper right of the page and then drill down further on the Service Task of interest.
Alerts tab
The Alerts tab displays active alerts related to the NetIM infrastructure.
Key Metrics tab
The Key Metrics tab displays high-level metrics related to key NetIM operations, including Kafka, polling jobs, and indications of whether any services are generating error or warning messages in logs. The charts track the following key metrics, with mouse-over information, filters in the footers, and show/hide options in the action menus:
Kafka Lag
Messages In
Messages Consumed
Log Events
Tasks Dropped
Service Metrics tab
The Service Metrics tab displays detailed metrics that can be filtered to a specific NetIM service for targeted monitoring. The charts track the following service metrics, with mouse-over information, filters in the footers, and show/hide options in the action menus:
I/O Rate
I/O Errors
I/O Duration
CPU Usage
I/O Thread Utilization
JVM Heap
JVM Non-Heap
Threads
Thread States
Log Events
File Descriptors
Garbage Collections
Garbage Collection Pause Duration
Elastic Metrics tab
The Elastic Metrics tab displays charts tracking the following elastic metrics, with mouse-over information, filters in the footers, and show/hide options in the action menus:
Cluster Stats
CPU Usage
JVM Heap Used
JVM Heap Committed
Search Latencies
Indexing Latencies
Documents
Index Data Size
Open Files
Disk Free
Services tab
The Services tab is provided as a convenience so that you can remain within the NetIM Infrastructure page for access to service status and log downloads. You can see whether any of the swarm services have encountered incidents. A column indicates whether out-of-memory incidents were detected, and you can download the HPROF file to provide to Riverbed Support for analysis.
This tab provides the same view and features as the NetIM System Status page. For more information, see System Status page.
Containers tab
The Containers tab provides a detailed list of containers.
Swarm Details tab
The Swarm Details tab gives Riverbed Support a direct way to view and inspect the NetIM Docker swarm images, Docker volumes, and Docker networks associated with your deployment, so that troubleshooting can stay within the UI rather than relying on external tools like Portainer.
Advanced tab
The Advanced tab supports troubleshooting problems and consists of the following three tools:
Model Mapping Troubleshooter: Allows you to drill down on issues and review suggested remediations.
Prometheus PromQL Query: Allows you to use PromQL (Prometheus Query Language) to select and aggregate time series data in real time. For more information, see the Prometheus documentation.
Swarm API Request and Response: Allows you to create a variety of API calls to test the swarm.
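As an illustration of the kind of expression the Prometheus PromQL Query tool accepts, the sketch below composes a PromQL query and shows how the same expression would be encoded for the instant-query endpoint (`/api/v1/query`) of a standard Prometheus server. The metric name `example_requests_total` and the host `prometheus.example` are hypothetical, and this guide does not state whether NetIM exposes that HTTP endpoint directly; in the UI you would type only the expression.

```python
from urllib.parse import urlencode

# Hypothetical PromQL expression: per-instance request rate over 5 minutes.
PROMQL = "sum by (instance) (rate(example_requests_total[5m]))"


def build_instant_query_url(base_url, promql):
    """Build a URL for the standard Prometheus instant-query HTTP API."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


print(build_instant_query_url("http://prometheus.example:9090", PROMQL))
```

The `rate(...)` function selects a time series over a range and the `sum by (instance)` aggregation groups the result, which matches the select-and-aggregate use described for the tool.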