Alert Reference

Objective

This document provides reference information on the types of alerts supported by Volterra. Use it to understand the details of each alert and the action you need to perform.

Key Points

The following apply to Volterra alerts:

  • There is no separate alert for health score. This is because the health score is composed of multiple components. For example, the health score of a site is computed from the data-plane connection status to the Regional Edge (RE) sites, the control-plane connection status, and the K8s API server status in the site. Individual alerts are defined for each of these conditions, but no alert is available for the health score itself.

Note: You can obtain the health score of a site in VoltConsole. You can also obtain it using the API https://www.volterra.io/docs/api/graph-connectivity#operation/ves.io.schema.graph.connectivity.CustomAPI.NodeQuery with "field_selector":{"healthscore":{"types":["HEALTHSCORE_OVERALL"]}}.
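
As a rough illustration of the note above, the following Python sketch builds a request body containing the health score selector. Only the "field_selector" fragment comes from this document. The helper name build_node_query_body and its extra_fields parameter are hypothetical, and any other fields, the endpoint path, and the authentication that the NodeQuery API requires must be taken from the API reference.

```python
import json


def build_node_query_body(extra_fields=None):
    """Return a request body that selects the overall health score.

    extra_fields is a hypothetical hook for whatever other fields the
    NodeQuery API requires; this document does not list them.
    """
    body = dict(extra_fields or {})
    # This fragment is the only part taken from the document.
    body["field_selector"] = {
        "healthscore": {"types": ["HEALTHSCORE_OVERALL"]}
    }
    return body


if __name__ == "__main__":
    # Print the JSON that would be sent as the request body.
    print(json.dumps(build_node_query_body(), indent=2))
```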

  • The amount of time before alert generation is not the same for all alerts. The duration is determined by the severity of the alert. For example, an alert is raised as soon as the tunnel connection to the RE goes down, whereas a health check alert for a service is raised only if the condition persists for 10 minutes. This keeps the alert volume at a manageable level and avoids generating alerts on temporary or transient failure conditions.
  • Changing the threshold for alerts is not supported.
  • Volterra does not support defining new alerts using an API. However, if the existing alerts do not satisfy your requirements, you can create a support request for a new alert in VoltConsole.
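
The hold duration described in the list above can be pictured with a small Python sketch. This is not Volterra's implementation. The HoldTimer class is a hypothetical illustration of the general idea that a condition must persist for a fixed time before an alert is raised. The two durations (0 seconds for the tunnel, 600 seconds for the service health check) come from the example in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HoldTimer:
    """Raise an alert only after a condition has persisted for hold_seconds."""

    hold_seconds: float
    _bad_since: Optional[float] = None

    def update(self, condition_bad: bool, now: float) -> bool:
        """Record one observation; return True if the alert should be raised."""
        if not condition_bad:
            # A recovery resets the timer, so transient failures never alert.
            self._bad_since = None
            return False
        if self._bad_since is None:
            self._bad_since = now
        return now - self._bad_since >= self.hold_seconds


tunnel_down = HoldTimer(hold_seconds=0)          # raised immediately
healthcheck_failed = HoldTimer(hold_seconds=600)  # raised after 10 minutes
```

With these timers, a health check that fails for five minutes and then recovers produces no alert, while a tunnel failure is reported on the first observation.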

Alerts & Descriptions

The following table presents alerts and associated details such as group, type, severity, and associated actions.

Alert | Name | Type | Group | Severity | Description | Action
CaptchaChallengeFailure | CAPTCHA Challenge Failure | event | Security | major | CAPTCHA challenge failed. | Consider blocking the relevant users/IPs using FastACL, Network Policy or Service Policy.
ErrroRateAnomaly | Error Rate Anomaly | custom | Timeseries-Anomaly | minor | Error rate anomaly detected. | Metric looks abnormal and needs attention.
JsChallengeFailure | JS Challenge Failure | event | Security | major | JS challenge failed. | Consider blocking the relevant users/IPs using FastACL, Network Policy or Service Policy.
KubeAPILatencyHigh | K8S API Error | metric | IaaS-CaaS | minor | Kubernetes API latency at 99th percentile is too high for more than 2 seconds. Possibly an intermittent problem, which may occur during parallel application updates. | Check HW utilization of the CE site. If the problem persists for longer than an hour, contact support.
KubeCronJobRunning | K8S Job Too Long | metric | IaaS-CaaS | minor | Kubernetes CronJob has been running for more than an hour. The Job may be stuck, or it may be expected to run longer. | Check logs from the Kubernetes Pod. Contact support in case of a non-customer vK8s workload.
KubeDaemonSetRolloutStuck | K8S Daemonset Error | metric | IaaS-CaaS | minor | Kubernetes DaemonSet desired Pods are not scheduled or ready. | Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of a non-vK8s DaemonSet.
KubeDeploymentGenerationMismatch | K8S Deployment Error | metric | IaaS-CaaS | minor | Deployment generation does not match; this indicates that the Deployment has failed but has not been rolled back. | Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of a non-vK8s Deployment.
KubeJobFailed | K8S Job Failed | metric | IaaS-CaaS | minor | Kubernetes Job failed to complete in the last 2 hours. | Check Kubernetes Job and Pod status, events and logs in vK8s cluster. Contact support in case of an etcd job.
KubeMetricsMissing | Kubernetes Metrics Missing | metric | Infrastructure | critical | Essential Kubernetes metrics are missing. Other alerts may be affected as well. | Check kube-state-metrics workload status and logs if running.
KubePersistentVolumeSpaceLow | K8S PVC Error | metric | IaaS-CaaS | minor | Kubernetes PersistentVolumeClaim is running out of space. | Resize PVC or clean disk.
KubePodCPUThrottlingHigh | K8S Pod CPU Throttled | metric | IaaS-CaaS | major | Kubernetes Pod container is being throttled at its CPU limits. | Increase flavor for vK8s Deployment or StatefulSet definition. Contact support in case of a non-vK8s Pod.
KubePodContainerTooMuchMemory |  | metric | IaaS-CaaS | critical | More than 90% of allowed memory is being used by container. | Add more replicas.
KubePodCrashLooping | K8S Pod Crashing | metric | IaaS-CaaS | minor | Kubernetes Pod container is restarting often. Possible causes include exceeding the memory limit (OOM), a liveness probe failure, or a container entrypoint failure. | Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of a non-vK8s Deployment.
KubePodNotReady | K8S Pod Not Ready | metric | IaaS-CaaS | minor | Pod has been in a non-ready state for more than 10 minutes. Possible reasons include readiness probe failures, scheduling failures due to exhausted quotas, or a broken node. | Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of a non-vK8s Deployment.
KubeStatefulSetReplicasMismatch | K8S StatefulSet Error | metric | IaaS-CaaS | minor | Kubernetes StatefulSet has not matched the expected number of Pod replicas for longer than 15 minutes. | Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of a non-vK8s Deployment.
KubeVersionMismatch | K8S Internal Error | metric | Infrastructure | minor | There are different versions of Kubernetes components running. This can be caused by a failure during a Volterra Software Upgrade. | Check Volterra Software Upgrade status. Ignore if upgrade is in progress.
LoggingRetriesFailed | Log Collection Error | metric | Infrastructure | critical | Log collection has failed to forward logs to RE site for more than 15 minutes. | Check network connectivity between CE and RE site.
MaliciousUserDetected | Malicious User Detected | event | Security | major | Malicious user detected. | Consider blocking the relevant user using FastACL, Network Policy or Service Policy.
NodeAideFilesChanged | Node Error | event | Infrastructure | critical | Monitored files on filesystem were unexpectedly modified. | Use logs to verify which files were modified and why.
NodeFilesystemFilesFillingUp | Filesystem runs out of files | metric | Infrastructure | critical | Filesystem at node is predicted to run out of files within the next 8 hours. | Check disk usage at Site dashboard. Deprovision workload or add new node into site.
NodeFilesystemOutOfFiles | Node Filesystem Error | metric | Infrastructure | minor | Filesystem at node has only a few percent available inodes left. | Check disk usage at Site dashboard. Deprovision workload or add new node into site. Do disk resize in case of cloud CE. Contact support if the problem persists.
NodeFilesystemSpaceFillingUp | Node Filesystem Error | metric | Infrastructure | minor | Filesystem at node is predicted to run out of space within the next 24 hours. | Check disk usage at Site dashboard. Deprovision workload or add new node into site. Do disk resize in case of cloud CE.
NodeLoadHigh | Node Load High | metric | Infrastructure | minor | Node load has been higher than 1 per CPU for more than 10 minutes. | Add new node into site or deprovision workload.
NodeNicMgmtDegraded | Node NIC Error | event | Infrastructure | critical | Management NIC configuration issues detected on node. | Check the network connectivity.
NodeNicTxTimeout | Node NIC Error | event | Infrastructure | critical | Node network TX timeouts detected. | Check the network connectivity.
NodeNotReady | K8S Node Error | metric | Infrastructure | critical | Site node is down. Pods cannot be scheduled or deprovisioned since node is not responding. | Check Node and HW status in console UI. Reboot node. If the problem persists for longer than 1 hour, contact support.
NodeOOMKilledProcess | Node Error | event | Infrastructure | critical | Process was terminated by Kubernetes memory limit. | Increase memory limits for the failing workload.
NodeTooManyPods | K8S Node Error | metric | Infrastructure | minor | Number of running pods is near the maximum. | Add new node into site or deprovision workload.
NodeUSBDeviceConnected | USB Device Detected | event | Infrastructure | major | New USB device connected to the node. | No action required.
NodeUSBDeviceDisconnected | USB Device Disconnected | event | Infrastructure | major | USB device disconnected from the node. | No action required.
RequestRateAnomaly | Request Rate Anomaly | custom | Timeseries-Anomaly | minor | Request rate anomaly detected. | Metric looks abnormal and needs attention.
RequestThroughputAnomaly | Request Throughput Anomaly | custom | Timeseries-Anomaly | minor | Request throughput anomaly detected. | Metric looks abnormal and needs attention.
ResponseLatencyAnomaly | Response Latency Anomaly | custom | Timeseries-Anomaly | minor | Response latency anomaly detected. | Metric looks abnormal and needs attention.
ResponseThroughputAnomaly | Response Throughput Anomaly | custom | Timeseries-Anomaly | major | Response throughput anomaly detected. | Metric looks abnormal and needs attention.
SSOCreated | SSO Provider Created | event | UAM | major | New UAM SSO provider was created. | No action required.
SSODeleted | SSO Provider Deleted | event | UAM | major | Existing UAM SSO provider was deleted. | No action required.
ServiceClientErrorPerSourceSite | Virtual Host Client Error | metric | Virtual-Host | major | More than 10% of the requests from site to service failed due to client error. Some clients are sending invalid requests to the virtual-host. | Consider blocking the relevant users/IPs using Volterra Policy features.
ServiceEndpointHealthcheckFailure | Endpoint healthcheck failure | metric | Virtual-Host | minor | Healthcheck failed for virtual-host endpoint. | Check the health of the origin servers. Check connectivity of origin servers to Volterra.
ServiceServerErrorPerSourceSite | Virtual Host Server Error | metric | Virtual-Host | major | Proxy is seeing excessive errors from upstream origin servers. | Check the health of the origin servers. Check connectivity of origin servers to Volterra.
SiteCPUOvercommit | Site CPU Overcommit | metric | Infrastructure | minor | Site has overcommitted CPU requests for Pods; a failure may cause Site disruption. | Increase capacity by adding a Node or reduce Pod workload.
SiteCertificateExpiration | K8S Client Certificate Error | metric | Infrastructure | minor | Kubernetes certificates are expiring for your Volterra Site. To avoid interruption, upgrade to the latest available Volterra Software Version. | Upgrade the Volterra Software Version to the latest available.
SiteCertificateExpiration | K8S Client Certificate Error | metric | Infrastructure | major | Kubernetes certificates are expiring for your Volterra Site. To avoid interruption, upgrade to the latest available Volterra Software Version. | Upgrade the Volterra Software Version to the latest available.
SiteCustomerTunnelInterfaceDown | Customer Tunnel Interface Down | metric | Infrastructure | major | Connection from CE to a single RE is down. Some functionality will be limited. | Check physical and network connectivity of the CE.
SiteDeleted | Site Deleted | event | Infrastructure | critical | Entire site was deleted. | No action required.
SiteHttpProbeDown | RE to Customer Site Tunnel Down | metric | Infrastructure | major | HTTP check from connected Regional Edge to Customer Edge has failed. | Check the network connectivity.
SiteHttpUnhealthy | Remote HTTP check failed | metric | IaaS-CaaS | major | Communication with Volterra services at site is failing. | Check the network connectivity.
SiteMemoryOvercommit | Site Memory Overcommit | metric | Infrastructure | minor | Site has overcommitted memory resource requests for Pods and cannot tolerate any node failure. | Add new node into site or deprovision workload.
SiteNodeHeartbeatMissed | Site Heartbeat Down | metric | Infrastructure | major | Node at site did not send heartbeat for more than 20 minutes. | Check network connectivity and power status of node in Site. If running, try rebooting the node.
SitePhysicalInterfaceDown | Physical Interface Down | metric | Infrastructure | critical | One of the physical interfaces of CE went down. | Check physical and network connectivity of the CE.
SitePhysicalInterfaceDown | Physical Interface Down | event | Infrastructure | critical | Physical interface on node is down. | Check the network connectivity.
SiteRegistrationApproved | Site Registration Approved | event | Infrastructure | major | Site registration was approved and is waiting for configuration. | Check registration object for failure.
SiteRegistrationDeleted | Site Registration Deleted | event | Infrastructure | major | The site node registration was deleted. | No action required.
SiteRegistrationDuplicateName | Site Registration Duplicate Name Error | event | Infrastructure | major | Cannot register node with the given name; the same name is already registered. | Choose a different node name.
SiteRegistrationPending | Site Registration Pending | event | Infrastructure | major | Site registration is in pending state. | Check registration object for failure.
SiteSSHFailedLogin | SSH Failed Login | event | UAM | major | Failed SSH login to node detected. | Validate access with respect to your internal security policies.
SiteSSHLoginWithLockOutCert | SSH Login with Lock out Cert | event | UAM | critical | SSH login to node with lock out cert detected. | Validate access with respect to your internal security policies.
SiteSSHPasswordLogin | SSH Password Login | event | UAM | critical | SSH login to node using password authentication detected. | Validate access with respect to your internal security policies.
SiteSSHPubkeyLogin | SSH Pubkey Login | event | UAM | major | SSH login using key to node detected. | Validate access with respect to your internal security policies.
SiteSudoExecuted | Sudo Command Executed | event | UAM | major | Privileged command execution at node detected. | Validate command with respect to your internal security policies.
SiteTunnelConnectionDown | IPSec/SSL Tunnel Connection Down | event | Infrastructure | critical | IPSec/SSL tunnel connection to the site is down. | Check the network connectivity.
SiteTunnelInterfaceDown | Tunnel Interface Down | metric | Infrastructure | critical | Connections from both REs to the CE are down. The majority of functionality will be impacted. | Check physical and network connectivity of the CE.
SiteUpgradeFailing | Site Upgrade Failing | metric | Infrastructure | critical | Volterra software upgrade is failing at Site. It retries every 10 minutes and keeps updating the status. | Check Volterra Software status message info. Contact support if the problem persists for more than 30 minutes.
UserCreated | User Created | event | UAM | major | New UAM user was created. | No action required.
UserDeleted | User Deleted | event | UAM | major | Existing UAM user was deleted. | No action required.
UserUpdated | User Updated | event | UAM | major | Existing UAM user was updated. | No action required.
ViewActionError | View Action Error | event | IaaS-CaaS | major | View action finished with error. | Check the validity of your view variables.
VoltShareDecryptionError | VoltShare Decryption Error | metric | VoltShare | major | Decrypt operation has failures. | Check secret policy or admin policy.
VoltShareEncryptionError | VoltShare Encryption Error | metric | VoltShare | major | Encrypt operation has failures. | Check secret policy or admin policy.
WafTooManySecurityEvents | Security Events | metric | Security | major | Virtual Host WAF security events detected. | Consider blocking the relevant users/IPs using FastACL or Network Policy or Service Policy.
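
One way to use the severity column when handling several alerts at once is to order them so that critical alerts are reviewed first. The Python sketch below assumes a hypothetical record shape (a dict with "alert", "group", and "severity" keys); the actual alert payload format is not described in this document. The three sample rows are taken from the table above.

```python
# Lower rank means higher priority; the severity names come from the table.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2}


def triage(alerts):
    """Return alerts ordered from most to least severe.

    sorted() is stable, so alerts of equal severity keep their input order.
    """
    return sorted(alerts, key=lambda a: SEVERITY_RANK[a["severity"]])


# Hypothetical records built from rows of the table.
incoming = [
    {"alert": "KubePodNotReady", "group": "IaaS-CaaS", "severity": "minor"},
    {"alert": "WafTooManySecurityEvents", "group": "Security", "severity": "major"},
    {"alert": "SiteTunnelConnectionDown", "group": "Infrastructure", "severity": "critical"},
]

for record in triage(incoming):
    print(record["severity"], record["alert"])
```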

TSA Severity vs Anomaly Scores

The following table lists the Time-Series Anomaly (TSA) scores and the associated alert severity for various metrics. The table also shows the absolute threshold for each metric.

Metric | Severity | Score | Absolute Threshold
Request Rate | minor | 0.6 | NA
Request Rate | major | 1.5 | 50 rps
Request Rate | critical | 3.0 | 100 rps
Request Throughput | minor | 0.6 | NA
Request Throughput | major | 1.5 | 2500 kbps
Request Throughput | critical | 3.0 | 5000 kbps
Response Throughput | minor | 0.6 | NA
Response Throughput | major | 1.5 | 25000 kbps
Response Throughput | critical | 3.0 | 50000 kbps
Response Latency | minor | 0.6 | NA
Response Latency | major | 1.5 | 250 ms
Response Latency | critical | 3.0 | 500 ms
Error Rate | minor | 0.6 | NA
Error Rate | major | 1.5 | 5 erps
Error Rate | critical | 3.0 | 10 erps

Note: For more information on the Volterra TSA, see the Time-Series Anomaly Detection guide.
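
The TSA table can be read as a lookup from anomaly score and metric value to severity. The Python sketch below encodes the table values and applies one possible reading of them. It is an assumption, not documented behavior, that the listed score is a lower bound, that a severity applies only when both the score and the absolute threshold (where one is given) are met, and that the highest matching severity wins. The function name tsa_severity and the metric keys are hypothetical.

```python
from typing import Optional

# (severity, minimum score, absolute threshold or None for NA), highest first.
# Units follow the table: rps, kbps, kbps, ms, erps.
TSA_LEVELS = {
    "request_rate": [("critical", 3.0, 100), ("major", 1.5, 50), ("minor", 0.6, None)],
    "request_throughput": [("critical", 3.0, 5000), ("major", 1.5, 2500), ("minor", 0.6, None)],
    "response_throughput": [("critical", 3.0, 50000), ("major", 1.5, 25000), ("minor", 0.6, None)],
    "response_latency": [("critical", 3.0, 500), ("major", 1.5, 250), ("minor", 0.6, None)],
    "error_rate": [("critical", 3.0, 10), ("major", 1.5, 5), ("minor", 0.6, None)],
}


def tsa_severity(metric: str, score: float, value: float) -> Optional[str]:
    """Return the highest severity whose score and threshold are both met.

    ASSUMPTION: score and absolute threshold are combined with a logical AND;
    the document lists both values but does not say how they interact.
    """
    for severity, min_score, threshold in TSA_LEVELS[metric]:
        if score >= min_score and (threshold is None or value >= threshold):
            return severity
    return None  # Score below 0.6: no anomaly alert under this reading.
```

Under this reading, a request rate with an anomaly score of 3.2 at 60 rps would be classified as major rather than critical, because the 100 rps absolute threshold is not met.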