Monitoring

The system generates detailed telemetry for all features, and this telemetry can be consumed across Volterra services. A distributed system collects logs, metrics, alerts, and traces from our global infrastructure as well as from each of the Volterra Nodes deployed across users' cloud and edge locations. Some of these logs and metrics are also post-processed to detect anomalies, analyze application APIs and security issues, create graph visualizations, generate healthscores, etc.

In addition to the monitoring APIs, the VoltConsole provides many dashboards and graphs that customer teams (operations or application) can view in the browser, e.g. Site Map, Site Connectivity, Site Dashboard, Application Deployments, Virtual Host Dashboard, Service Mesh Graph, Security Dashboard, etc. We will not cover all of them here as they keep evolving with the product.

Metrics

Infrastructure Metrics

As a SaaS provider, it is paramount for Volterra to monitor the health of the infrastructure, which includes not only our global (physical) infrastructure, but also customers' physical infrastructure (edge locations) and virtual infrastructure (in the cloud environment). The health of this infrastructure greatly impacts the performance of the application. For example, packet drops between two sites due to network congestion may increase the error rate and the latency between the services deployed across these sites, and could therefore impact the overall health of the application. The metrics used for infrastructure monitoring include (but are not limited to) the following:

  • CPU, memory and disk utilization of a site
  • CPU, memory and disk utilization per container
  • Status of physical and virtual interfaces (ipsec tunnel)
  • In/Out throughput and drops per interface
  • Latency between the sites

The infrastructure metrics have tenant and site labels associated with them, in addition to other labels that depend on the component being monitored. For example, interface metrics have interface_name and interface_type in addition to the tenant and site labels; container metrics have container_name, pod and instance labels.

Application Metrics

At Volterra, we use the RELT (Request rate, Error rate, Latency, Throughput) method to monitor the applications that power the Volterra Edge Cloud Platform. The same methodology is used to monitor the customer applications deployed in the Volterra Edge Cloud. The RELT method is an extension of the RED (Request rate, Error rate, Duration) method that is typically used to monitor microservices. Typically, customers define SLAs (Service Level Agreements) and KPIs (Key Performance Indicators) for their applications using one or more of these four metrics. Capacity planning is also a function of Request rate, Error ratio (Error rate/Request rate) and Latency. The application metrics have the following labels associated with them to provide different levels of observability.

  • tenant - unique identifier for a customer
  • namespace - identifies an administrative domain within a tenant. Multiple instances of the same application can be deployed across multiple namespaces within a tenant.
  • apptype - an application can have one or more services represented by one or more virtual hosts. The interaction between the services in an application can be viewed as a service mesh. The service mesh graph provides visibility not only into the east-west traffic flowing between the services that belong to an apptype, but also into the north-south traffic originating from the client.
  • virtual_host - unique identifier that identifies a virtual service, API Gateway or a Load Balancer.
  • src_site - source site where the request enters the Volterra Edge Cloud.
  • src - identifies a network in case of north-south traffic and virtual-service in case of east-west traffic.
  • dst_site - destination site where an instance of the virtual service is hosted which services the request.
  • dst - virtual service (server) that handles the request from the client.

Each application metric can be aggregated across various labels and therefore enables observability for the different teams responsible for monitoring and enforcing SLAs at different levels.
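As an illustration of aggregating across labels, here is a minimal sketch in Python. The sample layout (`requests`, `errors`, `latency_sum_ms`, `bytes`), the function name `aggregate_relt`, and the label values are assumptions made for this example and are not part of the Volterra API. It also assumes that every sample covers the same time interval.

```python
from collections import defaultdict


def aggregate_relt(samples, group_by, interval_s):
    """Aggregate raw samples into RELT values for each combination of labels.

    All samples are assumed to cover the same interval of `interval_s` seconds.
    """
    totals = defaultdict(
        lambda: {"requests": 0, "errors": 0, "latency_ms": 0.0, "bytes": 0}
    )
    for sample in samples:
        # Samples that agree on the chosen labels fall into the same group.
        key = tuple(sample["labels"][name] for name in group_by)
        group = totals[key]
        group["requests"] += sample["requests"]
        group["errors"] += sample["errors"]
        group["latency_ms"] += sample["latency_sum_ms"]
        group["bytes"] += sample["bytes"]

    result = {}
    for key, group in totals.items():
        requests = group["requests"]
        result[key] = {
            "request_rate": requests / interval_s,
            "error_rate": group["errors"] / interval_s,
            # Error ratio = Error rate / Request rate = errors / requests.
            "error_ratio": group["errors"] / requests if requests else 0.0,
            "latency_ms": group["latency_ms"] / requests if requests else 0.0,
            "throughput_bps": group["bytes"] * 8 / interval_s,
        }
    return result


# Hypothetical samples: one virtual host served from two destination sites.
samples = [
    {"labels": {"tenant": "acme", "namespace": "prod",
                "virtual_host": "vh1", "dst_site": "site-a"},
     "requests": 120, "errors": 6, "latency_sum_ms": 2400.0, "bytes": 60000},
    {"labels": {"tenant": "acme", "namespace": "prod",
                "virtual_host": "vh1", "dst_site": "site-b"},
     "requests": 60, "errors": 3, "latency_sum_ms": 600.0, "bytes": 30000},
]

# An application team might look at the whole virtual host ...
per_virtual_host = aggregate_relt(samples, ["virtual_host"], interval_s=60)
# ... while an operations team might break it down per destination site.
per_site = aggregate_relt(samples, ["virtual_host", "dst_site"], interval_s=60)
print(per_virtual_host)
print(per_site)
```

Grouping on fewer labels gives the coarse view a service owner needs, while adding labels such as dst_site narrows the same data down to a single site.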


Graph Query APIs

Since applications and services may be distributed, it may not be sufficient to monitor individual services; the relationships between these inter-dependent services must also be established. Hence, we have come up with a mechanism to represent these dependencies as a graph. Nodes of this graph are services (or sites, etc.) and the interactions between them are edges. Data can be retrieved for a set of nodes, a set of edges, or a set of nodes and edges using our Graph Query APIs. The Graph APIs also avoid the need for multiple API calls, and each API call can be made for a given time range.
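The node and edge model can be sketched as a small in-memory structure. This is only an illustration of the idea: the class `MetricGraph`, its methods, and the service names are hypothetical and do not reflect the actual Graph Query API schema. The time range of a real query is left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class MetricGraph:
    """Services (or sites) as nodes, interactions between them as edges."""
    nodes: dict = field(default_factory=dict)  # node id -> metrics
    edges: dict = field(default_factory=dict)  # (source id, destination id) -> metrics

    def add_node(self, node_id, **metrics):
        self.nodes[node_id] = metrics

    def add_edge(self, src, dst, **metrics):
        self.edges[(src, dst)] = metrics

    def query(self, node_ids, want_nodes=True, want_edges=True):
        """Return nodes, edges, or both for the selected node ids in one call."""
        selected = set(node_ids)
        response = {}
        if want_nodes:
            response["nodes"] = {
                n: m for n, m in self.nodes.items() if n in selected
            }
        if want_edges:
            # Keep only edges whose two endpoints are both selected.
            response["edges"] = {
                e: m for e, m in self.edges.items()
                if e[0] in selected and e[1] in selected
            }
        return response


# Hypothetical three-service application.
graph = MetricGraph()
graph.add_node("frontend", request_rate=50.0)
graph.add_node("cart", request_rate=20.0)
graph.add_node("db", request_rate=35.0)
graph.add_edge("frontend", "cart", latency_ms=12.0)
graph.add_edge("cart", "db", latency_ms=3.0)

print(graph.query(["frontend", "cart"]))
print(graph.query(["cart", "db"], want_nodes=False))
```

A single `query` call returns both the per-service metrics and the metrics of the interactions between the selected services, which is the property that saves the caller from issuing one request per object.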

Site Connectivity Graph

Site Connectivity is an interesting visualization of sites and service deployments across various cloud or edge locations. It can also be a good starting point to drill-down to access dashboards, etc.

Figure: Site Connectivity Graph

A tenant typically owns one or more sites, and each site is connected to two Regional Edges (for redundancy) using a secure tunnel (for example, IPSec or TLS). Connectivity between the sites is represented as a graph, where each node represents a site (a Volterra RE site or a customer site) and each edge represents the connectivity between two sites. Each node and edge in the connectivity graph has various metrics associated with it. The value of each metric is calculated based on the start time and end time in the connectivity graph request. Node metrics include the following:

  • CPU, memory and disk utilization per node in the cluster
  • Number of application deployments and associated pods
  • Operational status of all the physical and tunnel interfaces
  • In/Out throughput and drop rate for the site and per interface

Edge metrics include the following:

  • Reachability between the sites.
  • Latency
  • In/Out throughput and drop rate.

Each node and edge has a healthscore associated with it, and some or all metrics may contribute to the overall healthscore of the node or edge. For example, a node with a high drop rate, or with CPU or memory utilization that crosses the threshold, will have a lower healthscore. Similarly, a low reachability score (calculated based on successful probes) or high latency will lower the edge healthscore.
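The sketch below shows one way such a score could be derived. The actual healthscore formula is not described here, so everything in it is an assumption made for illustration: the 0 to 100 scale, the fixed penalty per crossed threshold, the threshold values, and the function names.

```python
def node_healthscore(metrics, thresholds, penalty=25):
    """Start at 100 and subtract a fixed penalty per metric above its threshold."""
    score = 100
    for name, limit in thresholds.items():
        if metrics.get(name, 0.0) > limit:
            score -= penalty
    return max(score, 0)


def edge_healthscore(successful_probes, total_probes, latency_ms,
                     latency_limit_ms, penalty=25):
    """Scale the score by probe reachability, then penalize high latency."""
    reachability = successful_probes / total_probes if total_probes else 0.0
    score = round(100 * reachability)
    if latency_ms > latency_limit_ms:
        score -= penalty
    return max(score, 0)


# Hypothetical thresholds (percent utilization and percent of packets dropped).
thresholds = {"cpu": 80.0, "memory": 80.0, "drop_rate": 1.0}

busy_node = {"cpu": 95.0, "memory": 40.0, "drop_rate": 2.5}
print(node_healthscore(busy_node, thresholds))

# 9 of 10 probes succeeded and latency is above the assumed 100 ms limit.
print(edge_healthscore(9, 10, latency_ms=250.0, latency_limit_ms=100.0))
```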

Note that even though all the REs are connected in a mesh topology, the connectivity between the REs is not shown in the customer view on the VoltConsole, as the REs are owned by the ves.io tenant that belongs to Volterra.

Details on all the metrics that can be monitored for these site objects are covered in the API Specification.

Service Graph

An application can have one or more services represented by one or more virtual hosts. The interaction between the services in an application can be viewed as a service mesh graph, where each node represents a service and each edge represents the interaction between two services. The service mesh graph provides visibility not only into the east-west traffic flowing between the services that belong to an app_type, but also into the north-south traffic originating from the clients. The service mesh graph provides RELT metrics for each node (service) and edge (interaction between the services).

In the following service graph for an application, C1 and C2 are the clients and S1, S2 and S3 are the servers. The direction of the arrow indicates the flow of requests between the services.

Figure: Volterra Service Graph

Details on all the metrics that can be monitored for these graph objects are covered in the API Specification.


Logs

Access Logs (Request & Response)

An access log is a record of a request and response across a virtual host. Access logs are extremely important in troubleshooting and are used in addition to application metrics to help spot an issue.

Access logs contain more context (URL, user-agent, request body, etc.) that is lacking in the metrics due to cardinality issues, and they therefore help answer who did what and when, and how that impacted the application or the overall system. In a distributed system, failures are rarely confined to the misbehavior of one component and are often attributed to various triggers across a highly interconnected graph of components (the service mesh). Access logs help connect the dots by correlating a sequence of events to identify the failure. One can essentially replay the events in the sequence found in the access logs to recreate an issue in a test environment for root cause analysis. The workflow typically involves identifying the issue from the application metrics (RELT, discussed earlier) through an SLO violation and then drilling down through the access logs across various components in the service mesh graph.

Access logs are used by our machine learning model to build an API endpoint markup that is also used to detect various endpoints in the service mesh. For each endpoint, various behavioral models can be learnt and the probability distribution between endpoints can be calculated, which enables per-request anomaly detection.

Access logs not only play an important role in identifying and fixing application misbehavior; the context in the access log is also important when an API call triggers a security event. Access logs help demonstrate to business partners and end-users that every API access can be subjected to various security rules to guard against potential security breaches, and that a log or alert is generated when a risk is identified. Access logs play a key role in a business's overall risk management strategy, as they are often the basis for security and forensic analysis and are therefore required to be maintained in cold storage for a longer duration for regulatory compliance.

Access logs are structured and contain the following information:

  • User
  • Application type
  • Virtual host
  • Request URL
  • Operation (POST/PUT/DELETE/GET)
  • Request body (if the request triggered security event)
  • Request length
  • Response code (Indicates success/failure)
  • Response body (if the request triggered security event)
  • Response duration
  • Source (Network where the request originated)
  • Source Instance (Country where the request originated)
  • Source IP address
  • Source site (Site where the request originated)
  • Destination (Destination service)
  • Destination site (site where the destination service resides)
  • WAF Rules hit (list of WAF rules that matched the request)
  • WAF instance Id
  • Timestamp

These logs are available in the dashboard for a virtual-host and service-mesh. In addition, we also provide an API to query access logs. This API accepts various criteria/match conditions in the request to fetch appropriate access logs scoped by tenant and namespace. For example, one can query for all denied requests destined for a virtual host that originated from a specific country between a given start time and end time.
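The example query above can be sketched as a filter over structured records. The snippet runs against an in-memory list rather than the real API, and the field names (`virtual_host`, `country`, `response_code`), the function `query_access_logs`, and the use of response code 403 to represent a denied request are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def query_access_logs(records, start, end, **match):
    """Return records with start <= timestamp < end that equal every match condition."""
    return [
        record for record in records
        if start <= record["timestamp"] < end
        and all(record.get(key) == value for key, value in match.items())
    ]


def utc(day, hour):
    """Helper that builds a timestamp in January 2021 for the sample data."""
    return datetime(2021, 1, day, hour, tzinfo=timezone.utc)


# Hypothetical access log records with a subset of the fields listed above.
records = [
    {"timestamp": utc(1, 9), "virtual_host": "vh1", "country": "FR",
     "response_code": 403, "request_url": "/login"},
    {"timestamp": utc(1, 10), "virtual_host": "vh1", "country": "FR",
     "response_code": 200, "request_url": "/home"},
    {"timestamp": utc(1, 11), "virtual_host": "vh2", "country": "FR",
     "response_code": 403, "request_url": "/admin"},
    {"timestamp": utc(2, 9), "virtual_host": "vh1", "country": "FR",
     "response_code": 403, "request_url": "/login"},
]

# Denied requests for vh1 that originated from one country on 1 January.
denied = query_access_logs(
    records, utc(1, 0), utc(2, 0),
    virtual_host="vh1", country="FR", response_code=403,
)
for record in denied:
    print(record["timestamp"].isoformat(), record["request_url"])
```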

Details on Access Logs are covered in the API Specification.

Application Logs

Applications are deployed on VoltStack using Volterra's Kubernetes service, and applications (VMs and containers) are instantiated as Kubernetes Pod resources. Application logs can be queried using the kubectl CLI tool or using the Kubernetes corev1 Pod Read Log API from the Virtual Kubernetes API endpoint.
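For the kubectl route, a small helper such as the one below can assemble the command line. It only builds the argument list and does not run anything. The pod name, namespace, and kubeconfig path are placeholders, and it assumes that a kubeconfig file pointing at the Virtual Kubernetes API endpoint has already been obtained. The flags used (`--kubeconfig`, `--namespace`, `--container`, `--since`) are standard kubectl options.

```python
import shlex


def kubectl_logs_command(pod, namespace, kubeconfig, container=None, since=None):
    """Build the argument list for `kubectl logs` against a given kubeconfig."""
    command = [
        "kubectl", "--kubeconfig", kubeconfig,
        "logs", pod, "--namespace", namespace,
    ]
    if container:
        # Needed only when the pod runs more than one container.
        command += ["--container", container]
    if since:
        # Relative duration understood by kubectl, such as "10m" or "2h".
        command += ["--since", since]
    return command


# Placeholder values; replace them with a real pod, namespace and kubeconfig.
command = kubectl_logs_command(
    pod="my-app-0",
    namespace="my-namespace",
    kubeconfig="./vk8s-kubeconfig.yaml",
    since="10m",
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where kubectl is installed.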

Audit Trail Logs

Volterra provides public APIs to track the creation, modification, and deletion of, and access to, configuration objects in the system. Audit logs answer "who" did "what" and "when". Audit logs also indicate whether the activity was successful. This helps in post-mortem analysis when something goes wrong or when an activity or a set of activities has caused the system to misbehave. Even in the normal course of operation, an administrator may want to get insight into all unsuccessful events or generate a daily report on all user activities, which can be fed to the AI model to detect anomalous user behavior. Alerts may be generated upon detection of anomalous user behavior.

Audit logs are structured and contain the following information:

  • User
  • Request URL
  • Operation (POST/PUT/DELETE/GET)
  • Request body
  • Request length
  • Response code (Indicates success/failure)
  • Response body
  • Response duration
  • Source (Network where the request originated)
  • Source site (Site where the request originated)
  • Destination (Destination service)
  • Destination site (site where the destination service resides)
  • Timestamp

These logs are available in the VoltConsole in the infrastructure (system) namespace, and we also provide a public API to query audit logs. Like any other public API provided by Volterra, the audit log API is scoped by tenant and namespace. The API takes various criteria/match conditions in the request to fetch the appropriate audit logs. For example, an administrator may query for all "POST" operations performed by a specific user between a given start time and end time.
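Once audit records have been fetched, a report of unsuccessful operations per user is a short aggregation. The sketch below works on an in-memory list. The field names, the function `failed_operations_by_user`, and the rule that an HTTP response code of 400 or above counts as a failure are assumptions for illustration and are not taken from the API Specification.

```python
from collections import Counter
from datetime import datetime, timezone


def failed_operations_by_user(audit_logs, start, end):
    """Count unsuccessful operations per user with start <= timestamp < end."""
    counts = Counter()
    for record in audit_logs:
        in_range = start <= record["timestamp"] < end
        # Assumption: response codes of 400 and above indicate failure.
        if in_range and record["response_code"] >= 400:
            counts[record["user"]] += 1
    return counts


def at(day, hour):
    """Helper that builds a UTC timestamp in January 2021 for the sample data."""
    return datetime(2021, 1, day, hour, tzinfo=timezone.utc)


# Hypothetical audit records with a subset of the fields listed above.
audit_logs = [
    {"timestamp": at(5, 8), "user": "alice", "operation": "POST", "response_code": 200},
    {"timestamp": at(5, 9), "user": "bob", "operation": "DELETE", "response_code": 403},
    {"timestamp": at(5, 10), "user": "bob", "operation": "PUT", "response_code": 409},
    {"timestamp": at(6, 8), "user": "alice", "operation": "POST", "response_code": 500},
]

# Daily report for 5 January.
report = failed_operations_by_user(audit_logs, at(5, 0), at(6, 0))
for user, failures in report.most_common():
    print(user, failures)
```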

Details on Audit Logs are covered in the API Specification.


Notifications

Alerts

Alerts are broadly classified into four categories:

  • Configuration validation failures or operational failures. A configuration might fail because it is invalid in a given site, because there are no resources, etc.
  • Application alerts based on RELT metrics.
  • Security Alerts based on policy, WAF or anomaly detection.
  • Infrastructure alerts such as site down, site connectivity down (IPsec connectivity to the RE), site interface down, and high CPU, memory, or disk utilization in a customer site. Typically, this includes alerts on any metric that is included in the connectivity graph.

These alerts are available in the VoltConsole in their respective locations (e.g. site alerts in the site dashboard) and through the Alerts API. Like any other public API provided by Volterra, the alert API is scoped by tenant and namespace.

Details on Alerts are covered in the API Specification.

Security Events

In the Volterra Edge Cloud platform, customers may configure a WAF (Web Application Firewall) per virtual-host to monitor and protect their web applications from a range of attack vectors such as Protocol Header attacks, Cross Site Scripting (XSS), Remote Code Execution, and SQL Injection. The incoming web traffic is subjected to a set of rules associated with the WAF-object before reaching the server. If a request is identified as malicious, based on the rule hits and the paranoia level set on the WAF-object, the request is blocked and a security event is generated, thereby shielding the customer application.

Details on Security Events are covered in the API Specification.