The system generates detailed telemetry for every feature, and this telemetry can be consumed across Volterra services. It provides observability of infrastructure, applications, connectivity, and security services across a distributed environment, and it allows netops, devops, and application teams to troubleshoot and optimize their applications without placing an additional burden on application developers. Four types of telemetry data are collected from the distributed system - metrics, logs, alerts, and events. Some of these logs, metrics, and events are also post-processed to detect anomalies, analyze application APIs, identify security issues, create graph visualizations, etc.
This telemetry data provides different outcomes to different types of users:
- Volterra SRE - The goal of our site reliability engineers is to ensure that customer services and our global infrastructure are operational and meeting their service level objectives.
- Customer Operations - Based on the RBAC and policy configured, central operations teams can consume a significant amount of data for observability of their infrastructure, network, applications, and the end users of their applications. VoltConsole offers rich, instant visualization of this data, and APIs are available to integrate with other tools.
- Customer Application Teams - Depending on the RBAC and policy configured, the application team can get observability of the application and network services that relate to their specific applications.
- Third Party Integrations - In many cases, certain logs and metrics need to be sent to external systems for compliance, end-to-end visibility, alerting, etc. Good examples are ServiceNow, Pagerduty, Splunk, NewRelic, AppDynamics, and DataDog. Our APIs can be used to integrate with most of the external systems that are commonly used today.
If you’re interested in further details of how the features described in this guide work, you can find out more about Volterra’s observability architecture in the Concepts section.
Intro to Observability
A complex, distributed system collects logs, metrics, alerts, and traces from our global infrastructure as well as from each of the Volterra Nodes deployed across users' cloud and edge locations.
From a user's point of view, there are two ways to get observability into applications and services deployed across multi-cloud, network, and edge sites - use the Volterra SaaS portal for centralized dashboards, or use the Volterra APIs to integrate with 3rd party tools. Four types of telemetry and observability data are collected from distributed sources and aggregated by the system:
- Metrics - The system collects many time-series metrics for the Infrastructure (CPU, memory, disk, interfaces, connectivity, and latency), Applications, and Application Services (deployment status, application health, request rate, errors, duration, latency, and throughput).
- Logs - Three types of logs are aggregated across the system - system logs, application logs, and access logs (request and response). Application logs are currently not stored automatically by the system, so the user needs to decide how to handle their storage.
- Alerts - Alerts can relate to user services (e.g. application restart, site connectivity lost, out of memory, etc.) or to infrastructure services (Volterra service restarted, connectivity errors, etc.). All of these alerts are available in the dashboard and, using the APIs, can be integrated with external systems like Pagerduty. Some of the alerts relating to infrastructure services are handled and mitigated automatically by the Volterra SRE team and do not require any action from the customer.
- Events (Audit Logs) - These logs record events relating to access to, and changes of, configuration resources. They are security-related chronological records that can be used to identify who changed the configuration of an object, when, and what was changed.
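As an illustration of the "who, when, and what" idea behind audit logs, the following is a minimal sketch of how a chronological change history for one object could be reconstructed. The `AuditEvent` record, its field names, and the `changes_to` helper are hypothetical and chosen for this example only; they are not Volterra's actual audit log schema or API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    """Hypothetical audit record; field names are illustrative only."""
    user: str            # who made the change
    timestamp: datetime  # when the change happened
    object_name: str     # which configuration object was touched
    action: str          # what was done, e.g. "create" or "update"


def changes_to(events, object_name):
    """Return the events that touch one object, oldest first."""
    matching = [e for e in events if e.object_name == object_name]
    return sorted(matching, key=lambda e: e.timestamp)


# Made-up sample data, deliberately out of chronological order.
events = [
    AuditEvent("bob@example.com",
               datetime(2020, 3, 2, 9, 30, tzinfo=timezone.utc),
               "lb-frontend", "update"),
    AuditEvent("alice@example.com",
               datetime(2020, 3, 1, 14, 0, tzinfo=timezone.utc),
               "lb-frontend", "create"),
    AuditEvent("alice@example.com",
               datetime(2020, 3, 1, 15, 0, tzinfo=timezone.utc),
               "fw-policy", "update"),
]

history = changes_to(events, "lb-frontend")
for e in history:
    print(f"{e.timestamp.isoformat()} {e.user} {e.action} {e.object_name}")
```

Sorting by timestamp is what turns a set of records into the chronological trail described above; a real integration would read the records from the audit log API rather than from an in-memory list.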
Many of these logs and metrics are used for post-processing to determine anomalies, analyze application APIs, security issues, create graph visualizations, etc. For example, these metrics are also used to generate a health-score for sites as well as applications, determined based on statistical analysis of the metrics.
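This guide does not describe how the health score is actually computed. Purely to illustrate what "a score derived from statistical analysis of metrics" can look like, here is a sketch that uses a z-score against a baseline window, which is one common technique and not necessarily the one Volterra uses. The function name `health_score`, the `max_sigma` parameter, and the 0-100 scale are all assumptions.

```python
from statistics import mean, stdev


def health_score(baseline, current, max_sigma=3.0):
    """Map the deviation of `current` from `baseline` onto a 0-100 score.

    Illustrative only: 100 means the value sits on the baseline mean,
    0 means it is `max_sigma` or more standard deviations away.
    """
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation
    if sigma == 0:
        # Flat baseline: any deviation at all counts as unhealthy.
        return 100.0 if current == mu else 0.0
    z = abs(current - mu) / sigma
    return round(100.0 * max(0.0, 1.0 - z / max_sigma), 1)


# Made-up latency samples in milliseconds.
latency_ms = [20, 22, 19, 21, 20, 23, 21]
print(health_score(latency_ms, 21))  # close to the baseline, high score
print(health_score(latency_ms, 40))  # far outside the baseline, score 0.0
```

Note that this two-sided z-score also penalizes values that are unusually low; a real scoring scheme would likely treat each metric according to whether higher or lower is better.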
Metrics, logs, alerts, and events are automatically stored by Volterra for each tenant and are retained for 14 days by default. If the user needs a longer retention period, it can be extended. Audit logs are retained for 6 months, as regulatory and compliance needs may call for longer retention.
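The retention rules above can be expressed as a small check, which may be useful when deciding whether data must be exported to an external system before it ages out. This is a sketch under stated assumptions: the `is_retained` function and the `kind` labels are hypothetical, "6 months" is approximated as 183 days, and the extended retention period is modeled as an optional override because the guide does not say how the extension is configured.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION = timedelta(days=14)
AUDIT_RETENTION = timedelta(days=183)  # assumption: "6 months" taken as 183 days


def is_retained(kind, recorded_at, now, extended=None):
    """Return True if a record of the given kind is still within retention.

    `kind` is a hypothetical label such as "metrics", "logs", "alerts",
    or "audit". `extended` optionally overrides the 14-day default.
    """
    if kind == "audit":
        window = AUDIT_RETENTION
    else:
        window = extended or DEFAULT_RETENTION
    return now - recorded_at <= window


now = datetime(2020, 6, 1, tzinfo=timezone.utc)
print(is_retained("metrics", now - timedelta(days=10), now))  # within 14 days
print(is_retained("metrics", now - timedelta(days=20), now))  # past 14 days
print(is_retained("audit", now - timedelta(days=20), now))    # within 6 months
```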
The above observability data is available to the user through two mechanisms:
- VoltConsole - Using a web browser and credentials, the user can access various dashboards and graphs relating to their infrastructure and applications.
  - In the Infrastructure (system) namespace, you can get visualizations like Site Map, Site Connectivity, Site Dashboard, etc.
  - In the respective Application namespaces, you can get visualizations like Application Sites, Application Deployments, Virtual Host Dashboard, Service Mesh Graph, Security Dashboard, Application Traffic Graph, etc.
- Volterra APIs - There are APIs to collect infrastructure and application metrics, logs, events, and alerts. In addition, there is a graph query API that provides metrics for interactions across services. These APIs can be used to interface with external systems like Splunk or Datadog that may be used within the enterprise.
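To make the integration path more concrete, the sketch below shows the transformation step of a forwarder: taking alert records already fetched from the APIs and wrapping them in the JSON envelope that Splunk's HTTP Event Collector accepts (`time`, `host`, `sourcetype`, `event`). The shape of the alert record, the `to_hec_event` helper, and the `volterra:alert` sourcetype are assumptions made for illustration; the actual fields returned by the Volterra APIs are documented in the API reference and may differ. Fetching from Volterra and posting to Splunk are intentionally left out.

```python
import json
from datetime import datetime


def to_hec_event(alert, sourcetype="volterra:alert"):
    """Wrap one alert dict in a Splunk HEC event envelope.

    `alert` is a hypothetical record with an ISO 8601 `timestamp`
    and a `site` field; the sourcetype name is also an assumption.
    """
    ts = datetime.fromisoformat(alert["timestamp"])
    return {
        "time": ts.timestamp(),   # HEC expects epoch seconds
        "host": alert["site"],
        "sourcetype": sourcetype,
        "event": alert,           # original record kept intact
    }


# Made-up sample alert, loosely based on the alert examples in this guide.
alerts = [
    {
        "timestamp": "2020-03-01T14:00:00+00:00",
        "site": "edge-site-1",
        "name": "site connectivity lost",
        "severity": "major",
    },
]

# Serialize as one JSON object per line, ready to be sent as a request body.
payload = "\n".join(json.dumps(to_hec_event(a)) for a in alerts)
print(payload)
```

Keeping the original record under `event` means no information is lost in transit, while the envelope fields let the receiving system index by time and site.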
The following concepts are used for Volterra’s observability features. Click on each one to learn more: