April 13, 2020|Ankur Singla and Devesh Mittal

Overcoming limitations of Service Mesh for the security of distributed app clusters

Co-Author: Devesh Mittal

This is the fourth blog in a series of blogs that cover various aspects of what it took for us to build and operate our SaaS service:

  1. Control plane for distributed Kubernetes PaaS
  2. Global service mesh for distributed applications
  3. Platform security for distributed infrastructure, apps, and data
  4. Application and network security of distributed clusters
  5. Observability across a globally distributed platform
  6. Operations & SRE of a globally distributed platform
  7. Golang service framework for distributed microservices

In the previous blog, we provided insights on the challenge of using cryptographic techniques to secure our platform (infrastructure, apps, and data). This blog deals with the techniques we use to secure the platform against targeted attacks from the network, both from the Internet and from inside. Since apps are no longer constrained to any physical location, traditional perimeter-based firewalls and signature-based security solutions are no longer effective. We will outline the shortcomings of our initial zero-trust security implementation, and why and how we augmented it with machine learning and algorithmic techniques to properly secure our distributed infrastructure + app clusters.


TL;DR (Summary)

  1. Our app-to-app and user-to-app security requirements were complicated by the fact that our platform is running hundreds of microservices (across many Kubernetes clusters) along with a few monolithic apps across multiple cloud providers (AWS and Azure), our global network, and edge locations.
  2. We were also asked to build processes and deliver a robust network + app security solution that gave selective access to shared infrastructure to our developers, our operations teams, our customers’ developers, their operations teams, and their end-users. This got even more complicated by the fact that we had to meet compliance requirements like PCI-DSS, SOC, etc.
  3. While we had already built a zero-trust solution based on our global service mesh and API gateway, it did not solve many security problems like system vulnerability, resource exhaustion, internet attacks from bots/malware, and the needs of a few services that provided unauthenticated access. Also, sometimes it was difficult for our secops team to get API information from the developers — essential for a good zero-trust deployment that relies on access whitelists.
  4. Delivering on these needs would have required our platform team to either procure services from vendors/cloud providers or augment our L3-L7+ datapath (detailed here) with additional security features: network firewall, app firewall, DDoS protection, privileged access management, etc. Since no existing vendor/cloud provider could solve these problems with tooling that provided unified policy, observability, and automated API discovery capabilities, we made the difficult decision to embark on an internal project to augment our datapath and control plane.
  5. As part of building the network + app security capabilities into the datapath, we ended up adding many new features — something that has not yet been achieved in any network datapath to date. We fused algorithmic security capabilities with machine learning on network + app traffic alongside a programmable policy engine that works across the stack — network, HTTP and APIs. As a result, we are now able to deliver different security capabilities depending on the deployment environment, traffic model and latency needs. This has made our global service mesh more secure, reduced the number of false positives for our SOC team, and given a significant reduction in latency + compute resources.

Limitations of Service Mesh & Zero-Trust Architecture

Our platform runs a large number of apps across multiple teams that operate their own clusters in edge, our global network, and AWS and Azure public clouds. While the majority of workloads are microservices orchestrated using Kubernetes, we have a handful of very large-scale monoliths (e.g. Elasticsearch) that we manage using Terraform. Figure 1 demonstrates the distributed nature of our platform. For example, in each of our 18+ global network PoPs (with a few tens to slightly more than a hundred physical servers), we run thousands of app pods. However, on the edge, we have individual customer deployments today with 3000+ active locations (each with one to seven computes) running a few tens of app pods.

Figure 1: Distributed Apps and Data across Volterra Platform

The platform is fully multi-tenant with each node running workloads from different customers (and our own). Since some of these apps are exposed to the public Internet, we need to ensure that all communication to/from the apps is secured. As we had outlined in the previous two blogs, we built a robust identity, authentication + authorization system along with our own L3-L7+ network datapath (VoltMesh) that is used to power our service mesh and API gateway. As shown in Figure 2, this has allowed us to deliver transport-level security across app clusters (mTLS), from users (TLS/mTLS), and employees (mTLS) as well as access control based on authentication+authorization.

Figure 2: Transport Security with Service Mesh and API Gateway

While this zero-trust implementation provides a lot of benefits, it did not automatically solve several security problems:

  1. System Vulnerability — An app may get compromised because of system vulnerability or an authorized but malicious employee — there needs to be a mechanism to detect such situations, automatically take remedial actions, and alert our SRE teams.
  2. Resource Exhaustion — There are many situations where even legitimate and authorized client access can slowly exhaust app resources because of poor design. These situations can significantly degrade performance for other users.
  3. Internet Attacks — Since a lot of our (and our customer) infrastructure and apps are directly exposed to the public Internet, these resources get continuous attacks from bad users, bots, and malware. We need to detect and protect these resources against denial of service, volumetric, and intrusion attacks in order to provide good response time to other users.
  4. Unauthenticated Services — While the majority of the apps require authentication (also needed for zero-trust), there are cases where an app cannot be restricted. As a result, these apps require additional protection using network-based solutions.

Over the last 2.5 years of development on this platform, we also realized that often our developers will incorporate open source apps, containerize them and ask our DevOps team to deploy them on the platform. However, they often lack details on API-level interactions within these apps that are needed by our security team to create policies to whitelist the communication. This is a big roadblock for our zero-trust security implementation as it mandates whitelist policies that only allow APIs used by the apps and block all other traffic. Whenever we made exceptions to this requirement, it left some apps with very basic network-level segmentation, thereby increasing the attack surface.


Augmenting our Zero-trust Service Mesh

As a result, we needed to augment our existing zero-trust security solution with additional security capabilities to handle the issues listed above. We identified a list of additional security capabilities that we had to build into the platform:

  1. API Discovery & Control: Ability to learn APIs in running apps and automate the generation of app/API whitelist policies without reliance on developers
  2. Anomaly Detection: Ability to detect and protect against anomalous behavior for traffic from any source type (trusted or untrusted user/app)
  3. Behavior Profiling: Ability to protect apps against resource exhaustion and app/API attacks (from the Internet or the internal network), and to protect the infrastructure and/or apps against volumetric attacks from the Internet
  4. Logging and Visualization: Ability to log and trace all access for future forensics and/or compliance needs

We decided to use a combination of traditional signature-based techniques, statistical algorithms, and more dynamic machine learning approaches to solve these problems. This required us to make changes to our SaaS backend as well as add new capabilities in our network datapath.


Deep Learning for API Discovery & Control

To lock down the platform, we only allow network connections based on the whitelist of APIs for every app. This requires our security team to coordinate with developers and ensure that our programmable policy engine is fed with the right API information. We quickly realized that it was impossible for our developers to provide this information for apps that were not built using our service framework.

Since our service mesh proxy is in the network path of every access to the app, we decided to learn the APIs and static resources exposed by the app by doing run-time analysis of every access that goes through the proxy. The challenge with this approach is to identify API endpoints by inspecting URLs and separating out the components that are dynamically generated. For example, for an API “/api/user/<user_id>/vehicle/<vehicle_id>”, the proxy will see accesses like:

/api/user/ec3cff89-e804-4c88-a515-a6c412355a71/vehicle/3D7KU28C04G254161
/api/user/f075fe11-af27-4883-913f-0d4f45f6ebd9/vehicle/4S3BK4355T6319316

...

There can be millions of such requests, making it very challenging to decipher. As a result, the identification of dynamic components in these related requests is done using deep learning and graph analysis. We represent the entire URL component set as a graph and then perform graph clustering to find sub-graphs with similar properties using feature sets that capture specific properties of dynamically generated components such as:

  • Structural properties
  • Entropy of nodes
  • Jaccard similarity between various parts of the graph
  • Simple string classification using well understood deep learning methods

As a result, the dynamic components get classified and output from the system looks like:

/api/user/DYN/vehicle/DYN

Using this machine learning of APIs, we can easily and automatically generate a policy that can be enforced by our service mesh proxy. Based on the API endpoints discovered, we also learn other properties like what apps use what APIs to talk with other apps, the typical behavior of these APIs, etc. This allows us to build a service graph that helps our security team to visualize service-to-service interaction for forensics, discovery and API-level micro-segmentation.
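To make the idea of turning learned endpoints into an enforceable policy more concrete, here is a small Python sketch. It deliberately swaps the deep learning and graph clustering described above for a much simpler cardinality heuristic: a path position is treated as dynamic when more than `max_distinct` different values appear under the same prefix. The function names, the threshold, the third sample URL, and the `/api/login` path are all assumptions for illustration and do not describe the actual VoltMesh policy format.

```python
import re
from collections import defaultdict

DYN = "DYN"  # placeholder for a dynamic component, as in the example output above


def learn_templates(paths, max_distinct=2):
    """Collapse path positions that take more than `max_distinct` values under one prefix."""
    split = [tuple(p.strip("/").split("/")) for p in paths]
    depth = max((len(s) for s in split), default=0)
    for i in range(depth):
        # Values seen at position i, grouped by the already-templated prefix.
        seen = defaultdict(set)
        for s in split:
            if len(s) > i:
                seen[s[:i]].add(s[i])
        split = [
            s[:i] + (DYN,) + s[i + 1:]
            if len(s) > i and len(seen[s[:i]]) > max_distinct
            else s
            for s in split
        ]
    return sorted({"/" + "/".join(s) for s in split})


def compile_allowlist(templates):
    """Turn templates into regexes where DYN matches exactly one path segment."""
    patterns = []
    for template in templates:
        parts = [
            "[^/]+" if seg == DYN else re.escape(seg)
            for seg in template.strip("/").split("/")
        ]
        patterns.append(re.compile("/" + "/".join(parts) + "/?"))
    return patterns


def is_allowed(path, patterns):
    """Allow a request only if its path fully matches a learned template."""
    return any(p.fullmatch(path) for p in patterns)


observed = [
    "/api/user/ec3cff89-e804-4c88-a515-a6c412355a71/vehicle/3D7KU28C04G254161",
    "/api/user/f075fe11-af27-4883-913f-0d4f45f6ebd9/vehicle/4S3BK4355T6319316",
    # The next two paths are made up for this example.
    "/api/user/00000000-0000-0000-0000-000000000000/vehicle/HYPOTHETICALVIN01",
    "/api/login",
]
templates = learn_templates(observed)
allowlist = compile_allowlist(templates)
print(templates)
print(is_allowed("/api/user/some-new-id/vehicle/SOMEVIN", allowlist))
print(is_allowed("/api/admin", allowlist))
```

A pure cardinality threshold like this would misfire on real traffic (for example, a static directory with many files), which is part of why richer features and clustering are needed in practice.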


Identifying Shortcomings in Firewalls

Before we embarked on adding the remaining two capabilities (anomaly detection and behavior profiling), we decided to see if existing solutions could help us. While there are many perimeter firewalls and web app firewall products in the market, the majority of these solutions are geared towards protecting Internet-facing apps. They assume that the traffic being serviced is web traffic and provide targeted protection for HTML, JavaScript, SQL, CMS, etc., which makes it relatively easy to write signatures and rules to catch vulnerabilities and known exploits.

While this capability is important for our web traffic, we need to also serve a growing amount of API and machine-to-machine traffic in our environment. To solve this, our security team would have to write app-specific rules that don’t fall under known typical web rules (like OWASP CRS). Usually security administrators know little about the apps and with the dynamic nature of the environment, it becomes even harder to keep track of the app types and structure to write those app-specific rules. As a result, while our platform team provides this capability in our network datapath, it is not often used by our security team.

Another problem, one for which we have a significant amount of data from our network, is that app attacks are becoming a lot more sophisticated over time. The attacker spends days performing reconnaissance to determine the nuts and bolts of the APIs, the app, the underlying infrastructure, and the OS type by looking at HTTP/TCP signatures, etc. Traditional signature and rules-based approaches are of very limited use in these situations, so we decided to continue with our AI-based approach to automatically learn user behavior and distinguish good from bad behavior.


Machine Learning for Behavior Profiling

Most apps have certain workflows (sequences of APIs) and context (data within the APIs) around which different use cases and deployments are designed, and which the users of the apps typically follow. We exploit these properties and train our machine learning algorithms to model “valid” behavioral patterns in a typical user interaction with the app.

Our datapath samples requests/responses for each API along with associated data and sends it to our central learning engine as shown in Figure 3. This engine continuously generates and updates the model of valid behavioral patterns that is then used by the inference engine running in the datapath to alert/block suspicious behavior.

Figure 3: Interaction between Learning Core and Distributed Inference Engines

The learning engine looks at many metrics like the sequence of APIs, gaps between requests, repeated requests to the same APIs, authentication failures, etc. These metrics are analyzed for each user and on an aggregate basis to classify good vs bad behavior. We also perform behavior clustering to identify multiple different sequences of “good behavior.” Let’s take an example to illustrate this:

  • The model analyzes many users to define a sequence of good behavior. Each color represents a different API call:

Figure 4: Normal sequence of APIs

  • The following sequence of APIs will be flagged as suspicious/bad behavior; the system will either mitigate it automatically or generate an alert for an admin to intervene:

Figure 5: Suspicious sequence of APIs
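The contrast between Figures 4 and 5 can be approximated with a very small model. The Python sketch below is an illustrative stand-in, not the learning engine described in this blog: it learns first-order transition counts between API calls from sessions assumed to be good, then scores a new session by the fraction of its transitions that were rare or never seen. The API names, the `min_prob` threshold, and the scoring rule are assumptions; the real system also considers timing, authentication failures, and aggregate behavior.

```python
from collections import Counter, defaultdict

START = "<start>"  # synthetic state marking the beginning of a session


def learn_transitions(sessions):
    """Count how often each API call follows another across training sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        for prev, nxt in zip([START] + list(session), session):
            counts[prev][nxt] += 1
    return counts


def suspicion_score(session, counts, min_prob=0.05):
    """Fraction of transitions in `session` that were rare or unseen in training."""
    if not session:
        return 0.0
    rare = 0
    for prev, nxt in zip([START] + list(session), session):
        seen = counts.get(prev, Counter())
        total = sum(seen.values())
        prob = seen[nxt] / total if total else 0.0
        if prob < min_prob:
            rare += 1
    return rare / len(session)


# Hypothetical API names standing in for the colored blocks in the figures.
browse = ["login", "list_vehicles", "get_vehicle", "logout"]
profile = ["login", "get_profile", "logout"]
model = learn_transitions([browse, browse, browse, profile])

normal_score = suspicion_score(browse, model)
repeated_login_score = suspicion_score(["login"] * 4, model)
print(normal_score, repeated_login_score)
```

A session that follows a learned workflow scores low, while repeated calls to the same API (as in a login brute force) score high because the login-to-login transition never occurs in the training data.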

Since we put this system into production over a year ago, we have continuously refined the model based on usage and customer feedback. We have been able to successfully identify the following types of attacks:

  • Crawlers/scanners doing reconnaissance of an app to find vulnerabilities and also to create a map of all APIs exposed by the app
  • Sustained malicious activity by launching multiple attack vectors to vulnerable APIs
  • HTTP-level denial of service attacks that lead to resource exhaustion (e.g. multiple attempts to login APIs)
  • Data leakage by bots that are designed to scrape for information (e.g. pricing information from an e-commerce site by competitors)
  • Brute force attacks — dictionary attacks for login/passwords
  • Identification of good bots (e.g. Google, Bing crawler)

That said, we also realized that there are some problems with this approach — it cannot uncover low and slow attacks (brute force, app denial of service, scanner) for which we need to apply anomaly detection techniques.


Algorithmic + AI-based Anomaly Detection

Sometimes, we see highly sophisticated attacks that use large distributed botnets that pass under the radar of our behavior analysis technique. Examples of such attacks are:

  • Distributed brute force attack
  • Distributed dictionary-based account takeover attacks
  • Distributed scanner to find vulnerabilities and exploit them from multiple clients
  • HTTP DoS attack from a highly distributed botnet
  • Botnet targeting a small subset of compute-heavy APIs
  • Botnet targeting a large set of data-heavy APIs to exhaust resources

Since our network datapath collects information from each node across our global network, it becomes relatively easy to analyze a particular app’s aggregate metrics like request rate, error rate, response throughput, etc. This analysis allows us to detect distributed attacks and mitigate them (at every node) by identifying the users that could be part of such botnets. As an example, suppose we are trying to detect anomalies across different time windows (last 5 mins, 30 mins, 4 hours, 24 hours) by looking at request rates. If the request rate is high within a given time window, the system performs the following deeper analysis of access logs:

  • Analyze whether the source IP address spread is higher or lower than typical. Requests from an unusually large number of unique source IPs indicate a highly distributed attack.
  • For such distributed attacks, run a more aggressive version of user behavior analysis on each of the source IP addresses to assign suspicion scores and take the configured mitigation actions.
  • Analyze the API endpoint spread. If most requests went to a smaller number of API endpoints than usual, it indicates a brute force or DoS attack. If the API endpoint spread was larger than usual, it indicates a distributed crawler, scanner or DoS attack.
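The checks above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the production detector: the `Baseline` structure, the single multiplicative `factor` used as the threshold for "higher or lower than typical", and the log tuple layout `(timestamp, source_ip, endpoint)` are all hypothetical.

```python
from dataclasses import dataclass

# The time windows mentioned above, in seconds: 5 mins, 30 mins, 4 hours, 24 hours.
WINDOWS_SECONDS = (5 * 60, 30 * 60, 4 * 3600, 24 * 3600)


@dataclass(frozen=True)
class Baseline:
    """Typical values for one app and one window size (assumed to be learned elsewhere)."""
    requests_per_sec: float
    unique_ips: int
    unique_endpoints: int


def analyze_window(logs, now, window, baseline, factor=3.0):
    """Return a list of findings for the last `window` seconds of access logs."""
    recent = [(ip, ep) for ts, ip, ep in logs if now - window <= ts <= now]
    rate = len(recent) / window
    if rate <= factor * baseline.requests_per_sec:
        return []  # request rate is not unusual; skip the deeper analysis

    findings = ["high_request_rate"]
    ips = {ip for ip, _ in recent}
    endpoints = {ep for _, ep in recent}
    if len(ips) > factor * baseline.unique_ips:
        # Highly distributed: each source IP gets stricter behavior analysis.
        findings.append("distributed_sources")
    if len(endpoints) < baseline.unique_endpoints / factor:
        findings.append("narrow_endpoint_spread")  # brute force or DoS
    elif len(endpoints) > factor * baseline.unique_endpoints:
        findings.append("wide_endpoint_spread")  # crawler, scanner or DoS
    return findings


# Synthetic example: 3000 login requests in 5 minutes from 3000 distinct IPs.
now = 1_000_000
attack_logs = [
    (now - (i % 300), f"10.0.{i // 250}.{i % 250}", "/api/login")
    for i in range(3000)
]
baseline = Baseline(requests_per_sec=1.0, unique_ips=50, unique_endpoints=20)
attack_findings = analyze_window(attack_logs, now, WINDOWS_SECONDS[0], baseline)
print(attack_findings)
```

In a real deployment the baseline would differ per app and per window, and a "distributed_sources" finding would feed the per-IP suspicion scoring rather than just produce a label.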

While anomaly detection has always been an important technique for intrusion detection and prevention (IDS/IPS) in firewall appliances, these appliances are unable to mitigate global app-layer attacks. With our ability to perform API markup and learning across our global platform, we are now able to suppress attacks at the source across our distributed network.


Gains from Machine Learning for App + Network Security

While we were extremely satisfied with our zero-trust implementation based on service mesh and API gateway, we realized that it was not comprehensive enough to secure distributed app clusters from vulnerabilities and malicious attacks. We had to augment it with machine learning for behavior analysis + anomaly detection alongside traditional signature + rule-based techniques to provide a better security solution.

We have seen three significant gains from the addition of distributed inference engines in our L3-L7+ network datapath along with the learning core running in our centralized SaaS:

  1. End-to-end zero-trust network — we now have the ability to enforce API-level micro-segmentation across the entire platform (100% without any gaps) — the entire network is a globally distributed proxy with API-level access and zero network-level access.
  2. Reduction in false positives (by >82%): through the ability to continuously tune our models, we have significantly reduced the number of false positives that our NOC and SOC teams have to deal with. This was a big concern with traditional tools, which raised so many alarms that they became entirely useless.
  3. Significant reduction in latency and compute utilization by gradually migrating more of the tasks that were performed by traditional rules and signature-based algorithms to our new machine learning core.

Network and app security is a never-ending runway and it looks like we still have a long backlog of new features to add. We will come back in the near future to share additional insights into the incremental algorithms and techniques we have implemented.


To be continued…

This series of blogs will cover various aspects of what it took for us to build and operate our globally distributed SaaS service with many app clusters in public clouds, our private network PoPs, and edge sites. Next up will be “Observability across our Globally Distributed Platform” (coming soon)…