Every cloud provider makes it possible to create geographically distributed environments. I'm going to take a closer look at this using Google Cloud Platform (GCP) as an example.
Why do we need geographically distributed environments? What motivates their creation? There are three main reasons: high availability, disaster recovery, and performance.
High availability describes a system that is available most or all of the time. It avoids single points of failure through redundancy of system components and risk diversification. Redundancy means that system components have backups: if one instance fails, another remains available and the system switches to it. This switch is called "failover." This simple method is used even in aerospace engineering.
Distributing infrastructure geographically allows us to avoid single points of failure and mitigate risks related to the geographical location of a system and its data.
Disaster recovery (DR), the set of procedures and tools used to recover infrastructure in the event of a disaster, affects high availability. DR plans always include RTO and RPO. RTO, the recovery time objective, is the time in which infrastructure can be completely restored following a disaster and function at full capacity. RTO affects high availability directly: if the RTO is long, there is no high availability. RPO, the recovery point objective, is the maximum period for which data can be lost as a result of a disaster. It doesn't affect HA directly but is very important. Geographical distribution can be part of a disaster recovery plan: if a natural disaster (e.g. an earthquake) impacts a system in one location, a system in another location keeps functioning and we can fail over to it.
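To make the RPO idea concrete, here is a minimal sketch (the function name and the numbers are hypothetical, not part of any GCP API): the worst-case data loss is the gap between two consecutive backups, so a backup schedule meets an RPO target only if its interval does not exceed that target.

```python
from datetime import timedelta

def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst-case data loss equals the gap between two backups,
    so the backup interval must not exceed the RPO target."""
    return backup_interval <= rpo

# Snapshots every 15 minutes against a 1-hour RPO target.
print(meets_rpo(timedelta(minutes=15), timedelta(hours=1)))  # True
print(meets_rpo(timedelta(hours=4), timedelta(hours=1)))     # False
```

The same back-of-the-envelope reasoning applies to RTO: the slowest step in your restore procedure bounds how quickly the system can be back at full capacity.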
Any application must transfer data from the user to the app and back, and probably between application components. In the era of high-speed Internet, this is usually not a problem. But what if your data center is in California and you have many users in Russia who transfer significant amounts of data and get huge amounts in return? This could be an issue because of network overhead: transferring data could take too long, leading to performance that doesn't satisfy system requirements. In that case, geographic distribution is beneficial. Moving data center or application components closer to the user's location decreases network overhead and improves performance.
Implementing a Distributed Environment Using GCP
GCP infrastructure distribution
Currently, Google's infrastructure extends to 23 regions, 70 availability zones, and 140 network edge locations. Let's take a look at these regions, zones, and edge locations.
Regions are independent geographic areas that consist of zones. They are connected to each other by a network but are mostly independent of one another. The picture illustrates geographic regions on a map.
A zone is a deployment area for Google Cloud resources within a region. A zone includes one or more data centers and should be considered a single failure domain within its region. Usually there is more than one zone in a region. Deploying components of your cloud infrastructure in different zones within one region is called a multizone deployment; deploying them in zones across different regions is called a multiregional deployment.
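As a quick illustration of the zonal/multizone/multiregional distinction, here is a small sketch that classifies a deployment from its zone names. It assumes the usual GCP naming convention that a zone name is the region name plus a one-letter suffix (e.g. `us-central1-a` is a zone of region `us-central1`); the helper functions themselves are hypothetical.

```python
def region_of(zone: str) -> str:
    # A GCP zone name is its region name plus a zone suffix,
    # e.g. "us-central1-a" belongs to region "us-central1".
    return zone.rsplit("-", 1)[0]

def deployment_kind(zones: list) -> str:
    """Classify a set of deployment zones as zonal, multizone
    (several zones, one region) or multiregional."""
    regions = {region_of(z) for z in zones}
    if len(regions) > 1:
        return "multiregional"
    return "multizone" if len(set(zones)) > 1 else "zonal"

print(deployment_kind(["us-central1-a", "us-central1-b"]))  # multizone
print(deployment_kind(["us-central1-a", "asia-east1-a"]))   # multiregional
```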
GCP’s services themselves can be global (multi-regional), regional, or zonal.
An edge location is an edge (or, let's say, a tip) of Google's dedicated network. It might be a Cloud Interconnect point (which allows you to connect your network directly to Google's dedicated network) or a CDN edge point that makes it possible to deliver content faster. We'll take a look at the Cloud CDN service a bit later. The picture shows GCP's network and edge locations.
Components for Designing Geographically Distributed Systems
Now we understand how geographic distribution can help and how GCP's infrastructure is distributed. The next step is to take a look at particular services, the bricks with which we can build our geographically distributed environment. Let's start with Cloud VPC and see how it can help build such an environment.
VPC stands for Virtual Private Cloud. A VPC is global, scalable, and flexible. It consists of VPC networks inside of which you can create resources such as VMs, Kubernetes nodes, etc. These networks can span different zones and regions and are logically isolated from each other. All Compute Engine VM instances, GKE clusters, and App Engine flexible environment instances rely on a VPC network for communications. The network connects the resources to each other and to the Internet. Routes and firewall rules make it possible to control communication between networks, resources, and the Internet. Let's take a look at an example.
The main difference, and one of the advantages, of GCP's VPCs compared with other cloud providers is that they are not tied to a region. They are global and multi-region by default. You simply create subnets in the necessary regions and put workloads there, without having to set up VPC peering to connect workloads in different regions. GCP's VPCs are global by nature, which makes things easier for architects and developers.
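To illustrate the "one global VPC, one subnet per region" model, here is a minimal sketch using Python's standard `ipaddress` module. The subnet ranges and region names are illustrative assumptions, not a real deployment: any non-overlapping private ranges would work the same way.

```python
import ipaddress

# One global VPC with one subnet per region (ranges are
# hypothetical examples, not GCP defaults).
SUBNETS = {
    "us-central1":  ipaddress.ip_network("10.0.0.0/24"),
    "europe-west1": ipaddress.ip_network("10.0.1.0/24"),
    "asia-east1":   ipaddress.ip_network("10.0.2.0/24"),
}

def subnet_region(ip):
    """Return the region whose subnet contains the given address,
    or None if the address is outside the VPC."""
    addr = ipaddress.ip_address(ip)
    for region, net in SUBNETS.items():
        if addr in net:
            return region
    return None

print(subnet_region("10.0.1.17"))  # europe-west1
```

Because all three subnets live in the same VPC network, a workload at `10.0.0.5` in us-central1 can reach `10.0.1.17` in europe-west1 directly, subject only to firewall rules.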
Using VPC networks, we can distribute environment resources geographically. But how can we set up failover and route users from particular locations to particular regional resources?
Cloud Load Balancing
A load balancer's task is to distribute user traffic across instances of an application. Though its main goal is to spread the load, it also offers features like health checks, which make it usable for failover, and it can distribute load between regions to improve performance. There are a few types of load balancers in GCP.
Internal load balancers spread load inside Google Cloud. External load balancers distribute traffic from the Internet to a VPC network. External load balancers can be regional or global: a global load balancer can distribute traffic between instances in different regions, while a regional LB, as its name suggests, distributes load only between instances in one region.
Let’s consider the design in the illustration below.
We can use the external global load balancer as an entry point for users around the world. It allows us to route users in Asia to instances in the Asian region and users from the US to the us-central region accordingly. The internal load balancer makes it possible to spread the load between internal tiers in one region and between zones. This resolves the third problem mentioned at the beginning of the article: performance. But what about high availability and disaster recovery?
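The routing idea above can be sketched as a simple lookup. The area-to-region mapping below is a hypothetical simplification of what the global load balancer does based on the user's location; the region names match the two-region design just described.

```python
# Hypothetical mapping from a user's rough location to the
# nearest region with deployed instances.
REGION_FOR_AREA = {
    "asia": "asia-east1",
    "us":   "us-central1",
}

def route(user_area, default="us-central1"):
    """Pick a backend region for a user, falling back to a
    default region when the user's area is not mapped."""
    return REGION_FOR_AREA.get(user_area, default)

print(route("asia"))    # asia-east1
print(route("europe"))  # us-central1 (fallback)
```

In the real service this decision happens at Google's edge, using anycast and a single global IP, so clients never need region-specific DNS names.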
Health checks do the job. You have to set up a health check for each load balancer. GCP has global and regional health check systems that connect to backends on a configurable, periodic basis. Every attempt is called a probe, and each health check system is called a prober. GCP records the result of each probe (success or failure). A backend that responds successfully within the configured time for the configured number of probes is considered healthy; a backend that fails to respond successfully is considered unhealthy.
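The probe-counting logic can be sketched roughly like this. The class, its defaults, and the state handling are simplified assumptions for illustration, not GCP's actual implementation; the point is that a backend flips state only after several consecutive probe results, which damps flapping on a single bad probe.

```python
class HealthCheck:
    """Threshold-based health checking: a backend becomes healthy
    after `healthy_threshold` consecutive successful probes and
    unhealthy after `unhealthy_threshold` consecutive failures."""

    def __init__(self, healthy_threshold=2, unhealthy_threshold=3):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True
        self._successes = 0
        self._failures = 0

    def record_probe(self, success):
        if success:
            self._successes += 1
            self._failures = 0
            if self._successes >= self.healthy_threshold:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0
            if self._failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy

hc = HealthCheck()
hc.record_probe(False)  # one failed probe: still healthy
print(hc.healthy)       # True
hc.record_probe(False)
hc.record_probe(False)  # third consecutive failure
print(hc.healthy)       # False
```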
Cloud load balancers fail over automatically: they stop sending traffic to unhealthy instances and send it only to healthy ones. This resolves the disaster recovery and high availability issues I mentioned at the beginning of the article.
Having external load balancers also makes infrastructure easier to maintain, deploy, and scale. You don't have to use "DNS hacks" to distribute traffic based on location or on the availability of a service in a particular region. I am convinced that this makes automation and DR actions much easier to implement.
So, considering the design in the illustration above, we can use the advantages of a geographically distributed environment to improve high availability and performance, as well as create a good disaster recovery plan. However, there is one more tool that can make the environment even more distributed and further improve performance.
Sending user traffic to the closest region helps, but it doesn't resolve performance issues completely: regions cover large areas, and a regional data center can still be far from a user. Cloud CDN (Content Delivery Network) uses Google's global edge network to serve content closer to users. Cloud CDN simply caches data at the edge location closest to the user. When a user requests content for the first time, the CDN fetches it from the origin server (e.g. an instance or storage bucket) and caches it at the edge location. When the user requests the content again, or another user requests it from a nearby location, it is served directly from the edge location. Cloud CDN is typically used for static content like pictures, videos, and web server responses; it doesn't make sense to cache frequently changing content. The picture below illustrates how Cloud CDN works.
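The cache-miss/cache-hit flow can be sketched as follows. The origin content, paths, and class are hypothetical; a real CDN also handles TTLs, invalidation, and cacheability headers, which this sketch omits.

```python
# Hypothetical origin content keyed by request path.
ORIGIN = {"/logo.png": b"<image bytes>"}

class EdgeCache:
    """Minimal sketch of CDN edge caching: the first request for a
    path misses and goes to the origin; later requests served by
    the same edge hit the cache."""

    def __init__(self, origin):
        self.origin = origin
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, path):
        if path in self.cache:
            self.hits += 1            # served from the edge
        else:
            self.misses += 1          # fetched from origin, then cached
            self.cache[path] = self.origin[path]
        return self.cache[path]

edge = EdgeCache(ORIGIN)
edge.get("/logo.png")  # first request: miss, pulls from origin
edge.get("/logo.png")  # second request: hit, served at the edge
print(edge.misses, edge.hits)  # 1 1
```

This is why caching pays off only for content that many users request and that rarely changes: each miss still costs a round trip to the origin.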
In this article, we have highlighted the importance of geographical distribution and how it can improve availability, disaster recovery and performance in your system. Also, we’ve provided a high-level overview of several GCP cloud components: Cloud VPC, Cloud Load Balancing, and Cloud CDN, all of which can help us leverage the advantages of Google’s distributed infrastructure and implement a geographically distributed environment.
Going global with GCP is much easier because most of its services are global by nature. GCP makes it possible to be more efficient in aspects such as cost, automation, infrastructure design, and performance. In my opinion, one of the most important skills in engineering (and software engineering isn't an exception) is the ability to select the proper tool for a specific task. If you're building a global, geographically distributed environment, GCP is an excellent choice.