Does running your application in the cloud mean that it’s suddenly able to survive any problem that arises? Alas, no. Even while some foundational services of distributed systems are built for high availability, a high performing cloud application needs to be explicitly architected for fault-tolerance. In this multi-part blog series, we will walk through the various application layers and example how to build a resilient system in the CenturyLink Cloud cloud. Over the course of the next few posts, we will define what’s needed to build a complete, highly available system. The reference architecture below illustrates the components needed for a fictitious eCommerce web application.
In this first post, we look at a core aspect of every software system: storage. What type of storage is typically offered by cloud vendors?
- Temporary VM storage. Some cloud providers offer gobs of storage with each VM instance, but with the caveat that the storage isn’t durable and does not survive server shutdown or server failure. While this type of cheap and easy accessible block storage is useful in some situations, it’s not as familiar to enterprise IT staff who are used to storage that’s durable by default.
- Persistent VM storage. This sort of block storage is attached to a VM as durable volumes. It can survive reboots, resets, and can even get detached from one server and reattached to another. Multiple servers cannot access the same volume, but this is ideal for database servers and other server types that need reliable, durable, high-performing storage.
- Object storage. What happens if you want to share data between consumers? Object storage offers HTTP access to a highly available repository that can hold virtually any file type. This is a great option for storing business documents, software, server backups, media files, and more. It is also a useful alternative for secure file transfer.
At CenturyLink Cloud, we offer customers two options: persistent block storage and object storage.
Provisioning Persistent Block Storage
Each virtual server launched in the CenturyLink Cloud cloud is backed by one or more persistent storage volumes. Product details:
- Block storage volumes can be of any size up to 1 TB apiece. Why does this matter? Instead of over-provisioning durable storage – which can happen with cloud providers that offer fixed “instance sizes” – CenturyLink Cloud volumes can be any size you want. Only pay for what you need, and resize the drive as necessary.
- The volumes are attached as ISCSI or NFS and offer at least 2500 IOPS. Why does this matter? Run IO-intensive workloads with confidence and get reliable performance thanks to an architecture that minimizes latency and network hops.
- Block storage is backed by SANs using RAID 10 which provides the best combination of write performance and data protection. Why does this matter? We’ve architected highly available storage for you. Data is striped across drives and mirrored within RAID sets. This means you that you won’t lose your data even if multiple underlying disks failed.
- We take daily snapshots of each storage volume automatically. Standard storage volumes have 5 days of rolling backups, and Premium storage volumes have 14 days of rolling backups with the 5 most recent ones replicated to a remote data center. Why does this matter? This gives you a built-in disaster recovery solution! While it may not be the only DR strategy you employ, it provides a baseline RPO/RTO to build around.
The way that CenturyLink Cloud has architected its block storage means that you do not need to specifically architect for highly available storage unless you are doing multi-site replication.
For our reference solution, provisioning persistent block storage is easy. Our web servers – based on Windows Server 2012 Data Center Edition – have 42 GB of durable storage built-in, and I’ve added another 100 GB volume to store the web root directory and server logs.
There are multiple ways to layout the disks for a database server, and in this case, we’re splitting the databases and transaction logs onto separate persistent volumes.
Here, we have running servers backed by reliable, high-performing storage. What happens if you find out later that you need more storage? The CenturyLink Cloud platform makes it easily to instantly add more capacity to existing volumes, or add entirely new volumes that are immediately accessible within the virtual machine.
Provisioning Object Storage
Object Storage is a relatively recent addition to the CenturyLink Cloud cloud and gives customers the chance to store diverse digital assets in a highly available, secure shared repository. Some details:
- Object Storage has multiple levels of redundancy built in. Within a given data center, your data is replicated across multiple machines, and all data is instantly replicated to a sister cluster located within the same country. Why does this matter? Customers can trust that data added to Object Storage will be readily available even when faced with unlikely node or data center failures.
- Store objects up to 5 GB in size. Why does this matter? Object Storage is a great fit for large files that need to be shared. For example, use Object Storage for media files used by a public website. Or upload massive marketing proofs to share with a graphics design partner.
- The Object Storage API is Amazon S3-compliant. Why does this matter? CenturyLink Cloud customers can use any of the popular tools built for interacting with Amazon S3 storage.
Customers of CenturyLink Cloud Object Storage do not need to explicitly architect for high availability since the service itself takes care of it.
In our reference solution, Object Storage is the place where website content like product images are stored. We added a new Object Storage “bucket” for all the website content.
Once the bucket was created and permissions applied, we used the popular S3 Browser tool to add CSS files and images to bucket.
A highly available system in the cloud is often a combination of vendor-provided and customer-architected components. In this first post of the blog series, we saw how the CenturyLink Cloud platform natively provides the core high availability needed by cloud systems. Block storage is inherently fault tolerant and the customer doesn’t have to explicitly ask for persistent storage. Object storage provides easy shared access to binary objects and is geo-redundant by default.
Storage provides the foundation for a software system, and at this point we have the necessary pieces to configure the next major component: the database!
We generate massive amounts of data every day. Research firm IDC estimates that 90% of the world’s data was created in the last two years, and the volume of data worldwide doubles every two years. Enterprises are a key contributor to this data explosion as we produce and share digital media, create global systems that collect and generate data, and retain an increasing number of backup and archive data sets. This rapid storage growth puts pressure on IT budgets and staff who have to constantly find and allocate more usable space. CenturyLink Cloud wants to help make that easier and just launched a new Object Storage service to provide you a secure, scalable destination for business data.
What is Object Storage from CenturyLink Cloud? It’s a geo-redundant, elastic storage system for public and private digital data. Based on the innovative Riak CS Enterprise platform, Object Storage infrastructure is being deployed across three global regions: Canada, United States, and Europe. Each region consists of a pair of CenturyLink Cloud data centers that run Riak CS Enterprise on powerful, bare-metal servers. The Object Storage nodes are deployed in a “ring” configuration where data is evenly distributed across the nodes, thus assuring that your data is available even if multiple nodes go offline. When objects are loaded into one data center, they are instantly replicated to the in-country peer data center. This means that an entire data center can go offline, and you STILL will have uninterrupted access to all of your latest enterprise data.
Before diving into this new service, let’s define a few terms:
- Object. An “object” is any digital asset that is less than 5 GB in size. This could be a video that you display on your public website, a PDF file that you are sharing with a business partner, or a database backup file. If the object is larger than 5 GB, then you can do a multi-part upload!
- Bucket. Objects are stored in buckets. A bucket is a logical container that can hold an unlimited number of objects, but not other buckets.
- Region. CenturyLink Cloud has architected Object Storage with unique clusters in three different geographies. Each geographic region has a pair of data centers that hold all of the data uploaded into that region.
- User. An Object Storage user is different from a CenturyLink Cloud platform user and is created separately. While you may create an Object Storage user to represent an individual person, you may also choose to create users that correspond to an application. For example, you may define a user leveraged by your public website that retrieves images and videos from Object Storage.
- Owner. Each bucket has an owner. This is the user that automatically has full control over the bucket and its objects.
- ACLs. Access Control Lists govern who can manage buckets and see objects. By default, Object Storage does not allow any public access to buckets or objects. If you choose, you can provide public, unauthenticated users with the ability to read individual objects. Or, you can choose specific users that have permission to add objects to buckets or view an object.
Managing Object Storage
Interacting with Object Storage is easy. We’ve added a management interface in our Control Portal for Object Storage administrators. From here, you can view a list of users, add new users, and reset user credentials.
The Control Portal also has a bucket administration component where you can view, create, secure, and delete buckets.
Each bucket can have its own security profile. For a bucket such as “website media”, you may let “All Users” have read access to its objects. For buckets set up to exchange large files with business partners, you would likely add read and write permissions for a user representing the chosen partner.
It’s unlikely that you’ll only use a single interface to interact with your data objects. Thanks to the inherent S3 compatibility offered by Riak CS Enterprise, you don’t have to! There is an entire ecosystem of tools for working with object storage that support an Amazon S3-like interface. Want to use a client tool to upload and delete objects? Then check out a utility like the freemium S3 Browser where you can plug in your Object Storage user credentials (and CenturyLink Cloud Object Storage URL) and manage buckets AND objects.
Looking to mount Object Storage as a drive on your database server so that you can easily create and restore backups? Look to a product like ExpanDrive which makes it easy to add Object Storage as a storage volume.
CenturyLink Cloud is among the first cloud providers to offer native, geo-redundant object storage and we’re excited to see how our customers use this to escape the burden of endless provisioning of on-premises storage! Our Canada region is live today, with the United States and Europe following closely. Existing customers can get started right away, and new customers can take Object Storage for a spin by signing up today.
Companies embrace the cloud because it offers agility, speed to market, self-service, rapid innovation, and yes, cost savings. There are plenty of cases where organizations can save money by using cloud resources, but it’s easy to focus on vendor compute and storage pricing, and forget about all the other financial components of a cloud application. See Joe Weinman’s Cloudonomics for an excellent analysis of how to assess the economic impact of using the cloud. An application can very easily cost MORE in the cloud – but that might still be just fine, since it helps the business shed some CapEx and remove servers from corporate data centers. In this post, we’ll talk about the full scope of pricing cloud applications and give you a useful perspective for assessing the overall cost.
Businesses deploy applications, not servers. A typical application is comprised of multiple servers that perform different roles. For instance, let’s consider an existing, commercial website that receives a healthy amount of traffic. It uses a load balancer to route traffic to one of multiple web servers, leverages a series of application servers for caching and business services, and uses a relational database for persistent storage.
To maximize revenue and customer satisfaction, the application is replicated in another geography for availability reasons and traffic can be quickly steered to the alternate site in the case of a disaster or prolonged outage.
“Hidden costs” often bite cloud users. This is especially true for those who buy from a cloud that offers “cheap virtual cores!” but also require you to buy countless other services to assemble an enterprise-class infrastructure landscape. Let’s look at each area where it’s possible – and likely – that you will incur a charge from your cloud provider.
- Application migration. If you are doing greenfield development in the cloud, then this won’t apply. But if you have existing applications that are moving to the cloud, there are a few migration-related costs. First, there can be a labor cost with doing virtual machine imports. Some cloud providers let you import for free, others charge you. In most cases, there is also a bandwidth charge for the transfer of virtual machine images. Finally, there’s likely a cost for storing the virtual machine image during the import process.
- Server CPU processor. This – along with RAM – is the number most frequently bandied about when talking about the costs of running a cloud application. Some providers let you provision the exact number of virtual CPU cores desired; others provide fixed “instance sizes” that come with a pre-defined allocation of CPUs and memory.
- Server memory. Cloud providers are ratcheting up the amount of RAM they offer to address memory-hungry applications, caching products, and in-memory databases.
- Server storage. There are many different types of storage (e.g. block storage, object storage, vSAN storage) and costs vary with each. Don’t forget to include the cost of storing data backups, virtual machine templates, and persistent disks that survive even after servers have been deleted.
- Bandwidth. It’s easy to forget about bandwidth, but it’s a charge that can bite you if you’re not expecting it! You may need to factor in public bandwidth, intra-data center bandwidth, inter-data center bandwidth, CDN bandwidth, and load balancer bandwidth. Not all of these may apply, and some may not be charged by your cloud provider, but it’s important to check ahead of time. Most cloud providers use the “GB transfer” model, charging for all data transferred – and penalizing customers for bursting above their commitments. CenturyLink Cloud utilizes the 95th percentile billing method, preventing surges in traffic from grossly affecting costs.
- Public IP addresses. Nearly every cloud provider offers a way to expose servers to the public Internet, and some charge for the use of public IP addresses. This is usually a nominal monthly charge, but one to consider for scenarios where there are dozens of Internet-facing servers.
- Load balancing. There is often a charge to not only use a load balancer, but also for the traffic that passes through it.
- VPN and Direct Connect. Cloud users are looking for ways to connect cloud environments to on-premises infrastructure, and vendors now offer a rich set of connectivity options. However, those options come at a cost. Depending on the choice, you could be subjected to fees for setup, operations, and bandwidth associated with these connections.
- Firewalls. This is usually baked into each cloud provider’s native offering, but you will want to check and make sure that sophisticated firewall rules don’t come with an additional charge.
- Server monitoring. Even those cloud servers aren’t in your data center, it doesn’t mean that you don’t need to monitor them! Depending on your monitoring needs, there can be a range of charges associated with standard and advanced monitors for each cloud server.
- Intrusion detection. Given that cloud servers are often accessible through the public Internet, it’s important to use a defense-in-depth approach that includes screening incoming traffic for potential attacks. CenturyLink Cloud is a bit unique in that we offer this at no cost, but you can still get this sort of protection from other vendors – but rarely for free.
- Labor for integrating with on-premises assets. You don’t want to create silos in the cloud, and you will likely spend a non-trivial amount of time integrating with your critical applications, data, identity provider, and network. If this effort requires assistance from the cloud provider themselves, there could be a charge for that time and effort.
- Distributed, disaster recovery environments. Applications fail, and clouds fail. If you require very high availability, you may need to duplicate your application in other geographically-dispersed cloud data centers. You could choose to keep that environment “warm” by synchronizing a data repository while keeping web/application servers offline. Or, you may choose to build a truly distributed system that leverages active infrastructure across geographies. Either way, it’s possible that you’ll incur noticeable charges for establishing replica environments.
- Development / QA environments. Applications may run differently in the cloud than in your local data center. Hence, you could choose to provision pre-production environments in the cloud for building and running your applications.
- System administrator labor costs. One of the wonderful things about the cloud is the widespread automation that makes it possible to provision and maintain massive server clusters without adding to your pool of system administrators. However, there are still activities that require administration. This may involve server patching and software updates, deploying new applications, and scaling the environments. Some of those activities can be automated as well, but you should factor in human costs to your cloud budget.
Places to save money
Given the various charges you may incur by moving to the cloud, how can you optimize your spend and take full advantage of what the cloud has to offer? Here are five tips:
- Don’t over-provision. Gone are the days when you have to request a massive server from an internal IT department because you MAY need the extra resources in the future and don’t want to deal with the hassle of upgrading the server later. CenturyLink Cloud makes it simple to change the number of virtual CPUs, amount of RAM, or amount of storage in seconds. Only spend money on what you need right now, and only pay more when you have to scale up. In addition, don’t settle for cloud providers who force you into fixed “instance sizes” that don’t deliver the mix of vCPU/RAM/storage that your application needs. CenturyLink Cloud encourages you provision whatever combination of vCPU/RAM/storage that you want! In fact, we usually tell customers to under-provision to start with, and ratchet up resources as needed.
- Turn off idle servers. If you decide to create development or QA environments in the cloud, it’s likely that those environments will be fairly quiet over weekends. By shutting those down – and doing it automatically – you can potentially save hundreds or thousands of dollars per year.
- Automate mundane server management tasks. Running maintenance scripts or installing software on a cluster of servers is time consuming and tedious. CenturyLink Cloud provides an innovative Group capability that makes it possible to issue power commands, install software, and run scripts against large batches of servers.
- Add resource limits to prevent runaway provisioning. Elasticity is a foundational aspect of cloud computing, but it’s not a bad idea to establish resource caps. With CenturyLink Cloud for example, customers can define the maximum amount of vCPUs, memory, and storage that any one Group can consume.
- Carefully consider uptime requirements and disaster recovery needs. Even though the cloud makes it easier, it’s still not cheap or simple to build a globally distributed, highly available application. Evaluate whether you need cross-data center availability, or, a defined disaster recovery plan. The simplest solution for CenturyLink Cloud customers is to provision Premium block storage which provides daily snapshots and replication to an in-country data center. In the event of a disaster, CenturyLink Cloud brings up your server in an alternate data center and gets you back in business. If you want to avoid nearly any downtime, then you can architect a solution that operates across multiple data centers. To save money, you could choose to keep the alternate location offline but synchronized so that it could quickly activated if needed.
When considering all the services you need to deploy and operate enterprise-level business applications, the “cheap virtual cores!” pitch is less compelling. It’s about finding a cloud provider that offers an all-up, integrated offering that gives you the set of services you need to deploy and maintain a robust, connected infrastructure. Give CenturyLink Cloud a try and see if our innovative platform is exactly what you’re looking for!
It’s easy for cloud customers to get confused about the roles and responsibilities of their internal team and their cloud vendor. That confusion is especially evident when it comes to application availability and business continuity planning. How does disaster recovery differ from high availability? Does my cloud provider automatically load balance my application servers? The answers to these questions are critical, but sometimes overlooked until a crisis occurs. In this post, we’ll talk about load balancing, high availability, and disaster recovery in the cloud, and what the CenturyLink Cloud’s cloud infrastructure has to offer.
What is it?
Wikipedia describes load balancing as:
Load balancing is a computer networking method to distribute workload across multiple computers or a computer cluster, network links, central processing units, disk drives, or other resources, to achieve optimal resource utilization, maximize throughput, minimize response time, and avoid overload. Using multiple components with load balancing, instead of a single component, may increase reliability through redundancy.
You commonly see this technique employed in web applications where multiple web servers work together to handle inbound traffic. There are at least two reasons why load balancing is employed:
- The required capacity is too large for a single machine. When running processes that consume a large amount of system resources (e.g. CPU and memory), it often makes sense to employ multiple servers to distribute the work instead of constantly adding capacity to a single server. In plenty of cases, it’s not even possible to allocate enough memory or CPU to a single machine to handle all of the work! Load balancing across multiple servers makes it possible to host high traffic websites or run complex data processing jobs that demand more resources than a single server can deliver.
- Looking for more reliability and flexibility in a solution deployment. Even if you *could* run an entire server application on a single server, it may not be a good idea. Load balancing can increase reliability by providing many servers able to do the same job. If one server becomes unavailable, the others can simply pick up the additional work until a new server comes online. Software updates become easier since a server can simply be taken out of the load balancing pool when a patch or reboot is necessary. Load balancing gives system administrators more flexibility in maintaining servers without negatively impacting the application as a whole.
Load balancing can be accomplished using either a “push” or a “pull” model. For web applications or database clusters that sit behind a load balancer, inbound requests are pushed to the pool of servers based on an algorithm such as round-robin. In this scenario, servers await traffic sent to them by the load balancer. It’s also possible to use a “pull” model where work requests are added to a centralized “queue” and a collection of servers retrieve those requests from that queue when they are available. For instance, consider big data processing scenarios where many servers work to analyze data and return results. Each server takes a chunk of work and the overall processing load is distributed across many machines.
How can CenturyLink Cloud help?
CenturyLink Cloud offers multiple load balancing options to our customers. All customers have access to a free, shared load balancer. This load balancer service – based on the powerful Citrix Netscaler product – provides a range of capabilities including SSL offloading for higher performance, session persistence (known as “sticky sessions”), and routing of TCP, HTTP and HTTPS traffic for up to three servers. To use this service today, send a request to firstname.lastname@example.org. We plan to launch a self-service version of this capability in the very near future.
If you’re looking for more control over the load balancing configuration or have higher bandwidth needs, you can deploy a dedicated load balancer (virtual appliance) into the CenturyLink Cloud cloud. This “bring your own load balancer” option leverage internal expertise you may have with a particular vendor. It also gives you complete control over the load balancer setup so that you can modify the routing algorithm or enable/disable features that matter to your business.
What is it?
Returning to Wikipedia, high availability is defined as:
High availability is a system design approach and associated service implementation that ensures a prearranged level of operational performance will be met during a contractual measurement period.
High availability is described through service level agreements and achieved through an architecture that focuses on constant availability even in the face of failures at any level of the system. While load balancing introduces redundancy, it’s not a strategy that alone can provide high availability. Servers sitting behind a load balancer may be running, but that doesn’t mean that they are available!
Availability addresses the ability to withstand failure from all angles including the network, storage, and even the data center itself. Enterprise cloud services like those from CenturyLink Cloud are built on a highly available architecture that uses redundancy at all levels to ensure that no single component failure in a data center impacts overall system availability. This includes “passive” redundancy built into data centers to overcome power or internet provider failures, as well as “active” redundancy that leverages sophisticated monitoring to detect issues and initiate failover procedures.
All of our customers get platform-level high availability when they use the CenturyLink Cloud cloud “out of the box.” That means that you can rely on us for your workloads knowing that our architecture is well-designed and highly redundant. However – back to the introductory paragraph – it’s the customer’s responsibility to design a highly-available application architecture. Simply deploying an application to our cloud doesn’t make it highly available. For example, if you deploy a single Microsoft SQL Server instance in the CenturyLink Cloud cloud, you do not have a highly available database. If that database server goes offline or network access is interrupted, your application’s availability will be impacted. To design a highly available Microsoft SQL Server solution, you have multiple options. One choice is to create a cluster of database servers (where all nodes are active at the same time, or, nodes sit passively by waiting to be engaged) that access data from a shared disk. When a failure in the active node is detected, the alternate node is automatically called into action.
How can CenturyLink Cloud help?
Designing highly available systems is complex. Unfortunately, no cloud provider can offer a checkbox labeled “Make this application highly available!” in their cloud management portal. Crafting a highly available system involves a methodical approach that navigates through every single layer of the system and identifies single points of failure that should be made redundant. For components that cannot be made redundant, it’s important to make sure that the application can continue to run even if that component becomes unavailable.
The CenturyLink Cloud professional services team consists of skilled, experienced architects who have designed and built cloud-scale solutions for customers. They can sit with your team and make sure that you’ve taken advantage of every relevant feature that CenturyLink Cloud has to offer, while helping you make sure that your system landscape is constructed in a way that will ensure continual availability.
Don’t forget to regularly test your high availability design in order to uncover weak points or ensure that configurations remain valid.
What is it?
Once more we turn to Wikipedia which defines disaster recovery as:
Disaster recovery (DR) is the process, policies and procedures that are related to preparing for recovery or continuation of technology infrastructure which are vital to an organization after a natural or human-induced disaster. Disaster recovery is a subset of business continuity. While business continuity involves planning for keeping all aspects of a business functioning in the midst of disruptive events, disaster recovery focuses on the IT or technology systems that support business functions.
DR is all about how you handle unexpected events. Typically, your cloud provider has to declare a disaster before explicitly initiating DR procedures. A brief network outage or storage failure in a data center is usually not enough to trigger a disaster response. There are two phrases that you often hear when defining a DR plan. A recovery point objective (RPO) describes the maximum window of data that can be lost because of a disaster. For example, an RPO of 12 hours means that it is possible that when you get back online after a disaster, you may have lost the most recent 12 hours of data collected by your systems. A recovery time objective (RTO) identifies how long the IT systems (and processes) can be offline before being restored. For example, an RTO of 48 hours means that it may take two days before the systems lost in the disaster are brought back online and becoming usable again.
How can CenturyLink Cloud help?
CenturyLink Cloud customers have disaster protection natively in the platform. We offer two classes of storage: standard and premium. The major difference is that standard storage get five days of rolling backups within a given data center, while premium storage users get fourteen days of rolling backups including replication to an in-country data center. CenturyLink Cloud is powered by global data centers in multiple countries and we use storage replication to enable you to get back online within 8 hours (RTO) and with a maximum RPO of 24 hours.
While this provides assurances against losing all of your data in the event of a disaster, it still may not provide the level of business continuity that you need. If your business cannot tolerate more than a few moments of downtime, even in the event of a disaster, then it’s critical to architect a solution that can withstand the loss of an entire data center. Returning to our earlier Microsoft SQL Server example, consider the ways to construct a highly available database that remains online with minimal data loss, even during a disaster. SQL Server offers replication technologies like database mirroring and AlwaysOn that make it possible to do near-real time replication across geographies.
The experts in the CenturyLink Cloud services team can help you identify all the DNS, networking, compute and storage considerations for building systems that are not only highly available within a data center, but across data centers.
It’s often the case that load balancing, high availability and disaster recovery lapses don’t surface until it’s too late. While CenturyLink Cloud does everything we can to architect our platform for maximum availability and resiliency, our customers still retain responsibility for deploying their systems in a manner that meets their performance and business continuity needs. We are eager to talk to you about how to validate your existing cloud applications or design new solutions that can function at cloud scale. Contact our services team today!