The decision to use public cloud infrastructure for big data management is a fairly simple one for IT teams who have an immediate need for storage and computing or who are driven by an organisation-wide initiative.

For those weighing their options between on-premise and public cloud, there are several criteria to consider in deciding on the best deployment route.

Data Location – Where is the data generated?

Data can be viewed as having “mass”, which makes it difficult (and expensive) to move from where it is stored to where it is processed. If the big data management environment is not the primary location for the data, best practice is to establish the enterprise data hub close to where the data is generated or stored. This reduces transfer cost and effort, especially for the large volumes common to big data management workloads. That said, IT teams should look closely at the nature and use of the data: depending on volume and velocity, it may be practical to stream it in small quantities or to transfer it in large, single blocks to an on-premise environment. If the data is generated in the public cloud, or is stored long term in cloud storage (for example, in an object store used for backup or geo-locality), public cloud deployment often becomes the more natural choice.
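As a rough illustration of data “mass”, the time needed to move a dataset over a network link can be estimated from its size and the usable bandwidth. The following is a minimal sketch in Python; the function name, the 70% link-efficiency default, and the example figures are assumptions chosen for illustration, not measurements.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(data_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Estimate the days needed to move `data_tb` terabytes over a network link.

    `efficiency` is a hypothetical fraction of the nominal bandwidth that is
    actually usable after protocol overhead and contention.
    """
    bits_to_move = data_tb * 1e12 * 8             # decimal terabytes -> bits
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return bits_to_move / usable_bits_per_second / SECONDS_PER_DAY


# Hypothetical example: 100 TB over a 1 Gbps link at 70% efficiency.
print(f"{transfer_days(100, 1):.1f} days")
```

Under these assumed figures the one-off transfer takes about 13 days, which is the kind of result that favours placing the data hub next to the data rather than moving the data to the hub.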

Workload Types – What are the workload characteristics?

For periodic batch workloads such as MapReduce jobs, enterprises can realise cost savings by running the cluster only for the duration of the job and paying for that usage, rather than keeping the cluster running at all times. This is especially true if the workload runs for only a couple of hours a day or a couple of days a week. For workloads with continuous, long-running performance needs, such as Apache HBase and Impala, the overhead of repeatedly commissioning and decommissioning a cluster may not be justified.
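The trade-off can be made concrete with some simple arithmetic. The sketch below compares a cluster that is started for each batch run with one that is left running all month. The node count, hourly rate, and per-run start-up overhead are hypothetical values, not any provider's actual pricing.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def transient_cluster_cost(nodes: int, hourly_rate: float, job_hours: float,
                           runs_per_month: int, startup_hours: float = 0.0) -> float:
    """Monthly compute cost when the cluster exists only while a job runs.

    `startup_hours` is the assumed billed overhead of commissioning and
    decommissioning the cluster for each run.
    """
    billed_hours = (job_hours + startup_hours) * runs_per_month
    return nodes * hourly_rate * billed_hours


def always_on_cluster_cost(nodes: int, hourly_rate: float) -> float:
    """Monthly compute cost when the cluster is never shut down."""
    return nodes * hourly_rate * HOURS_PER_MONTH


# Hypothetical example: 10 nodes at 0.50 per node-hour, a 2-hour job run
# 30 times a month, with 15 minutes of start-up overhead per run.
transient = transient_cluster_cost(10, 0.50, job_hours=2, runs_per_month=30,
                                   startup_hours=0.25)
always_on = always_on_cluster_cost(10, 0.50)
print(f"transient: {transient:.2f}, always-on: {always_on:.2f}")
```

With these assumed numbers the transient cluster costs 337.50 a month against 3,650.00 for the always-on cluster. As `job_hours * runs_per_month` approaches `HOURS_PER_MONTH`, the gap closes and the start-up overhead becomes pure cost, which is the situation described for continuous workloads.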

Performance Demands – What are the performance needs?

One of the underlying tenets of Hadoop is tightly coupled units of compute and local storage that scale out linearly and together. This proximity of computation to data enables Hadoop to parallelise the workload and process massive amounts of data in a short period of time. A common foundation of cloud architectures, however, is pools of shared storage and virtualised compute capacity connected over a network.

Compute and storage then scale independently, but the network adds latency, and shared storage can become a performance bottleneck for a high-throughput MapReduce job. The exact performance needs vary from workload to workload. The ecosystem of cloud vendors offers enterprises many architectural options and configurations that can address the particular needs of a workload more directly. For example, IT teams should examine how close storage is to compute, and how heavily resources are shared within the service, as factors that affect performance; options range from fully virtual instances to standalone, bare-metal systems.

Performance is often an important criterion when processing the large volumes of data typical of Hadoop workloads. For non-production, development, or test workloads, it may be less of a concern, which makes running these workloads against shared storage a potentially viable option. For production workloads, public cloud environments are still viable, but IT teams need to be more deliberate about factors such as storage proximity and resource contention in order to meet performance requirements. Data location (such as cloud-based storage) and workload type (such as periodic batch processing) strongly influence the decision to deploy into the public cloud, yet many see total cost of ownership, in terms of rapid procurement and provisioning of resources and the associated opportunity costs, as the most important motivator.

Cloud TCO – What is the difference in Total Cost of Ownership (TCO)?

Calculating the TCO of a public cloud deployment can extend beyond the options for compute, storage, and data transfer and their pricing. A good starting point for narrowing down the options is the set of reference architectures your service provider publishes for the cloud environment of choice. Based on the reference-architecture options best suited to the workload or workloads, enterprises can further develop their expected usage patterns and arrive at a more accurate TCO for deploying a big data management model in the public cloud.
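One way to turn an expected usage pattern into a first-pass estimate is to multiply each usage figure by its unit price and sum the results. The sketch below does this for compute, storage, and data transfer only. The class names, field names, and prices are hypothetical, and costs outside these three (staffing, procurement lead time, opportunity cost) are deliberately not modelled.

```python
from dataclasses import dataclass


@dataclass
class UsagePattern:
    node_hours: float   # total node-hours of compute per month
    stored_tb: float    # average terabytes held in storage
    egress_tb: float    # terabytes transferred out per month


@dataclass
class PriceSheet:
    per_node_hour: float
    per_tb_month: float
    per_tb_egress: float


def monthly_cost_breakdown(usage: UsagePattern, prices: PriceSheet) -> dict:
    """Return the monthly cost per component plus the total."""
    breakdown = {
        "compute": usage.node_hours * prices.per_node_hour,
        "storage": usage.stored_tb * prices.per_tb_month,
        "transfer": usage.egress_tb * prices.per_tb_egress,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# Hypothetical usage pattern and price sheet, for illustration only.
usage = UsagePattern(node_hours=675, stored_tb=50, egress_tb=5)
prices = PriceSheet(per_node_hour=0.50, per_tb_month=20.0, per_tb_egress=90.0)
for component, cost in monthly_cost_breakdown(usage, prices).items():
    print(f"{component}: {cost:.2f}")
```

Keeping usage and prices in separate structures makes it easy to evaluate the same workload against several candidate configurations from a reference architecture.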

While deployment choice does not fundamentally change the architecture of the data management model, the additional benefit of on-demand provisioning and elasticity in the public cloud does open new possibilities for this evolution in data management.