Category Archives: AWS

Xplenty offers HaaS in AWS

Xplenty offers Hadoop as a Service for Amazon Web Services in all AWS global regions. The HaaS offering promises a “coding-free design environment,” in addition, of course, to the hardware-free environment AWS itself provides.


HaaS Provider Qubole Now Runs on Google Compute Engine (GCE)

I’m starting to see applications ported from AWS to GCE, but I’m not sure what justifies running production systems on GCE. Maybe price?


Spotify Embraces Hortonworks, Dumps Cloudera

Cloudera had appeared to be the de facto standard among Hadoop distributions, but Hortonworks has scored big in this deal. Spotify will be moving its 690-node Cloudera cluster to a Hortonworks cluster of undisclosed size. Apparently it’s the new Hive implementation that makes Hortonworks so attractive.

When Spotify launched in 2008, it had a 30-node cluster hosted on AWS, then switched to an on-premises 60-node cluster that grew to 690 nodes. The cluster currently holds 4 petabytes of data and grows by 200 gigabytes per day.

Spotify has a 12-person Hadoop team and uses a Python (not Java) framework for batch processing.
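To illustrate the Python-instead-of-Java approach, here’s a minimal sketch of the classic word-count job written as Hadoop Streaming-style mapper and reducer functions. This is not Spotify’s actual framework; the function names and the local driver are purely illustrative.

```python
import itertools
import operator

def mapper(lines):
    """Emit (word, 1) pairs, as a streaming mapper would write to stdout."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    """Sum counts per word; assumes pairs arrive sorted by key,
    which the Hadoop shuffle phase guarantees."""
    for word, group in itertools.groupby(pairs, key=operator.itemgetter(0)):
        yield (word, sum(count for _, count in group))

def run_local(lines):
    """Run the whole pipeline in-process (handy for testing);
    sorted() stands in for Hadoop's shuffle/sort."""
    return dict(reducer(sorted(mapper(lines))))
```

On a real cluster, `mapper` and `reducer` would run as separate processes reading stdin and writing stdout via Hadoop Streaming; `run_local` just wires them together so the logic can be tested without a cluster.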


Use Cases for Hadoop

The Apache “Powered by Hadoop” page maintains a long list of companies that use Hadoop. Some list only the company name. Others have a sentence or two about what they’re using Hadoop for. And some, like LinkedIn, list the specs of the hardware they run Hadoop on. There’s also a link to a great article about how the NYTimes used Hadoop (in 2007!!!) running on AWS to generate PDFs of 11 million articles in 24 hours on 100 machines. One thing I find interesting about the NYTimes use case is that they used Hadoop for a one-time batch process. A lot of what we read about Hadoop assumes the use case is an ongoing, maybe multi-year application.


Cheap Hardware

In theory, a big data cluster uses low-cost commodity hardware (2 CPUs, 6–12 drives, 32 GB RAM). By clustering many cheap machines, high performance can be achieved at low cost, along with high reliability through decentralization.

There is little benefit to running Hadoop nodes in a virtualized environment (e.g., VMware), since an active node (during batch processing) may be pushing RAM and CPU utilization to their limits. This is in contrast to an application or database server, which has idle periods and bursts but generally runs at some constant medium utilization. What is of greater benefit is a cloud implementation (e.g., Amazon EC2), in which one can scale from a few nodes to hundreds or thousands in real time as the batch cycles through its process.

Unlike a traditional n-tier architecture, Hadoop combines compute and storage on the same box. In contrast, an Oracle cluster would typically store its databases on a SAN, with application logic residing on yet another set of application servers that probably do not use their inexpensive internal drives for application-specific tasks.

A Hadoop cluster is linearly scalable, up to 4000 nodes and dozens of petabytes of data.

In a traditional database cluster (such as Oracle RAC), the architecture should be designed with knowledge of the schema and the volume of data being stored and retrieved. With Hadoop, scaling is, at worst, linear: doubling the data means at most doubling the nodes. Using a cloud architecture, additional Hadoop nodes can be provisioned on the fly as node utilization increases or decreases.
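The linear-scaling argument can be made concrete with a back-of-the-envelope capacity calculation, using the Spotify numbers quoted earlier (4 PB today, growing 200 GB/day) as inputs. The per-node disk size, replication factor, and headroom figure below are my own assumptions, not figures from any of these posts.

```python
import math

REPLICATION = 3         # HDFS's default replication factor
NODE_RAW_TB = 6 * 2.0   # assumed: 6 drives x 2 TB per commodity node
HEADROOM = 0.75         # assumed: keep ~25% free for shuffle/temp space

def nodes_needed(data_tb, days_out=0, growth_tb_per_day=0.2):
    """Nodes required to hold the dataset `days_out` days from now.

    Storage need grows linearly with data, so node count does too --
    this is the 'at worst, linear' scaling described above.
    """
    logical_tb = data_tb + days_out * growth_tb_per_day
    raw_tb = logical_tb * REPLICATION            # replicas multiply raw need
    usable_per_node = NODE_RAW_TB * HEADROOM
    return math.ceil(raw_tb / usable_per_node)
```

Under these assumptions, 4 PB (4,000 TB) needs `nodes_needed(4000)` → 1334 nodes, and a year of 200 GB/day growth (`nodes_needed(4000, days_out=365)` → 1358) adds only a couple dozen more; that incremental, predictable growth is exactly what makes on-the-fly cloud provisioning attractive.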