Category Archives: cloud

IBM PureApplications for Hybrid IaaS Cloud

IBM PureApplications provides an on-premises cloud. #PureApp for SoftLayer provides off-premises cloud solutions. @Prolifics

Video includes clip from my manager @Prolifics, Mike Hastie.

Big Data as a Service provider has free developer account

The founders of Qubole built some of the big data technology at Facebook (scaled to 25 petabytes). Their new company offers a hosted Hadoop infrastructure. Their small and free accounts are interesting because they take the IT configuration out of learning Hadoop.


Summary of Teradata’s big data approach

  • Teradata Aster 6 platform
  • Includes a graph analysis engine (visualization), in addition to traditional rows/columns.
  • Enables execution of SQL across multiple NoSQL repositories
  • Integrates with multiple third parties for solutions such as analytical workflow (Alteryx) and advanced analytics algorithms (Fuzzy Logix).
  • Cloud services at comparable cost to on-premises



Using YARN to monitor resources and provision capacity in order to run other applications alongside MapReduce

Hadoop 2.0 enables clusters to grow as large as 4000 nodes within deployments that contain multiple clusters. I think that companies like Google and Facebook each run tens of thousands of nodes.

Using YARN, developers can run additional applications within the cluster: YARN monitors what each application needs, and then creates CPU/RAM containers within the cluster (and across clusters?) to run them.
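The CPU/RAM limits that YARN's NodeManagers advertise and its scheduler hands out are set in configuration. A minimal yarn-site.xml sketch (the values here are illustrative, not recommendations):

```xml
<configuration>
  <!-- Total RAM this node offers to YARN containers -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>24576</value>
  </property>
  <!-- Total virtual cores this node offers to YARN containers -->
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
  </property>
  <!-- Largest single container the scheduler will grant -->
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>
</configuration>
```

Applications then request containers sized within these limits, and YARN places them wherever capacity is free.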

There’s speculation that eventually YARN could provide a PaaS on top of Hadoop, competing with VMware’s Cloud Foundry. I suppose that while with VMware you first need to think in terms of virtualizing hardware components and an operating system, YARN jumps past that to provide an environment that’s abstracted for a specific application.


HDFS fault tolerance

HDFS is fault tolerant. Each file is broken up into blocks, and each block is written to more than one server. The replication factor is configurable, but three is the common setting. Just as with RAID, this provides fault tolerance and increases retrieval performance.
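The replication factor is a cluster-wide default set in hdfs-site.xml (it can also be overridden per file):

```xml
<configuration>
  <!-- Default number of replicas for each block; three is typical -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```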

When a block is read, its checksum indicates whether the block is valid or corrupted. If corrupted, and depending on the scope of the corruption, the block may be rewritten, or the server may be taken out of the cluster and its blocks re-replicated across the remaining servers. If the cluster is running within an elastic cloud, then either the server is healed or a new server is added.
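The checksum mechanism can be illustrated with a short sketch. HDFS stores a CRC per fixed-size chunk alongside each block (512 bytes by default); on read, the checksums are recomputed and compared, and any mismatch flags that replica as corrupt. This is an illustrative model, not HDFS's actual implementation:

```python
import zlib

CHUNK = 512  # HDFS checksums each 512-byte chunk of a block by default

def checksums(block: bytes) -> list:
    """Compute one CRC32 per chunk, stored alongside the block on write."""
    return [zlib.crc32(block[i:i + CHUNK]) for i in range(0, len(block), CHUNK)]

def is_valid(block: bytes, stored: list) -> bool:
    """On read, recompute the CRCs; any mismatch marks the replica corrupt."""
    return checksums(block) == stored

block = bytes(2048)            # a 2 KB "block" of zeros
stored = checksums(block)
print(is_valid(block, stored))                       # intact replica

flipped = block[:100] + b"\x01" + block[101:]        # one corrupted byte
print(is_valid(flipped, stored))                     # corruption detected
```

When a replica fails this check, the client simply reads one of the other replicas, and the NameNode schedules a fresh copy to restore the replication factor.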

Unlike high-end SAN hardware, which is architected to avoid failure, HDFS assumes that its low-end equipment will fail, so self-healing is built into its operating model.

Cheap Hardware

In theory, a big data cluster uses low cost commodity hardware (2 CPUs, 6-12 drives, 32 GB RAM). By clustering many cheap machines, high performance can be achieved at a low cost, along with high reliability due to decentralization.
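A quick back-of-the-envelope calculation shows how replication affects usable capacity on such hardware. The node counts and drive sizes below are hypothetical:

```python
# Hypothetical commodity cluster: usable capacity after 3x replication.
nodes = 20
drives_per_node = 12   # upper end of the 6-12 drive range above
drive_tb = 2           # assumed 2 TB commodity drives
replication = 3        # common HDFS replication factor

raw_tb = nodes * drives_per_node * drive_tb
usable_tb = raw_tb / replication

print(f"raw: {raw_tb} TB, usable: {usable_tb:.0f} TB")
# raw: 480 TB, usable: 160 TB
```

The point is that even after paying a 3x storage tax for fault tolerance, a rack of cheap machines still delivers large usable capacity, and adding nodes grows both capacity and throughput.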

There is little benefit to running Hadoop nodes in a virtualized environment (e.g., VMware), since when a node is active (batch processing) it may be pushing RAM and CPU utilization to their limits. This is in contrast to an application or database server, which has idle periods and bursts but generally runs at a constant medium utilization. Of greater benefit is a cloud implementation (e.g., Amazon EC2) in which one can scale from a few nodes to hundreds or thousands of nodes in real time as the batch cycles through its process.

Unlike a traditional n-tier architecture, Hadoop combines compute and storage on the same box. In contrast, an Oracle cluster would typically store its databases on a SAN, and the application logic would reside on yet another set of application servers, which probably do not use their inexpensive internal drives for application-specific tasks.

A Hadoop cluster is linearly scalable, up to 4000 nodes and dozens of petabytes of data.

In a traditional db cluster (such as Oracle RAC), the architecture of the cluster should be designed with knowledge of the schema and the volume (input and retrieval) of the data. With Hadoop, scalability is, at worst, linear. Using a cloud architecture, additional Hadoop nodes can be provisioned on the fly as node utilization increases or decreases.