Category Archives: Oozie

Cassandra – NoSQL database to use in conjunction with Hadoop

Some use cases feed data directly into Hadoop from their source (such as web server logs), but others feed into Hadoop from a database repository. Still others have use cases in which there is a massive output of data that needs to be stored somewhere for post-processing. One model for handling this dataset is a NoSQL database, as opposed to SQL or flat files.

Cassandra is an Apache project that is popular for its integration into the Hadoop ecosystem. It can be used with components such as Pig, Hive, and Oozie. Cassandra is often used as a replacement for HDFS and HBase since Cassandra has no master node, so eliminates a single point of failure (and need for traditional redundancy). In theory, its scalability is strictly linear; doubling the number of nodes will exactly double the number of transactions that can be processed per second. It also supports triggers; if monitoring detects that triggers are running slowly, then additional nodes can be programmatically deployed to address production performance problems.

Cassandra was first developed by Facebook. The primary benefit of its easily distributed infrastructure is the ability to handle large amount of reads and writes. The newest version (2.0) solves many of the usability problems encountered by programmers.

DataStax provides a commercially packaged version of Cassandra.

MongoDB is a good non-HBase alternative to Cassandra.

Sources:

Apache Ambari: A suite of applications/components to provision, manage, and monitor Hadoop clusters

System Admins:

Provision

  • Wizard for installing/configuring Hadoop services across many hosts

Manage

  • Start, stop, reconfigure Hadoop across many hosts

Monitor

  • Dashboard for health & status
  • Metrics via Ganglia (Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids)
  • Alerting via Nagios

Developers:

  • Integrate provisioning, mangement, and monitoring into their own application using the Ambari REST APIs

These tools are supported by Ambari:

  • HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, Sqoop

Sources:

Article on IBM DeveloperWorks

Open Source Big Data for the Impatient, Part 1: Hadoop tutorial: Hello World with Java, Pig, Hive, Flume, Fuse, Oozie, and Sqoop with Informix, DB2, and MySQL

Source:

Controlling Hadoop jobs

Oozie is a workflow scheduler for Hadoop. It provides if-then-else branching and control within jobs, and also can chain together multiple jobs.

Source: