Category Archives: Zookeeper

What benefit does Yarn bring to the existing MapReduce?

Within classic MapReduce is the Job Tracker component. Yarn splits the Job Tracker into two separate components: the Resource Manager (aka RM), which allocates cluster resources such as CPU and RAM, and the Node Manager (aka NM), which operates at the level of a single node/machine. A per-application Application Master (aka AM) negotiates resources from the Resource Manager and works with the Node Managers to execute tasks. The Job Tracker is already an ancient architecture: five years old!!

Yarn is sometimes referred to as MapReduce 2.0 or MRv2.
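To make the division of labor concrete, here's a minimal sketch of the client side in Hadoop 2.x: the client only hands the Resource Manager a description of the container that will host the Application Master; the AM, once launched, does the actual negotiating. The command and resource sizes below are placeholders.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitSketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("sketch");

    // The only container the client describes is the one that will host the
    // Application Master; everything else is negotiated later by the AM.
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList("/bin/date")); // placeholder AM command
    appContext.setAMContainerSpec(amContainer);
    appContext.setResource(Resource.newInstance(512, 1)); // 512 MB, 1 vcore for the AM

    yarnClient.submitApplication(appContext); // the RM schedules it from here
  }
}
```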

The Resource Manager supports hierarchical application queues to guarantee allocation ratios of cluster resources. However, it does not handle recovery from application or hardware failures, and it does not monitor; it only schedules. Scheduling methods include FIFO (the default) and Capacity; Fair is not currently supported.
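For illustration, here's roughly what a hierarchical queue setup for the Capacity scheduler looks like. These properties normally live in capacity-scheduler.xml; I'm showing them as programmatic Configuration calls just to keep the examples in one language, and the queue names and percentages are made up.

```java
import org.apache.hadoop.conf.Configuration;

public class QueueConfigSketch {
  public static Configuration capacityQueues() {
    Configuration conf = new Configuration();
    // Two top-level queues under root...
    conf.set("yarn.scheduler.capacity.root.queues", "prod,dev");
    // ...guaranteed 70% and 30% of cluster resources respectively.
    conf.set("yarn.scheduler.capacity.root.prod.capacity", "70");
    conf.set("yarn.scheduler.capacity.root.dev.capacity", "30");
    // The dev queue is itself split into children (hence "hierarchical").
    conf.set("yarn.scheduler.capacity.root.dev.queues", "eng,science");
    conf.set("yarn.scheduler.capacity.root.dev.eng.capacity", "50");
    conf.set("yarn.scheduler.capacity.root.dev.science.capacity", "50");
    return conf;
  }
}
```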

ZooKeeper monitors the Resource Manager in order to switch to a secondary if the Resource Manager itself fails. In a failover scenario, running applications are restarted and the queue continues. Preservation of state within currently running applications is handled by checkpoints that the Application Master stores in HDFS.
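The general pattern behind this kind of failover is ZooKeeper-based leader election: each candidate tries to create the same ephemeral znode, whoever succeeds is active, and the standby watches that znode so it can take over when the active's session dies. A bare-bones sketch using the raw ZooKeeper client; the znode path and connect string are made up.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class LeaderSketch implements Watcher {
  private static final String LOCK = "/rm-active"; // hypothetical znode path
  private final ZooKeeper zk;

  public LeaderSketch() throws Exception {
    zk = new ZooKeeper("zk1:2181", 15000, this); // placeholder connect string
  }

  void tryToLead() throws Exception {
    try {
      // Ephemeral znodes vanish automatically when the session that
      // created them dies -- that's the failure detector.
      zk.create(LOCK, "me".getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
      System.out.println("active");
    } catch (KeeperException.NodeExistsException e) {
      zk.exists(LOCK, true); // lost the race: watch the leader's znode instead
      System.out.println("standby");
    }
  }

  @Override
  public void process(WatchedEvent event) {
    if (event.getType() == Event.EventType.NodeDeleted) {
      try {
        tryToLead(); // the active died; try to take over
      } catch (Exception ignored) {
      }
    }
  }
}
```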

Rather than having containers dedicated to executing Map jobs and Reduce jobs, Yarn provides containers for more generic jobs, which lets developers write other kinds of applications that run on the cluster.
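That generality shows up directly in the API: an Application Master asks the Resource Manager for containers with nothing map- or reduce-specific about them, then tells a Node Manager to run an arbitrary command in each one. A rough sketch (it would only actually run inside a container the RM launched as an AM; the command and sizes are placeholders):

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class GenericContainerSketch {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();

    // Register this process with the Resource Manager as an Application Master.
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(conf);
    rmClient.start();
    rmClient.registerApplicationMaster("", 0, "");

    // Ask for one container: 1 GB of RAM, 1 virtual core. Nothing here says
    // "map" or "reduce" -- it's just a slice of CPU and memory.
    Resource capability = Resource.newInstance(1024, 1);
    rmClient.addContainerRequest(new ContainerRequest(capability, null, null, Priority.newInstance(0)));

    // A real AM polls allocate() in a heartbeat loop; one call keeps the sketch short.
    NMClient nmClient = NMClient.createNMClient();
    nmClient.init(conf);
    nmClient.start();
    for (Container container : rmClient.allocate(0).getAllocatedContainers()) {
      ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
      ctx.setCommands(Collections.singletonList("sleep 30")); // any command at all
      nmClient.startContainer(container, ctx);
    }

    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
  }
}
```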

It’s unclear whether Yarn will make the system run faster or slower. Generalization and modularization usually come at a cost. However, Yarn allows for more complete utilization of CPU and RAM, so in theory it can squeeze every last bit of capacity out of a cluster, whereas the fixed-size slots in MapReduce 1.0 could leave some resources idle. Yarn does not manage I/O, which is typically a bigger bottleneck than RAM. There’s also no management of network bandwidth in Yarn. (Note to self, got to figure this out: I saw another article that says that Yarn does manage CPU, disk, and network, yet didn’t mention RAM.)
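For what it's worth, the container resource model in the Hadoop 2.x API has only two dimensions, memory and virtual cores, which is consistent with the "CPU and RAM but not I/O or network" reading:

```java
import org.apache.hadoop.yarn.api.records.Resource;

public class ResourceModelSketch {
  public static void main(String[] args) {
    Resource r = Resource.newInstance(2048, 2); // 2 GB of RAM, 2 virtual cores
    System.out.println(r.getMemory() + " MB, " + r.getVirtualCores() + " vcores");
    // There is no disk-I/O or network-bandwidth field to set.
  }
}
```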

Another benefit of a more modularized architecture is that it makes the system easier to maintain. Any update to MapReduce 1.0 requires the replacement of a pretty big chunk of software. Being able to run multiple versions of MapReduce within a cluster of thousands of nodes is important; otherwise, upgrades would require significant downtime.


Using Yarn to monitor resources and provision capacity in order to run other applications alongside MapReduce

Hadoop 2.0 enables clusters to grow as large as 4000 nodes within deployments that contain multiple clusters. I think that companies like Google and Facebook each run tens of thousands of nodes.

Using Yarn, developers can run additional applications within the cluster by monitoring what the applications need, and then creating CPU/RAM containers within the cluster (and across clusters?) to run them.

There’s speculation that eventually Yarn could provide a PaaS on top of Hadoop in order to compete with VMware’s Cloud Foundry. I suppose that while with VMware you first need to think in terms of virtualizing hardware components and an operating system, Yarn jumps past that to provide an environment that’s abstracted for a specific application.


Apache Ambari: A suite of applications/components to provision, manage, and monitor Hadoop clusters

System Admins:

Provision

  • Wizard for installing/configuring Hadoop services across many hosts

Manage

  • Start, stop, reconfigure Hadoop across many hosts

Monitor

  • Dashboard for health & status
  • Metrics via Ganglia (Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids)
  • Alerting via Nagios

Developers:

  • Integrate provisioning, management, and monitoring into their own applications using the Ambari REST APIs (see the sketch below)
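Something like this is all it takes to hit the Ambari REST API from Java; the endpoint shape (/api/v1/clusters/...) follows Ambari's API, while the host, port, cluster name, and credentials are placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class AmbariSketch {
  public static void main(String[] args) throws Exception {
    // Host, port, cluster name, and credentials are all placeholders.
    URL url = new URL("http://ambari-host:8080/api/v1/clusters/mycluster/services");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());
    conn.setRequestProperty("Authorization", "Basic " + auth);
    conn.setRequestProperty("X-Requested-By", "ambari"); // Ambari requires this header on write calls
    try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // JSON listing of the cluster's services
      }
    }
  }
}
```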

These tools are supported by Ambari:

  • HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, Sqoop


Synchronizing nodes within a Hadoop cluster

ZooKeeper is a backend service for managing synchronization within the Hadoop cluster. I saw in one article that there are two kinds of people who mess around with ZooKeeper: contributors to the Apache project, and people who are doing something that they shouldn’t be doing.
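To make "synchronization" concrete: processes coordinate by reading and writing small znodes in ZooKeeper's replicated namespace, and every client sees the same values in the same order. A trivial sketch (the path and connect string are made up):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeSketch {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("zk1:2181", 15000, event -> {}); // placeholder connect string
    // One process publishes a value...
    if (zk.exists("/active-master", false) == null) {
      zk.create("/active-master", "node-07".getBytes(),
          Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }
    // ...and every other process reads the same value, in the same order.
    byte[] data = zk.getData("/active-master", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}
```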
