Category Archives: MapReduce

Hadoop job scheduling that takes network bandwidth into account

A research paper from Cornell University discusses scheduling Hadoop jobs based upon an analysis of available network bandwidth. Typically a Hadoop cluster only considers server node availability when scheduling. Software Defined Networking (SDN) is assumed. SDN is a new front in virtualization technology and critical for dynamic scaling of clouds.

Source:

Machine learning on its way from Cloudera?

In 2013 Cloudera acquired a company called Myrrix, which has morphed into project (not yet a product) called Oryx. The system still uses MapReduce, which is not optimal. Before is becomes a product it’ll be rewritten using Spark.

Oryx will enable construction of machine learning models that can process data in real time. Possible use cases are spam filters and recommendation engines (which seems to be its sweet spot).

This competes with Apache Mahout, which processes in batch mode only.

Source:

WibiEnterprise bridges between Hadoop and the application layer

The core features of WibiEnterprise 3.0 are frameworks that enable:

  • defining schemas in realtime
  • layer on top of MapReduce
  • model lifecycle (machine learning, batch training, development, scoring)
  • ad hoc queries
  • RESTful interfaces

Source:

Databricks to commercialize Spark and Shark in-memory processing

Shark utilizes in-memory SQL queries for complex analytics, and is Apache Hive compatible. The name “Shark” is supposed to be short hand for “Hive on Spark”. This seems to be a competitor to Cloudera Impala or the Hortonworks implementation of Hive.

Apache Spark utilizes APIs (Python, Scala, Java) for in-memory processing with very fast reads and writes, claiming to be 100x faster than disk-based MapReduce. Spark is the engine behind Shark. Spark can be considered as an alternative to MapReduce, not an alternative to Hadoop.

Scala is an interesting language being used by companies such as Twitter as both higher performance and easier to write than Java. Some companies that had originally developed using Rails or C++ are migrating to Scala rather than to Java.

Source:

Email indexing using Cloudera Search

This article from Cloudera offers up use cases (such as customer sentiment) and a tutorial for using Apache Flume for near-real-time indexing (as emails arrive on your mail server) or MapReduce (actually MapReduceIndexerTool) for batch indexing of email archives. The two methods can be combined if you decide to do real-time, but later decide to add another MIME header field into the index.

Cloudera Search is based on Apache Solr (which contains components like Apache Lucene, SolrCloud, Apache Tika, and Solr Cell).

The email (including the MIME header) is parsed (with the help of Cloudera Morphlines), then uses Flume to push the messages into HDFS, as Solr intercepts and indexes the contents of the email fields.

Searching and viewing the results can be done using the Solr GUI or Hue’s search application.

Sources:

Yarn provides greater scalability than MapReduce

MapReduce could have resource utilization problems because an arbitrary process could have allocated all map slots, while some reduce slots are empty (and vise-versa). Yarn (Yet Another Resource Negotiator) splits the JobTracker into a global ResourceManager (RM) and a per-application ApplicationMaster (AM) which works with the per-machine NodeManagers to execute and monitor tasks.  The ResourceManager has a Scheduler which only schedules (does not monitor).

The ApplicationMaster is far more scalable in 2.0 than 1.0. HortonWorks has successful simulated a 10k node cluster. This is possible due to the ApplicationMaster not being global to the entire cluster but rather has an instance per application so is no longer a bottleneck through which all applications must pass.

The ResourceManager is also more scalable in 2.0 since it’s scope is reduced to scheduling and no longer is responsible for fault tolerance of the entire cluster.

Source:

Syncsort’s DMX-h product moves data into Hadoop from mainframes

DMX-h snaps into MapReduce, enabling copying of data from Cobol copybooks, although there’s not currently an integration down into IMS and VSAM. However, Syncort also has a product that enables data to be moved from IMS and VSAM into DB2/z, and applications think that they’re still accessing IMS and VSAM. This could eventually enable integration between DB2/z and Hadoop.

Source:

10 Key/Value Store, Distributed, Open Source Databases

Riak

  • HTTP API
  • Master-less, so remains operational even if multiple nodes fail
  • Near linear scalability
  • Architecture same of both large and small clusters
  • Key/value model, flat namespace, can store anything

Redis

  • Key/value. Can store data types such as sets, sorted lists, hashes and do operations on them such as set intersection and incrementing the value in a hash.
  • In-memory dataset
  • Easy to setup, master/slave replication

Hibari

  • Very simple data model with 5 attributes: keys, values, timestamps, expiry date, flags for metadata
  • Chain replication across nodes that are geographically dispersed. Not single points of failure
  • Excellent performance for large batches (~200k) read/write operations
  • Runs on commodity hardware or blades. Does not require SAN

Hypertable

  • High performance, massively scalable, modeled after Google’s Bigtable
  • Runs on top of a distributed file system such as Apache Hadoop DFS, GlusterDS, or Kosmos File System
  • Data model is a traditional, but huge table, that is physically stored in sort order of the primary key

Voldemort

  • High scalability due to allowing only very simple key/value data access.
  • Used by LinkedIn
  • Not an object or a relational database. Just a big, distributed, fault-tolerant, persistent hash table
  • Includes in-memory caching, so separate caching tier isn’t required

MemcacheDB

  • High performance persistent storage that’s compatible with Memcache protocol

Tarantool

  • NoSQL database with messaging server
  • All data maintained in RAM. Persistence via a write ahead log.
  • Asynchronous replication and hot standby
  • Supports stored procedures
  • Data model: tuples (unique key plus any number of other fields); spaces (multiple tuples)

Apache Cassandra

  • Can use massive cluster of commodity servers with no single point of failure. Can be deploy across multiple data centers.
  • Was used by Facebook for Inbox Search until 2010
  • Read/write scales linearly with number of nodes
  • Data replicated across multiple nodes
  • Supports MapReduce, Pig, and Hive
  • Has SQL-like CQL providing for a hybrid between key/value and tabular database

HyperDex

  • NoSQL key/value that provides lower latency and higher throughput than some alternatives
  • Replicates data to multiple nodes
  • Very easy to administer and maintain
  • Data model: key plus zero or more attributes

Lightcloud

  • Great performance even on small clusters with millions of keys
  • Nodes replicated via master-to-master replication.  Hot backups and restores
  • Very small client footprint
  • Built on top of Tokyo Tyrant

Sources:

Cloudera Search Engine

Cloudera has announced a realtime search engine running on top of HBase and HDFS, enabling natural language keyword searches.

Indices are stored in HDFS and indexing takes place in batches using MapReduce. Realtime indexing happens via Flume and the Lily HBase indexer.

Source:

How to Plan and Configure YARN

Good (and very short) article about configuring Yarn, specifically allocating CPUs and RAM for the OS, JVMs, and MapReduce

Source: