Category Archives: apache

Hydra is a non-Hadoop database for real-time analysis of dynamic data

Hydra is not built on top of Hadoop, but functions similarly to Summingbird, Storm, and Spark.

Data can stream into it, and analytics can be run in real time, rather than only in batch.

AddThis is the company that originally developed Hydra, which has since been open sourced under the Apache license. AddThis runs six Hydra clusters, one of which comprises 156 servers and processes 3.5 billion transactions per day.

Sources:

Email indexing using Cloudera Search

This article from Cloudera offers up use cases (such as customer sentiment) and a tutorial for using Apache Flume for near-real-time indexing (as emails arrive on your mail server) or MapReduce (specifically MapReduceIndexerTool) for batch indexing of email archives. The two methods can be combined, for example if you start with real-time indexing but later decide to add another MIME header field to the index.

Cloudera Search is based on Apache Solr (which contains components like Apache Lucene, SolrCloud, Apache Tika, and Solr Cell).

Each email (including its MIME headers) is parsed with the help of Cloudera Morphlines; Flume then pushes the messages into HDFS while Solr intercepts and indexes the contents of the email fields.

Searching and viewing the results can be done using the Solr GUI or Hue’s search application.
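
Searching can also be done programmatically against the underlying Solr HTTP endpoint. Here is a minimal sketch using the pysolr client, assuming a collection named email_collection and an indexed subject field (both hypothetical names, not from the Cloudera tutorial):

    import pysolr

    # Point the client at the (hypothetical) Cloudera Search / Solr collection.
    solr = pysolr.Solr('http://search-host:8983/solr/email_collection', timeout=10)

    # Query indexed email fields, e.g. all messages whose subject mentions "outage".
    results = solr.search('subject:outage', rows=20)
    print('matches:', results.hits)
    for doc in results:
        print(doc.get('message_id'), doc.get('subject'))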

Sources:

Benefits of a single-vendor Hadoop system

The Hadoop ecosystem contains many application components. Configuring them (memory, RAM, etc.) is challenging, so getting a full distribution from a single vendor, as opposed to downloading it all from Apache sites or from multiple vendors, can be a good idea.

The application components interface with each other via Java APIs. One poorly written piece of custom Java code can result in performance problems that cascade throughout the entire system. You don’t want to write these integrations yourself. Let the vendor provide it.

source:

Hortonworks vs Cloudera

The differences between these two companies' Hadoop distributions can be summarized as:

  • Hortonworks contributes to Apache applications that others aren’t distributing, but the Hortonworks distribution is 100% open source
  • Cloudera contributes to core Apache applications, but also includes proprietary applications in its Hadoop distribution

Source:

10 Key/Value Store, Distributed, Open Source Databases

Riak

  • HTTP API
  • Master-less, so remains operational even if multiple nodes fail
  • Near linear scalability
  • Same architecture for both large and small clusters
  • Key/value model, flat namespace, can store anything
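
Because Riak exposes a plain HTTP API, a key can be written and read back with nothing more than an HTTP client. A minimal sketch using Python's requests library, assuming a local node on the default HTTP port (8098) and a hypothetical bucket and key:

    import requests

    # Hypothetical bucket "users" and key "alice".
    url = 'http://localhost:8098/buckets/users/keys/alice'

    # Store an object (Riak is content-agnostic; it just stores bytes).
    requests.put(url, data='{"name": "alice"}',
                 headers={'Content-Type': 'application/json'})

    # Read it back.
    resp = requests.get(url)
    print(resp.status_code, resp.text)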

Redis

  • Key/value. Can store data types such as sets, sorted sets, lists, and hashes, and do operations on them such as set intersection and incrementing a value in a hash.
  • In-memory dataset
  • Easy to set up; master/slave replication
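
A minimal sketch of those set and hash operations with the redis-py client, assuming a local Redis server and hypothetical key names:

    import redis

    r = redis.Redis(host='localhost', port=6379)

    # Sets and set intersection.
    r.sadd('likes:alice', 'hadoop', 'kafka', 'solr')
    r.sadd('likes:bob', 'kafka', 'cassandra')
    print(r.sinter('likes:alice', 'likes:bob'))   # -> {b'kafka'}

    # Hashes and incrementing a value inside a hash.
    r.hset('page:home', 'views', 0)
    r.hincrby('page:home', 'views', 1)
    print(r.hgetall('page:home'))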

Hibari

  • Very simple data model with five attributes: key, value, timestamp, expiry date, and metadata flags
  • Chain replication across nodes that are geographically dispersed. No single point of failure
  • Excellent performance for large batches (~200k) of read/write operations
  • Runs on commodity hardware or blades. Does not require SAN

Hypertable

  • High performance, massively scalable, modeled after Google’s Bigtable
  • Runs on top of a distributed file system such as the Apache Hadoop DFS (HDFS), GlusterFS, or the Kosmos File System
  • Data model is a traditional (but huge) table that is physically stored in sorted order of the primary key

Voldemort

  • High scalability due to allowing only very simple key/value data access.
  • Used by LinkedIn
  • Not an object or a relational database. Just a big, distributed, fault-tolerant, persistent hash table
  • Includes in-memory caching, so separate caching tier isn’t required

MemcacheDB

  • High performance persistent storage that’s compatible with Memcache protocol
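
Because MemcacheDB speaks the memcache protocol, any memcache client should work against it. A minimal sketch using the pymemcache client, assuming a local MemcacheDB instance; the port (21201 is the commonly cited default) is an assumption:

    from pymemcache.client.base import Client

    # Any memcache-protocol client can talk to MemcacheDB; only the port differs.
    client = Client(('localhost', 21201))

    client.set('user:42', 'alice')   # persisted by MemcacheDB, unlike plain memcached
    print(client.get('user:42'))     # -> b'alice'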

Tarantool

  • NoSQL database with messaging server
  • All data maintained in RAM. Persistence via a write ahead log.
  • Asynchronous replication and hot standby
  • Supports stored procedures
  • Data model: tuples (unique key plus any number of other fields); spaces (multiple tuples)

Apache Cassandra

  • Can use massive clusters of commodity servers with no single point of failure. Can be deployed across multiple data centers.
  • Was used by Facebook for Inbox Search until 2010
  • Read/write scales linearly with number of nodes
  • Data replicated across multiple nodes
  • Supports MapReduce, Pig, and Hive
  • Has SQL-like CQL, providing a hybrid between a key/value and a tabular database
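
A minimal sketch of CQL from Python using the DataStax cassandra-driver, assuming a local node and a hypothetical keyspace and table:

    import uuid
    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect()

    # Hypothetical keyspace and table; CQL looks tabular, but storage is key/value underneath.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.execute("CREATE TABLE IF NOT EXISTS demo.events (id uuid PRIMARY KEY, payload text)")

    session.execute("INSERT INTO demo.events (id, payload) VALUES (%s, %s)",
                    (uuid.uuid4(), 'hello'))
    for row in session.execute("SELECT id, payload FROM demo.events LIMIT 5"):
        print(row.id, row.payload)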

HyperDex

  • NoSQL key/value that provides lower latency and higher throughput than some alternatives
  • Replicates data to multiple nodes
  • Very easy to administer and maintain
  • Data model: key plus zero or more attributes

Lightcloud

  • Great performance even on small clusters with millions of keys
  • Nodes replicated via master-to-master replication. Hot backups and restores
  • Very small client footprint
  • Built on top of Tokyo Tyrant

Sources:

Apache Samza: LinkedIn’s Real-time Stream Processing Framework

  • Samza is a massively scalable framework for distributed stream transport and limited processing
  • Samza uses Yarn and Apache Kafka (publish/subscribe messaging able to handle hundreds of megabytes of reads and writes per second)
  • LinkedIn utilizes Samza to publish 26+ billion unique messages per day to hundreds of message feeds that are picked up by thousands of automated subscribers (some real time, others batch)
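
Samza itself is a Java framework that runs on Yarn, but the core pattern a Samza task implements is a Kafka consume-transform-produce loop. A purely illustrative sketch of that pattern (not Samza's actual API) using the kafka-python client, with hypothetical topic names:

    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer('page-views',                 # hypothetical input topic
                             bootstrap_servers='localhost:9092',
                             group_id='view-enricher')
    producer = KafkaProducer(bootstrap_servers='localhost:9092')

    # Consume each message, transform it, and publish the result to an output topic.
    for msg in consumer:
        enriched = msg.value.upper()                       # stand-in for real stream logic
        producer.send('page-views-enriched', enriched)     # hypothetical output topic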

Sources:

Hortonworks / Apache Tez

Hortonworks / Apache Tez provides an alternative to MapReduce in order to process near-real-time jobs at petabyte scale. The Hortonworks Stinger project utilizes Tez to increase the speed of Hive and Pig by an order (or multiple orders) of magnitude.
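
For Hive specifically, switching the execution engine to Tez is a configuration setting. A minimal sketch from Python using the PyHive client against HiveServer2 (host, port, username, and table name are assumptions); the same SET command works from the Hive CLI or Beeline:

    from pyhive import hive

    conn = hive.Connection(host='hiveserver2-host', port=10000, username='etl')  # hypothetical
    cur = conn.cursor()

    # Run this query on Tez instead of classic MapReduce.
    cur.execute('SET hive.execution.engine=tez')
    cur.execute('SELECT page, COUNT(*) FROM page_views GROUP BY page')           # hypothetical table
    for row in cur.fetchall():
        print(row)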

Tez is based on a multiple-stage dataflow architecture (pre-processor, sampler, partition, aggregate), in contrast to the traditional Map and Reduce.

Tez assumes the use of Yarn for resource acquisition, so it cannot be run in legacy environments. It also assumes complex user-defined logic that eliminates duplicate work in order to increase performance. Legacy Hadoop assumes duplicate work, made less painful by the massive scale of the cluster and the benefit of redundancy.

Tez may also run multiple tasks within a single Yarn container, which reduces the overhead of additional containers. However, this may decrease resource utilization at very large scale, since using many Yarn containers helps allocate every last available hardware resource, as opposed to Tez squeezing as much as possible into fewer containers.

Source:

Cassandra – NoSQL database to use in conjunction with Hadoop

Some use cases feed data into Hadoop directly from its source (such as web server logs), while others feed Hadoop from a database repository. Still others produce a massive amount of output data that needs to be stored somewhere for post-processing. One model for handling this data is a NoSQL database, as opposed to a SQL database or flat files.

Cassandra is an Apache project that is popular for its integration into the Hadoop ecosystem. It can be used with components such as Pig, Hive, and Oozie. Cassandra is often used as a replacement for HDFS and HBase since Cassandra has no master node, eliminating a single point of failure (and the need for traditional redundancy). In theory, its scalability is strictly linear: doubling the number of nodes will exactly double the number of transactions that can be processed per second. It also supports triggers; if monitoring detects that triggers are running slowly, additional nodes can be deployed programmatically to address production performance problems.

Cassandra was first developed by Facebook. The primary benefit of its easily distributed infrastructure is the ability to handle a large volume of reads and writes. The newest version (2.0) solves many of the usability problems encountered by programmers.

DataStax provides a commercially packaged version of Cassandra.

MongoDB is a good non-HBase alternative to Cassandra.

Sources:

Apache Ambari: A suite of applications/components to provision, manage, and monitor Hadoop clusters

System Admins:

Provision

  • Wizard for installing/configuring Hadoop services across many hosts

Manage

  • Start, stop, reconfigure Hadoop across many hosts

Monitor

  • Dashboard for health & status
  • Metrics via Ganglia (Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids)
  • Alerting via Nagios

Developers:

  • Integrate provisioning, management, and monitoring into their own applications using the Ambari REST APIs
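
A minimal sketch of that REST integration with Python's requests library, assuming an Ambari server on port 8080, a cluster named "prod", and admin credentials (all hypothetical); exact endpoints and payloads can vary by Ambari version:

    import requests

    base = 'http://ambari-host:8080/api/v1/clusters/prod'   # hypothetical server and cluster
    auth = ('admin', 'admin')
    headers = {'X-Requested-By': 'ambari'}                   # header required by Ambari's API

    # Monitor: list services and their current state.
    services = requests.get(base + '/services', auth=auth, headers=headers).json()
    print(services)

    # Manage: ask Ambari to stop a service (state INSTALLED means "stopped").
    body = '{"RequestInfo": {"context": "Stop HDFS via REST"}, "Body": {"ServiceInfo": {"state": "INSTALLED"}}}'
    requests.put(base + '/services/HDFS', data=body, auth=auth, headers=headers)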

These tools are supported by Ambari:

  • HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, Sqoop

Sources:

Use Cases for Hadoop

The Apache “Powered by Hadoop” page lists many companies that use Hadoop. Some list only the company name. Others add a sentence or two about what they’re using Hadoop for. And some, like LinkedIn, list the specs of the hardware they have Hadoop running on. There’s also a link to a great article about how the NYTimes used Hadoop (in 2007!!!) on Amazon’s AWS cloud to generate PDFs of 11 million articles in 24 hours using 100 machines. One of the things I find interesting about the NYTimes use case is that they used Hadoop for a one-time batch process. A lot of what we read about Hadoop assumes that the use case is an ongoing, maybe multi-year, application.

Source: