Category Archives: NoSQL

Interesting use case about migrating away from SQL to Hadoop and NoSQL

Paytronix analyzes data from 8,000 restaurants, which adds up to a few tens of terabytes. That is not large in terms of volume, but there are a lot of data fields and potential reports. They migrated from MS SQL Server and constantly evolving ETL jobs to Hadoop and MongoDB with a lot of success.

Summary of Teradata’s big data approach

  • Teradata Aster 6 platform
  • Includes graph analysis engine (visualization), in addition to traditional rows/columns.
  • Enables execution of SQL across multiple NoSQL repositories
  • Integrates with multiple 3rd parties for solutions such as analytical workflow (Alteryx), advanced analytics algorithms (Fuzzy Logix).
  • Cloud services at comparable cost to on-premises

Hadoop and MongoDB Use Cases

  1. Batch aggregation of data processed in Hadoop, and then stored in MongoDB for later ad-hoc analysis
  2. Staging area for batch loads into Hadoop
  3. Using MapReduce for complex ETL migrations
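
Use case 1 can be sketched in plain Python: a map step emits (key, value) pairs, a reduce step sums them, and the aggregates become documents of the kind that might be stored in MongoDB for later ad-hoc queries. The record fields (`store`, `amount`), the function names, and the in-memory list standing in for a document collection are all assumptions for illustration; no Hadoop or MongoDB API is used here.

```python
from collections import defaultdict

# Hypothetical raw records, e.g. lines parsed from a batch input file.
RECORDS = [
    {"store": "A", "amount": 12.5},
    {"store": "B", "amount": 7.0},
    {"store": "A", "amount": 3.5},
]

def map_record(record):
    """Map step: emit one (key, value) pair per input record."""
    yield record["store"], record["amount"]

def reduce_pairs(pairs):
    """Reduce step: sum the values and count the records for each key."""
    totals = defaultdict(lambda: {"total": 0.0, "count": 0})
    for key, value in pairs:
        totals[key]["total"] += value
        totals[key]["count"] += 1
    return totals

def aggregate(records):
    """Run map then reduce and shape the result as one document per key."""
    pairs = (pair for record in records for pair in map_record(record))
    totals = reduce_pairs(pairs)
    return [
        {"_id": key, "total": agg["total"], "count": agg["count"]}
        for key, agg in sorted(totals.items())
    ]

# Stand-in for a document collection; a real job would write these out.
collection = aggregate(RECORDS)

# Ad-hoc query against the stored aggregates.
big_stores = [doc["_id"] for doc in collection if doc["total"] > 10]
print(big_stores)
```

The point of the split is that the expensive pass over raw data happens once in batch, while later questions only touch the much smaller set of aggregate documents.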

10 Distributed, Open Source Key/Value Store Databases

Riak

  • HTTP API
  • Master-less, so remains operational even if multiple nodes fail
  • Near linear scalability
  • Architecture is the same for both large and small clusters
  • Key/value model, flat namespace, can store anything

Redis

  • Key/value. Can store data types such as lists, sets, sorted sets, and hashes, and operate on them, for example intersecting sets or incrementing a value in a hash.
  • In-memory dataset
  • Easy to set up; master/slave replication
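
The two operations mentioned above correspond to the Redis commands SINTER and HINCRBY. The sketch below only imitates their semantics with Python sets and dicts so that it runs without a server; the helper functions and the key names (`tags:post1`, `stats:post1`) are hypothetical stand-ins, not a Redis client API.

```python
# In-memory stand-in for a Redis keyspace: key -> Python value.
store = {}

def sadd(key, *members):
    """Add members to the set at key, creating it if needed."""
    store.setdefault(key, set()).update(members)

def sinter(*keys):
    """Return the intersection of the sets at the given keys."""
    sets = [store.get(key, set()) for key in keys]
    return set.intersection(*sets) if sets else set()

def hincrby(key, field, amount=1):
    """Increment an integer field inside the hash at key; return new value."""
    hash_value = store.setdefault(key, {})
    hash_value[field] = hash_value.get(field, 0) + amount
    return hash_value[field]

sadd("tags:post1", "nosql", "redis", "cache")
sadd("tags:post2", "redis", "cache", "queue")
common = sinter("tags:post1", "tags:post2")

hincrby("stats:post1", "views")
views = hincrby("stats:post1", "views", 4)
print(sorted(common), views)
```

In Redis itself these operations run on the server, so the client does not need to fetch a whole set or hash just to intersect or update it.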

Hibari

  • Very simple data model with 5 attributes: keys, values, timestamps, expiry date, flags for metadata
  • Chain replication across nodes that are geographically dispersed. No single point of failure
  • Excellent performance for large batches (~200k) read/write operations
  • Runs on commodity hardware or blades. Does not require SAN
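
Chain replication, mentioned in the list above, can be illustrated with a toy model: a write enters at the head of the chain and is passed node by node to the tail, and reads are served by the tail, so a read never sees a write that has not reached every replica. This is a minimal sketch of the general technique, not Hibari's implementation; the class and method names are assumptions.

```python
class ChainNode:
    """One replica in the chain, holding its own copy of the data."""
    def __init__(self, name):
        self.name = name
        self.data = {}

class Chain:
    def __init__(self, names):
        self.nodes = [ChainNode(name) for name in names]

    def write(self, key, value):
        """Apply the write at the head, then pass it down to the tail."""
        for node in self.nodes:
            node.data[key] = value
        # The write counts as committed once the tail has applied it.
        return self.nodes[-1].name

    def read(self, key):
        """Serve reads from the tail, which only holds committed writes."""
        return self.nodes[-1].data.get(key)

    def fail(self, name):
        """Drop a failed node; the remaining nodes keep the chain alive."""
        self.nodes = [node for node in self.nodes if node.name != name]

chain = Chain(["a", "b", "c"])
committed_by = chain.write("k", "v1")
chain.fail("c")                 # the tail fails; "b" becomes the new tail
value_after_failure = chain.read("k")
print(committed_by, value_after_failure)
```

Because every surviving node already holds each committed write, losing one node shortens the chain without losing data.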

Hypertable

  • High performance, massively scalable, modeled after Google’s Bigtable
  • Runs on top of a distributed file system such as Apache Hadoop's HDFS, GlusterFS, or Kosmos File System
  • Data model is a traditional but very large table that is physically stored in primary-key sort order
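
Storing rows in primary-key order is what makes range scans cheap: all keys sharing a prefix sit next to each other. The sketch below shows the idea with a sorted key list and binary search. The `SortedTable` class and the `user#NNN` key format are hypothetical and are not Hypertable's API.

```python
import bisect

class SortedTable:
    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> row data

    def put(self, key, row):
        """Insert or overwrite a row, keeping the key list sorted."""
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = row

    def scan(self, start, end):
        """Return (key, row) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(key, self._rows[key]) for key in self._keys[lo:hi]]

table = SortedTable()
table.put("user#003", {"name": "cy"})
table.put("order#17", {"total": 40})
table.put("user#001", {"name": "ann"})
table.put("user#002", {"name": "bob"})

# "$" sorts immediately after "#", so this range covers every "user#" key.
users = [key for key, _ in table.scan("user#", "user$")]
print(users)
```

A scan only touches the contiguous slice between two binary-search positions, which is why key design matters so much in this family of databases.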

Voldemort

  • High scalability due to allowing only very simple key/value data access.
  • Used by LinkedIn
  • Not an object or a relational database. Just a big, distributed, fault-tolerant, persistent hash table
  • Includes in-memory caching, so separate caching tier isn’t required

MemcacheDB

  • High performance persistent storage that’s compatible with Memcache protocol

Tarantool

  • NoSQL database with messaging server
  • All data maintained in RAM. Persistence via a write ahead log.
  • Asynchronous replication and hot standby
  • Supports stored procedures
  • Data model: tuples (unique key plus any number of other fields); spaces (multiple tuples)
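
Keeping all data in RAM while persisting through a write-ahead log means every change is appended to a log file before it is applied in memory, and a restart rebuilds the in-memory state by replaying that log. The sketch below shows that pattern with tuples keyed by their first field. The `Space` class, the JSON-lines log format, and the file name are assumptions for illustration and do not reflect Tarantool's actual file format or API.

```python
import json
import os
import tempfile

class Space:
    """In-memory tuples keyed by their first field, backed by a log file."""
    def __init__(self, log_path):
        self.log_path = log_path
        self.tuples = {}
        self._replay()

    def _replay(self):
        """Rebuild the in-memory state by re-applying every logged insert."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                fields = json.loads(line)
                self.tuples[fields[0]] = tuple(fields)

    def insert(self, *fields):
        """Log the tuple to disk first, then apply it in memory."""
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(fields) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.tuples[fields[0]] = tuple(fields)

log_path = os.path.join(tempfile.mkdtemp(), "space.wal")
space = Space(log_path)
space.insert(1, "alice", "admin")
space.insert(2, "bob")

# Simulate a restart: a fresh object recovers its state from the log alone.
recovered = Space(log_path)
print(recovered.tuples)
```

Appending to a log is sequential I/O, so durability costs far less than updating an on-disk index for every write.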

Apache Cassandra

  • Can use a massive cluster of commodity servers with no single point of failure. Can be deployed across multiple data centers.
  • Was used by Facebook for Inbox Search until 2010
  • Read/write scales linearly with number of nodes
  • Data replicated across multiple nodes
  • Supports MapReduce, Pig, and Hive
  • Has SQL-like CQL providing for a hybrid between key/value and tabular database
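
Replication without a master is commonly built on a hash ring: each node owns a position on the ring, a key is hashed to a position, and the next few nodes clockwise hold its replicas. The sketch below is a simplified model of that idea. The `Ring` class, the node names, and the use of SHA-256 as the hash are assumptions, and real Cassandra details such as virtual nodes and replication strategies are left out.

```python
import bisect
import hashlib

def token(value):
    """Map a string to a position on the ring (an integer hash)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        self.ring = sorted((token(node), node) for node in nodes)
        self.tokens = [t for t, _ in self.ring]

    def owners(self, key):
        """Return the nodes holding `key`: walk clockwise from its token."""
        start = bisect.bisect_right(self.tokens, token(key)) % len(self.ring)
        count = min(self.replicas, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

ring = Ring(["n1", "n2", "n3"], replicas=2)
owners = ring.owners("user:42")

# If the first owner fails, the second replica can still serve the key.
survivors = [node for node in owners[1:]]
print(owners, survivors)
```

Any node can compute `owners` for any key on its own, which is why no coordinating master is needed to route reads and writes.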

HyperDex

  • NoSQL key/value that provides lower latency and higher throughput than some alternatives
  • Replicates data to multiple nodes
  • Very easy to administer and maintain
  • Data model: key plus zero or more attributes

Lightcloud

  • Great performance even on small clusters with millions of keys
  • Nodes replicated via master-master replication. Hot backups and restores
  • Very small client footprint
  • Built on top of Tokyo Tyrant

Sqrrl Enterprise, Accumulo, and Encryption

Sqrrl is powered by Apache Accumulo, a low-latency NoSQL database that uses Hadoop as its file system. Accumulo was originally developed for the NSA in 2008.

  • Support for both role based and attribute based security controls
  • Encryption at rest and in motion
  • Can use multiple keys
  • Trust boundaries limit the admin’s access to data
  • Encryption costs only about 10% in performance
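
Attribute-based control of the kind listed above can be pictured as a filter on individual cells: each cell carries the labels a reader must hold, and a query returns only the cells the reader is authorized for. The sketch below is a deliberately simplified model in which all labels are required (AND only). The `Cell` type, the label names, and the sample data are hypothetical; Accumulo's own visibility expressions are richer than this.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    row: str
    column: str
    value: str
    required: frozenset  # labels a reader must hold to see this cell

def visible_cells(cells, authorizations):
    """Return only the cells whose required labels the reader fully holds."""
    auths = frozenset(authorizations)
    return [cell for cell in cells if cell.required <= auths]

CELLS = [
    Cell("patient1", "name", "Pat Doe", frozenset({"staff"})),
    Cell("patient1", "diagnosis", "flu", frozenset({"staff", "medical"})),
    Cell("patient1", "room", "12", frozenset()),
]

nurse_view = [c.column for c in visible_cells(CELLS, {"staff", "medical"})]
clerk_view = [c.column for c in visible_cells(CELLS, {"staff"})]
print(nurse_view, clerk_view)
```

Filtering at the cell level lets two users query the same row and receive different columns, without maintaining separate copies of the data.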

Cassandra – NoSQL database to use in conjunction with Hadoop

Some use cases feed data directly into Hadoop from their source (such as web server logs), but others feed into Hadoop from a database repository. Still others have use cases in which there is a massive output of data that needs to be stored somewhere for post-processing. One model for handling this dataset is a NoSQL database, as opposed to SQL or flat files.

Cassandra is an Apache project that is popular for its integration into the Hadoop ecosystem. It can be used with components such as Pig, Hive, and Oozie. Cassandra is often used as a replacement for HDFS and HBase because it has no master node, which eliminates a single point of failure (and the need for traditional redundancy). In theory, its scalability is strictly linear: doubling the number of nodes doubles the number of transactions that can be processed per second. It also supports triggers; if monitoring detects that triggers are running slowly, additional nodes can be deployed programmatically to address production performance problems.

Cassandra was first developed by Facebook. The primary benefit of its easily distributed infrastructure is the ability to handle large volumes of reads and writes. The newest version (2.0) solves many of the usability problems encountered by programmers.

DataStax provides a commercially packaged version of Cassandra.

MongoDB is a good non-HBase alternative to Cassandra.

JSON and Big Data

JSON is a good fit for NoSQL databases and for analysis within Hadoop because it uses key/value pairs. Keeping the same data model throughout an application (from Hadoop, to a NoSQL database, to a web front end that uses JSON) might make sense.
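
A small sketch of what keeping one data model end to end could look like: JSON lines are parsed into key/value dicts, analyzed, reshaped into documents, and serialized back to JSON for a front end without any translation layer. The event fields (`user`, `page`, `ms`) and the function names are made up for illustration, and plain Python stands in for the Hadoop and NoSQL stages.

```python
import json
from collections import Counter

# Hypothetical input: one JSON document per line, as a batch job might read it.
LINES = [
    '{"user": "ann", "page": "/home", "ms": 120}',
    '{"user": "bob", "page": "/home", "ms": 80}',
    '{"user": "ann", "page": "/cart", "ms": 200}',
]

def parse(lines):
    """Decode each line into a dict of key/value pairs."""
    return [json.loads(line) for line in lines]

def hits_per_page(events):
    """Batch-analysis stage: count events per page."""
    return Counter(event["page"] for event in events)

def to_documents(counts):
    """Storage stage: one document per page, same key/value shape."""
    return [{"page": page, "hits": hits} for page, hits in sorted(counts.items())]

documents = to_documents(hits_per_page(parse(LINES)))

# Front-end stage: the stored documents serialize straight back to JSON.
payload = json.dumps(documents)
print(payload)
```

Every stage reads and writes the same dict shape, so adding a field to the input does not force a schema change in each layer.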