At its core, what the NSA is doing is finding anti-patterns. Crunching through huge sets of non-interesting data is the only way to find the interesting data.
Also, the Department of Defense sees the success that NSA is having with Hadoop technologies, and is considering using it (most likely Accumulo) to store large amounts of unstructured and non-schema data.
Shark utilizes in-memory SQL queries for complex analytics, and is Apache Hive compatible. The name “Shark” is supposed to be short hand for “Hive on Spark”. This seems to be a competitor to Cloudera Impala or the Hortonworks implementation of Hive.
Apache Spark utilizes APIs (Python, Scala, Java) for in-memory processing with very fast reads and writes, claiming to be 100x faster than disk-based MapReduce. Spark is the engine behind Shark. Spark can be considered as an alternative to MapReduce, not an alternative to Hadoop.
Scala is an interesting language being used by companies such as Twitter as both higher performance and easier to write than Java. Some companies that had originally developed using Rails or C++ are migrating to Scala rather than to Java.
Posted in C++, cloudera, Hive, HortonWorks, Impala, Java, MapReduce, performance, Python, Rails, Scala, Shark, Spark, SQL, Twitter
Tagged apache.org, berkeley.edu, databricks.com, gigaom.com, scala-lang.org
Hadoop mindset glorifies having as much raw data as possible. Just build more nodes if necessary. However, there is currently a lack of good meta data tools. Where did the data come from? What’s the retention policy? Who has access to read it, delete it?
CTO of Sqrrl thinks this is a result of the Hadoop environment being designed for developers, not for business users.
Hive was invented by Facebook as a data warehouse layer on top of Hadoop, and has been adopted by HortonWorks. The benefit of Hive is that it enables programmers, which years of experince in relational databases, to write MapReduce jobs using SQL. The problem is that MapReduce is slow, and Hive slows it down even further.
HortonWorks is pushing for optimization (via project Stinger) of the developer friendly toolset provided by Hive. Cloudera has abandoned Hive in favor of Impala. Rather than translate SQL queries into MapReduce, Impala implemens a massively parallel relational database ontop of HDFS.
Posted in cloudera, Data Warehouse, Facebook, hadoop, HDFS, Hive, HortonWorks, Impala, MapReduce, Relational DB, SQL, Stinger
Tagged gigaom.com, hortonworks.com
Storm is an open sourced system (from Twitter) that processes streams of big data in realtime (but without 100% guaranteed accuracy), making it the opposite of Hadoop which processes a repository of big data in batch.
Twitter has needs for both streaming and batch, so created an open sourced hybrid system called Summingbird. It does what Storm does, then uses Hadoop for error correction.
Twitter’s use cases include updating timelines and trending topics in real time, but then making sure that the analytics are accurate.
Yahoo’s contribution to this effort was to enable Storm to be configured using Yarn.