Category Archives: cloudera

Machine learning on its way from Cloudera?

In 2013 Cloudera acquired a company called Myrrix, which has morphed into project (not yet a product) called Oryx. The system still uses MapReduce, which is not optimal. Before is becomes a product it’ll be rewritten using Spark.

Oryx will enable construction of machine learning models that can process data in real time. Possible use cases are spam filters and recommendation engines (which seems to be its sweet spot).

This competes with Apache Mahout, which processes in batch mode only.

Source:

FUSE security

CDH enables use of Kerberos to securely mount a filesystem via FUSE

Source
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Security-Guide/cdh4sg_topic_12.html

Databricks to commercialize Spark and Shark in-memory processing

Shark utilizes in-memory SQL queries for complex analytics, and is Apache Hive compatible. The name “Shark” is supposed to be short hand for “Hive on Spark”. This seems to be a competitor to Cloudera Impala or the Hortonworks implementation of Hive.

Apache Spark utilizes APIs (Python, Scala, Java) for in-memory processing with very fast reads and writes, claiming to be 100x faster than disk-based MapReduce. Spark is the engine behind Shark. Spark can be considered as an alternative to MapReduce, not an alternative to Hadoop.

Scala is an interesting language being used by companies such as Twitter as both higher performance and easier to write than Java. Some companies that had originally developed using Rails or C++ are migrating to Scala rather than to Java.

Source:

Spotify Embraces Hortonworks, Dumps Cloudera

Cloudera had appeared to be the defacto standard of Hadoop distributions, but Hortonworks has scored big in this deal. Spotify has a 690 node Cloudera cluster that it will be moving to a Hortonworks cluster (undisclosed size). Apparently it’s the new Hive implementation that makes Hortonworks so attractive.

When Spotify launched in 2008 it had a 30 node cluster hosted on Amazon’s AWS, then switched to an on-premises 60 node cluster that grew to 690 nodes. The cluster currently contains 4 petabytes of data which grows by 200 gigabytes per day.

Spotify has a 12 person Hadoop team and uses a Python (not Java) framework for batch processing.

Source:

Email indexing using Cloudera Search

This article from Cloudera offers up use cases (such as customer sentiment) and a tutorial for using Apache Flume for near-real-time indexing (as emails arrive on your mail server) or MapReduce (actually MapReduceIndexerTool) for batch indexing of email archives. The two methods can be combined if you decide to do real-time, but later decide to add another MIME header field into the index.

Cloudera Search is based on Apache Solr (which contains components like Apache Lucene, SolrCloud, Apache Tika, and Solr Cell).

The email (including the MIME header) is parsed (with the help of Cloudera Morphlines), then uses Flume to push the messages into HDFS, as Solr intercepts and indexes the contents of the email fields.

Searching and viewing the results can be done using the Solr GUI or Hue’s search application.

Sources:

Hortonworks vs Cloudera

The differences between the Hadoop distributions between these two companies can be summarized as:

  • Hortonworks contributes to Apache applications that others aren’t distributing, but the Hortonworks distribution is 100% open source
  • Cloudera contributes to core Apache applications, but also includes proprietary applications in its Hadoop distribution

Source:

Cloudera Search Engine

Cloudera has announced a realtime search engine running on top of HBase and HDFS, enabling natural language keyword searches.

Indices are stored in HDFS and indexing takes place in batches using MapReduce. Realtime indexing happens via Flume and the Lily HBase indexer.

Source:

HortonWorks trying to make Hive faster, contrasting it to Impala

Hive was invented by Facebook as a data warehouse layer on top of Hadoop, and has been adopted by HortonWorks. The benefit of Hive is that it enables programmers, which years of experince in relational databases, to write MapReduce jobs using SQL. The problem is that MapReduce is slow, and Hive slows it down even further.

HortonWorks is pushing for optimization (via project Stinger) of the developer friendly toolset provided by Hive. Cloudera has abandoned Hive in favor of Impala. Rather than translate SQL queries into MapReduce, Impala implemens a massively parallel relational database ontop of HDFS.

Sources:

Cisco UCS Common Platform Architecture (CPA) for Big Data with Cloudera

Cisco has created two reference architectures for Cloudera’s CDH “high performance” and “high capacity” configurations. I suppose that there’s no need for a reference architecture for “light processing” or “balanced”.

Source:

Using the R programming language with Hadoop to create graphical views of statistical models

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques. One of R’s strengths is the ease with which well-designed publication-quality plots can be produced.

The update to R, from Revolution Analytics, is significant because previously data had to be moved into a R environment in order to process and plot the data. This update enables R to run within the Cloudera Hadoop enviornment so that data does not need to be moved out of HDFS, across the network, and onto another machine for processing.

I think that this is significant because R enables a single page graphic to represent the analysis on data (that is potentially petabytes in size). Seems to me that R takes as input the data that is generated by the Reduce portion of MapReduce.

Sources: