Category Archives: HDFS

FUSE security

CDH supports using Kerberos to securely mount a filesystem via FUSE.

Source
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Security-Guide/cdh4sg_topic_12.html

HDFS NFS Proxy is an NFS4 server for HDFS

Clients should access an HDFS mount through FUSE in production, but the NFS proxy gets an initial deployment running faster. The wiki linked below explains how.

Source:
https://github.com/cloudera/hdfs-nfs-proxy/wiki

In-memory Hadoop – use it when speed matters

GridGain has a 100% HDFS-compatible RAM solution that it claims is 10x faster for IO- and network-intensive MapReduce processing. I understand the IO gain, but am not sure why it would help with network-intensive operations. It can be used standalone or alongside disk-based HDFS as a cache. It is compatible with all Hadoop distributions as well as standard tools like HBase, Hive, etc.

Source:

Email indexing using Cloudera Search

This article from Cloudera offers use cases (such as customer sentiment) and a tutorial on two indexing methods: Apache Flume for near-real-time indexing (as emails arrive on your mail server) and MapReduce (specifically MapReduceIndexerTool) for batch indexing of email archives. The two methods can be combined, for example if you start with real-time indexing and later decide to add another MIME header field to the index.

Cloudera Search is based on Apache Solr (which contains components like Apache Lucene, SolrCloud, Apache Tika, and Solr Cell).

The email (including the MIME header) is parsed with the help of Cloudera Morphlines. Flume then pushes the messages into HDFS, while Solr intercepts and indexes the contents of the email fields.
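To make the parsing step concrete, here is a minimal sketch of turning a raw email into a flat set of fields ready for indexing. It uses Python's standard `email` module rather than Morphlines, and the sample message, the `email_to_document` helper, and the field names are all assumptions made up for illustration; the Cloudera tutorial defines its own pipeline and schema.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Hypothetical sample message, not taken from the Cloudera article.
RAW_MESSAGE = """\
From: Alice Example <alice@example.com>
To: support@example.com
Subject: Order arrived late
Date: Mon, 06 Jan 2014 10:15:00 +0000
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

The package arrived a week late and I am not happy.
"""


def email_to_document(raw, header_fields=("From", "To", "Subject")):
    """Parse a raw email into a flat dict of fields suitable for indexing.

    header_fields controls which MIME headers become index fields.
    """
    msg = message_from_string(raw)

    # One field per requested header, keyed by the lowercased header name.
    doc = {name.lower(): msg.get(name, "") for name in header_fields}

    # Bare sender address, without the display name.
    doc["from_address"] = parseaddr(msg.get("From", ""))[1]

    # Normalize the Date header to ISO 8601 when it is present.
    date_header = msg.get("Date")
    doc["date"] = (
        parsedate_to_datetime(date_header).isoformat() if date_header else None
    )

    # Only single-part bodies are handled in this sketch.
    doc["body"] = "" if msg.is_multipart() else msg.get_payload().strip()
    return doc


doc = email_to_document(RAW_MESSAGE)
print(doc["subject"], "|", doc["from_address"], "|", doc["date"])
```

Adding another MIME header to the index later is then a matter of passing a longer `header_fields` tuple (for example including `"Content-Type"`) and re-running the parse over the archive, which is the kind of job the batch path is suited to.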

Searching and viewing the results can be done using the Solr GUI or Hue’s search application.

Sources:

eBay discusses failover and time to recovery with HBase containing tens of petabytes of data

eBay worked with Hortonworks and ScaledRisk to improve Mean Time to Recovery (MTTR). This required not only faster recovery once a failure was known, but also faster detection of failures in the first place.
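The reason detection matters can be shown with simple arithmetic: the outage seen by clients is the time to notice the failure plus the time to repair it. The sketch below uses made-up numbers purely for illustration; they are not eBay's measurements, and `time_to_recovery` is a hypothetical helper.

```python
def time_to_recovery(detection_seconds, repair_seconds):
    """Total outage for one failure: time to notice it plus time to fix it."""
    return detection_seconds + repair_seconds


# Hypothetical numbers, chosen only to illustrate the point.
baseline = time_to_recovery(detection_seconds=180, repair_seconds=60)
faster_repair = time_to_recovery(detection_seconds=180, repair_seconds=30)
faster_detection = time_to_recovery(detection_seconds=30, repair_seconds=60)

print(baseline, faster_repair, faster_detection)
```

When detection dominates the total, as in these invented figures, halving the repair time barely moves the result, while shortening detection cuts it sharply.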

The types of failures considered included the following, but only Node/Region server failures were included in the tests. The HBase tables contained 900 million rows.

  • Node/Region server failed while writing
  • Node/Region server failed while reading
  • Rack failure
  • Whole cluster failure
  • Machine reboot (due to CPU temperature)
  • NIC speed steps down to 100 Mb/s from gigabit speeds

The tests had favorable results, with improvements submitted (some implemented, some proposed) into Apache HBase and HDFS.

Sources:

Hortonworks tutorial on streaming server log data into HDFS using Flume

Source:

Cloudera Search Engine

Cloudera has announced a real-time search engine running on top of HBase and HDFS, enabling natural language keyword searches.

Indices are stored in HDFS. Batch indexing runs as MapReduce jobs, while real-time indexing happens via Flume and the Lily HBase Indexer.

Source: