Category Archives: Flume

Email indexing using Cloudera Search

This article from Cloudera presents use cases (such as customer sentiment analysis) and a tutorial on indexing email with Cloudera Search: Apache Flume for near-real-time indexing as emails arrive on your mail server, or MapReduce (specifically MapReduceIndexerTool) for batch indexing of email archives. The two methods can be combined, for example if you start with near-real-time indexing and later decide to add another MIME header field to the index.

Cloudera Search is based on Apache Solr (which contains components like Apache Lucene, SolrCloud, Apache Tika, and Solr Cell).

Each email, including its MIME headers, is parsed with the help of Cloudera Morphlines. Flume then pushes the messages into HDFS, while Solr intercepts them and indexes the contents of the email fields.
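To make the parsing step concrete, here is a minimal sketch of turning a raw email into a flat record of fields that a search index could store. It uses Python's standard `email` module purely for illustration; the Cloudera tutorial does this with Morphlines, not Python, and the field names (`from`, `to`, `subject`, `date`, `body`), the sample message, and the helper `email_to_document` are assumptions of this sketch.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Hypothetical sample message; addresses and content are made up.
RAW_EMAIL = """\
From: Alice Example <alice@example.com>
To: bob@example.com
Subject: Order feedback
Date: Mon, 06 Jan 2014 10:15:00 +0000
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

The delivery was late, but support was helpful.
"""


def email_to_document(raw):
    """Parse a single-part email into a dict of indexable fields."""
    msg = message_from_string(raw)
    return {
        # parseaddr splits "Name <addr>" into (name, addr); keep the address.
        "from": parseaddr(msg["From"])[1],
        "to": parseaddr(msg["To"])[1],
        "subject": msg["Subject"],
        # Normalize the RFC 2822 date header to ISO 8601.
        "date": parsedate_to_datetime(msg["Date"]).isoformat(),
        # For a non-multipart message, get_payload() returns the body text.
        "body": msg.get_payload().strip(),
    }


doc = email_to_document(RAW_EMAIL)
for field, value in doc.items():
    print(f"{field}: {value}")
```

Adding another MIME header to the index later would amount to adding one more key here and re-processing the archive, which is where the batch path comes in.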

Searching and viewing the results can be done using the Solr GUI or Hue’s search application.

Sources:

HortonWorks tutorial on streaming server log data into HDFS using Flume

Source:

Cloudera Search Engine

Cloudera has announced a real-time search engine running on top of HBase and HDFS that supports natural-language keyword searches.

Indices are stored in HDFS. Batch indexing runs on MapReduce, while real-time indexing happens via Flume and the Lily HBase Indexer.

Source:

Article on IBM DeveloperWorks

Open Source Big Data for the Impatient, Part 1: Hadoop tutorial: Hello World with Java, Pig, Hive, Flume, Fuse, Oozie, and Sqoop with Informix, DB2, and MySQL

Source:

Moving data between Hadoop and relational databases

Sqoop

  • Tool for bi-directional data transfer between Hadoop and relational databases using JDBC.
  • Optimized drivers for specific database vendors are available.
  • Command-line tool.
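As a rough illustration of the two directions, the sketch below assembles `sqoop import` and `sqoop export` command lines in Python. The flags used (`--connect`, `--username`, `-P`, `--table`, `--target-dir`, `--export-dir`) are standard Sqoop options; the JDBC URL, table names, HDFS paths, user name, and the helper function names are hypothetical placeholders.

```python
import shlex


def sqoop_import_args(jdbc_url, table, target_dir, username):
    """Arguments for copying a database table into an HDFS directory."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",  # prompt for the password instead of passing it on the command line
        "--table", table,
        "--target-dir", target_dir,
    ]


def sqoop_export_args(jdbc_url, table, export_dir, username):
    """Arguments for copying files in an HDFS directory back into a table."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "-P",
        "--table", table,
        "--export-dir", export_dir,
    ]


# Hypothetical connection details, for illustration only.
URL = "jdbc:mysql://db.example.com/sales"
import_cmd = sqoop_import_args(URL, "orders", "/data/sales/orders", "etl_user")
export_cmd = sqoop_export_args(URL, "order_totals", "/data/sales/totals", "etl_user")

# Print shell-safe command lines; running them requires a Sqoop installation.
print(" ".join(shlex.quote(arg) for arg in import_cmd))
print(" ".join(shlex.quote(arg) for arg in export_cmd))
```

The commands are only printed here, since actually running them needs Sqoop, a reachable database, and the matching JDBC driver.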

Flume and Flume NG (Next Generation)

  • Enables real-time streaming of data into HDFS and HBase.
  • The typical use case is continuously arriving data, such as web server logs.
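For the web server log case, a Flume NG agent is described by a properties file that wires a source to a sink through a channel. The sketch below generates such a file from Python, assuming an `exec` source that tails a log file, a `memory` channel, and an `hdfs` sink. The property keys follow the Flume NG configuration format; the agent name, component names (`src1`, `ch1`, `sink1`), file paths, and the helper `flume_agent_config` are hypothetical.

```python
def flume_agent_config(agent, log_path, hdfs_path):
    """Render a minimal Flume NG agent config: exec source -> memory channel -> HDFS sink."""
    props = {
        # Declare the components that make up this agent.
        f"{agent}.sources": "src1",
        f"{agent}.channels": "ch1",
        f"{agent}.sinks": "sink1",
        # Source: follow the log file as new lines are appended.
        f"{agent}.sources.src1.type": "exec",
        f"{agent}.sources.src1.command": f"tail -F {log_path}",
        f"{agent}.sources.src1.channels": "ch1",
        # Channel: buffer events in memory between source and sink.
        f"{agent}.channels.ch1.type": "memory",
        # Sink: write events to HDFS as a plain data stream.
        f"{agent}.sinks.sink1.type": "hdfs",
        f"{agent}.sinks.sink1.hdfs.path": hdfs_path,
        f"{agent}.sinks.sink1.hdfs.fileType": "DataStream",
        f"{agent}.sinks.sink1.channel": "ch1",
    }
    return "\n".join(f"{key} = {value}" for key, value in props.items()) + "\n"


# Hypothetical paths, for illustration only.
config_text = flume_agent_config(
    "weblog_agent",
    "/var/log/httpd/access_log",
    "hdfs://namenode.example.com/flume/weblogs",
)
print(config_text)
```

A memory channel keeps the example short but loses buffered events if the agent stops; a file channel is the usual choice when that matters.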

Sources: