Category Archives: Sqoop

Understanding Connectors and Drivers in the World of Sqoop

Sqoop is a tool for efficient bulk loads and extracts between relational databases (RDBMS) and Hadoop.

This ecosystem has enough made-up words that it is important to get the commonplace, industry-standard terms right: “JDBC driver” and “JDBC connector”.

  • Driver refers to a JDBC driver.
  • Connector can be generic or vendor-specific.
    • Sqoop’s Generic JDBC connector is always available as part of the standard distribution.
    • The distribution also includes connectors for MySQL, PostgreSQL, Oracle, MS SQL, IBM DB2, and Netezza. However, the DB vendors (or someone else) might offer customized/optimized connectors.
    • If the programmer doesn’t select a connector, or if the data source is not known until runtime, Sqoop can try to figure out the appropriate connector. Sometimes this is easy, such as when the URL to access the data looks like jdbc:mysql://…
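A sketch of that difference in practice (host, credentials, table, and paths here are hypothetical): passing a jdbc:mysql:// URL lets Sqoop pick its MySQL connector from the URL scheme, while adding --driver forces the Generic JDBC connector with the named driver class.

```shell
# Connector inferred from the URL scheme: jdbc:mysql:// selects Sqoop's MySQL connector
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/orders

# Naming a driver class with --driver makes Sqoop fall back to the Generic JDBC connector
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --driver com.mysql.jdbc.Driver \
  --username etl_user -P \
  --table orders \
  --target-dir /data/orders
```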


Apache Ambari: A suite of applications/components to provision, manage, and monitor Hadoop clusters

System Admins:

Provision

  • Wizard for installing/configuring Hadoop services across many hosts

Manage

  • Start, stop, reconfigure Hadoop across many hosts

Monitor

  • Dashboard for health & status
  • Metrics via Ganglia (Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids)
  • Alerting via Nagios

Developers:

  • Integrate provisioning, management, and monitoring into their own applications using the Ambari REST APIs
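A minimal sketch of those REST APIs from the command line (the cluster name, host, and credentials are made up; Ambari listens on port 8080 by default, and write operations require the X-Requested-By header):

```shell
# Monitor: read the current state of the HDFS service
curl -u admin:admin \
  http://ambari-host:8080/api/v1/clusters/MyCluster/services/HDFS

# Manage: stop the service by setting its desired state to INSTALLED
curl -u admin:admin -H 'X-Requested-By: ambari' \
  -X PUT -d '{"ServiceInfo": {"state": "INSTALLED"}}' \
  http://ambari-host:8080/api/v1/clusters/MyCluster/services/HDFS
```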

These tools are supported by Ambari:

  • HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, Sqoop

Sources:

Article on IBM DeveloperWorks

Open Source Big Data for the Impatient, Part 1: Hadoop tutorial: Hello World with Java, Pig, Hive, Flume, Fuse, Oozie, and Sqoop with Informix, DB2, and MySQL

Source:

Moving data between Hadoop and relational databases

Sqoop

  • Tool for bi-directional data transfer between Hadoop and relational databases using JDBC.
  • Optimized drivers for specific database vendors are available.
  • Command line tool
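Bi-directional in practice means paired import and export commands; a hedged sketch (database, table, and directory names are hypothetical):

```shell
# RDBMS -> HDFS: pull the orders table into HDFS files
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/orders

# HDFS -> RDBMS: push aggregated results back into a summary table
sqoop export \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table order_totals \
  --export-dir /data/order_totals
```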

Flume and FlumeNG (Next Generation)

  • Enables real-time streaming into HDFS and HBase.
  • The use case for Flume is streaming data, such as continual input from web server logs.
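As an illustration of the web-server-log use case, a minimal FlumeNG agent configuration (the agent name, log path, and host names are hypothetical) that tails an access log through a memory channel into HDFS:

```properties
# One agent (a1) with one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail the web server access log
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access_log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write events into date-partitioned HDFS directories
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/weblogs/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```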
