Tag Archives: hbase.apache.org

Apache Hadoop and its components

Hadoop consists of two components

  1. MapReduce –
    • programming framework
    • Map
      • distributes work to different Hadoop nodes
    • Reduce
      • gathers results from multiple nodes and resolves them into a single value
      • the source come from HDFS, and the output is typically written back to HDFS
    • Job Tracker: manages nodes
    • Task Tracking: takes orders from Job Traker
    • MapReduce originally developed by Google.
    • Apache MapReduce is built on top of Apache YARN which is a framework for job scheduling and cluster resource management.
  2. HDFS (Hadoop Distributed File System) – file store
    • It is neither a file system nor a database, it’s neither yet it’s both.
    • Within HDFS are two components
      • Data Nodes:
        • data repository
      • Name Nodes:
        • where to find the data; maps blocks of data on slave nodes (where job and task trackers are running)
        • Open, Close, Rename files
    • On top of HDFS you can run HBase
      • Super scalable (billions of rows, millions of columns)┬árepository for key-value pairs
      • This is not a database, cannot have multiple indices