Tag Archives: hortonworks.com

Facebook compresses its 300 petabyte Hadoop Hive data warehouse layer by factor of 8x

Facebook’s 300 PB data warehouse grows by approximately 600 TB per day and resides on more than 100k servers (although I’m not certain how many of those are Hadoop nodes). With the brute force approach of more storage and more servers reaching a logistical limit, the Facebook engineers have increased their level of data compression to 8x (using a custom modification of the Hortonworks ORCFile) from a previous 5x (using RCFile) compression. The Hortonworks ORCFile is generally faster than RCFile when reading, but is slower on writing. Facebook’s custom ORCFile was always fastest on both read and write and also the best compression.


I spent some time today using the Hortonworks Hadoop sandbox

I downloaded the Hortonworks sandbox today. I’m using the version that runs as a virtual machine under Oracle VirtualBox. The sandbox can run in as little as 2GB RAM, but requires 4GB in order to enable Ambari and HBase. Good thing that I have 8GB in my laptop.

The “Hello World” tutorial provided me with hands on:

  • Uploading a file into HCatalog
  • Typing queries into Beeswax, which is a GUI into Hive
  • Running a more complex query by writing a short script in Pig

There are a lot more tutorials. I’ll update this blog post after I finish each tutorial.


Fast Search and Analytics on Hortonworks with Elasticsearch

Elasticworks enables real-time searching and analytics. Yarn is supported. Integration extends into Hive and Pig.


Big Data Tutorials

Proposed updates to Hive to support ACID transactions

HortonWorks developed solutions to add into Hive the ability to update multiple records as a single transaction following the ACID model. Part of the complexity of transactional updates is that the data must be written to all applicable nodes before the transaction can be considered complete. The naming convention within HDFS folders includes a transaction ID so that both committed and uncommitted files persist until all portions of the transaction have been completed. Because the transaction ID is included, any read operations that occur before the transaction has completed will access the old data.

Why go through all of this work to add an ACID model to Hive rather than just use HBase, which already supports transactions. The primary reason is that HBase only supports Consistency at the level of a single row update, rather than with a larger set of operations. Without Consistency, there is no ACID. HortonWorks lists a few other reasons, but I’m discounting them because they are general reasons why they prefer Hive over HBase.


eBay discusses failover and time to recovery with HBase containing tens of petabytes of data

eBay worked with HortonWorks and ScaledRisk to improve Mean Time to Recovery (MTTR). Not only did this require faster recovery time, but also faster detection of failures.

The types of failures considered included the following, but only Node/Region server failures were included in the tests. The HBase tables contained 900 million rows.

  • Node/Region server failed while writing
  • Node/Region server failed while reading
  • Rack failure
  • Whole cluster failure
  • Machine reboot (due to CPU temperature)
  • NIC speed steps down to 100Mb/s from gigabit speeds

The tests had favorable results, with improvements submitted (some implemented, some proposed) into Apache HBase and HDFS.


Yarn provides greater scalability than MapReduce

MapReduce could have resource utilization problems because an arbitrary process could have allocated all map slots, while some reduce slots are empty (and vise-versa). Yarn (Yet Another Resource Negotiator) splits the JobTracker into a global ResourceManager (RM) and a per-application ApplicationMaster (AM) which works with the per-machine NodeManagers to execute and monitor tasks.  The ResourceManager has a Scheduler which only schedules (does not monitor).

The ApplicationMaster is far more scalable in 2.0 than 1.0. HortonWorks has successful simulated a 10k node cluster. This is possible due to the ApplicationMaster not being global to the entire cluster but rather has an instance per application so is no longer a bottleneck through which all applications must pass.

The ResourceManager is also more scalable in 2.0 since it’s scope is reduced to scheduling and no longer is responsible for fault tolerance of the entire cluster.


HortonWorks tutorial on streaming server log data into HDFS using Flume


Hadoop’s part in a Modern Data Architecture

Hadoop must:

  • Integrate with existing infrastructure. You can’t expect a green field. It can certainly replace some existing components, but will need to augment others even if it capable of replacing them.
  • Utilize existing staff. Hive isn’t optimal, but there is value in having existing staff who understand existing systems, business needs, and data sets work in the new Hadoop environment.


How to Plan and Configure YARN

Good (and very short) article about configuring Yarn, specifically allocating CPUs and RAM for the OS, JVMs, and MapReduce