Category Archives: HortonWorks

Facebook compresses its 300 petabyte Hadoop Hive data warehouse layer by a factor of 8x

Facebook’s 300 PB data warehouse grows by approximately 600 TB per day and resides on more than 100k servers (although I’m not certain how many of those are Hadoop nodes). With the brute-force approach of more storage and more servers reaching a logistical limit, the Facebook engineers have increased their data compression to 8x (using a custom modification of the Hortonworks ORCFile) from a previous 5x (using RCFile). ORCFile is generally faster than RCFile when reading, but slower when writing. Facebook’s custom ORCFile was consistently the fastest on both reads and writes and also achieved the best compression.
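For a sense of what the ORC side of this looks like in practice, here is a minimal HiveQL sketch of declaring an ORC-backed table with explicit compression and loading it from an older RCFile table (the table and column names are hypothetical, not Facebook’s schema):

-- Hypothetical table stored in ORC format with ZLIB compression
CREATE TABLE page_views_orc (
  user_id   BIGINT,
  page_url  STRING,
  view_time TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');

-- Copy data out of an existing RCFile-backed table to pick up the smaller ORC footprint
INSERT OVERWRITE TABLE page_views_orc SELECT * FROM page_views_rcfile;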

Source:

Hortonworks Sandbox Hive tutorial

I much preferred this Hive tutorial to the previous one using Pig. Using the same dataset in each example made the comparison clearer.

Pig makes sense for sequential steps, such as an ETL job. Hive seemed better suited to the kinds of tasks for which we’d write stored procedures in a more traditional database server.

Another difference came with debugging.

  • The Pig editor bundled into the Hortonworks sandbox isn’t very sophisticated as IDEs go: no breakpoints, no viewing of data, etc. Perhaps there’s a way to accomplish these things, but (thankfully) it isn’t covered in such an early-stage tutorial. There’s a button to upload a UDF jar, so I’ve got to research how one develops that jar outside of the Pig script editor.
  • The Hive tutorial makes it easier to view progress at each step, since you can think of each step as an independent SQL (actually HiveQL) statement (see the sketch after this list). If the programming task were far more complex, I could see myself structuring the Pig scripts in a way that might be easier to debug than Hive.
  • Hive seemed good for an ad-hoc query and Pig for a complex procedural task.
  • The next tutorial combines Pig and Hive. I’ll see how that shapes my perceptions.
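For comparison with the Pig script in the post below, here is roughly how the tutorial’s “highest runs per year” result reads as a single HiveQL statement (the table and column names are my guesses at the schema, not necessarily the tutorial’s exact names):

-- For each year, find the player with the maximum runs (hypothetical schema)
SELECT b.year, b.player_id, b.runs
FROM batting b
JOIN (
  SELECT year, MAX(runs) AS max_runs
  FROM batting
  GROUP BY year
) m ON b.year = m.year AND b.runs = m.max_runs;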

Hortonworks Sandbox Pig tutorial

I just completed the Hortonworks Pig tutorial. It seemed very straightforward, yet I ran into one problem.

The Pig script as specified was:

batting = load 'Batting.csv' using PigStorage(',');
runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
dump join_data;

Yet it generated an error. I wasn’t able to understand the logs well enough (yet!) to debug it, so I fell back to Googling and found this:

http://hortonworks.com/community/forums/topic/error-while-running-sand-box-tutorial-for-pig-script/

As best I can understand, the input data has a header row, yet the script assumes no column headers. So the fix is to filter out any row with non-numeric data.

batting = load 'Batting.csv' using PigStorage(',');
runs_raw = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
runs = filter runs_raw by runs > 0;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
dump join_data;

I suppose there’s also a way to filter out just the first row, but my Pig isn’t anywhere near good enough for that yet.

Other than that, Pig seems interesting: sort of a procedural-programming-language take on a subset of what the next tutorial shows us in Hive.

I spent some time today using the Hortonworks Hadoop sandbox

I downloaded the Hortonworks sandbox today. I’m using the version that runs as a virtual machine under Oracle VirtualBox. The sandbox can run in as little as 2 GB of RAM, but requires 4 GB to enable Ambari and HBase. Good thing I have 8 GB in my laptop.

The “Hello World” tutorial provided me with hands-on experience in:

  • Uploading a file into HCatalog
  • Typing queries into Beeswax, a GUI front end for Hive (see the sketch after this list)
  • Running a more complex query by writing a short script in Pig
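The queries typed into Beeswax are ordinary HiveQL; for example, something along these lines (a made-up query against a hypothetical table, not the tutorial’s exact steps):

-- A simple ad-hoc HiveQL query of the kind entered into Beeswax
SELECT year, COUNT(*) AS player_count
FROM batting
GROUP BY year
ORDER BY year;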

There are a lot more tutorials. I’ll update this blog post after I finish each tutorial.

Sources:

Fast Search and Analytics on Hortonworks with Elasticsearch

Elasticsearch enables real-time search and analytics on Hadoop. YARN is supported, and the integration extends into Hive and Pig.
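As a sketch of what the Hive side of that integration looks like, an external Hive table can be mapped onto an Elasticsearch index through the elasticsearch-hadoop storage handler (the index, table, and columns below are hypothetical, and the es-hadoop jar needs to be on Hive’s classpath):

-- External Hive table backed by a (hypothetical) Elasticsearch index
CREATE EXTERNAL TABLE logs_es (
  ts      TIMESTAMP,
  level   STRING,
  message STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES ('es.resource' = 'logs/entry');

-- Queries against the table are served by Elasticsearch
SELECT level, COUNT(*) FROM logs_es GROUP BY level;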

Source:

Databricks to commercialize Spark and Shark in-memory processing

Shark provides in-memory SQL queries for complex analytics and is Apache Hive compatible. The name “Shark” is supposed to be shorthand for “Hive on Spark”. This seems to be a competitor to Cloudera Impala or the Hortonworks implementation of Hive.

Apache Spark provides APIs in Python, Scala, and Java for in-memory processing with very fast reads and writes, and claims to be up to 100x faster than disk-based MapReduce. Spark is the engine behind Shark. Spark can be considered an alternative to MapReduce, not an alternative to Hadoop.

Scala is an interesting language, used by companies such as Twitter because it offers both higher performance than Java and code that is easier to write. Some companies that originally developed in Rails or C++ are migrating to Scala rather than to Java.

Source:

Perspective on Big Data from people who do high-frequency electronic trading (HFT)

There’s limited potential for further throughput improvements in high-performance transactional databases. Financial institutions are looking to Hadoop to supplement their application stacks, but they need to accept these cultural differences:

  • Data quality is not 100%. You must either use algorithms to refine the data on an ongoing basis during the transaction, or look for use cases where close is close enough. For example, Spotify uses Hadoop (Hortonworks) to select song recommendations. You probably wouldn’t use Big Data results to decide which stocks to trade, at what second, and at what price.
  • Batch will never be real time. Some users are able to get algorithms to complete in hours rather than days, but even if the hours can be reduced to some number of minutes, the evolution of Hadoop does not seem to be approaching real time.

source:

Spotify Embraces Hortonworks, Dumps Cloudera

Cloudera had appeared to be the de facto standard among Hadoop distributions, but Hortonworks has scored big in this deal. Spotify has a 690-node Cloudera cluster that it will be moving to a Hortonworks cluster (of undisclosed size). Apparently it’s the new Hive implementation that makes Hortonworks so attractive.

When Spotify launched in 2008, it had a 30-node cluster hosted on Amazon’s AWS, then switched to an on-premises 60-node cluster that grew to 690 nodes. The cluster currently contains 4 petabytes of data and grows by 200 gigabytes per day.

Spotify has a 12-person Hadoop team and uses a Python (not Java) framework for batch processing.

Source:

Proposed updates to Hive to support ACID transactions

Hortonworks has developed proposed updates that add to Hive the ability to update multiple records as a single transaction, following the ACID model. Part of the complexity of transactional updates is that the data must be written to all applicable nodes before the transaction can be considered complete. The file-naming convention within HDFS folders includes a transaction ID, so both committed and uncommitted files persist until all portions of the transaction have completed. Because the transaction ID is included, any read operations that begin before the transaction has completed will access the old data.

Why go through all of this work to add an ACID model to Hive rather than just use HBase, which already supports transactions? The primary reason is that HBase only supports consistency at the level of a single-row update, rather than across a larger set of operations. Without consistency, there is no ACID. Hortonworks lists a few other reasons, but I’m discounting them because they are general reasons why they prefer Hive over HBase.
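As a rough sketch of what this is aiming to support, using the syntax the Hive ACID work eventually settled on (the table and update below are hypothetical, and transactions have to be enabled on the cluster):

-- ACID tables must be bucketed, stored as ORC, and flagged as transactional
CREATE TABLE accounts (
  id      INT,
  balance DECIMAL(10,2)
)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- A multi-row update treated as a single transaction; reads that begin
-- before it commits continue to see the old data
UPDATE accounts SET balance = balance * 1.01 WHERE balance > 0;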

Source:

eBay discusses failover and time to recovery with HBase containing tens of petabytes of data

eBay worked with Hortonworks and ScaledRisk to improve Mean Time to Recovery (MTTR). This required not only faster recovery times, but also faster detection of failures.

The types of failures considered included the following, but only Node/Region server failures were covered in the tests. The HBase tables contained 900 million rows.

  • Node/Region server failed while writing
  • Node/Region server failed while reading
  • Rack failure
  • Whole cluster failure
  • Machine reboot (due to CPU temperature)
  • NIC speed steps down to 100Mb/s from gigabit speeds

The tests had favorable results, with improvements submitted (some implemented, some proposed) into Apache HBase and HDFS.

Sources: