Category Archives: tutorial

Hortonworks Sandbox Hive tutorial

I much preferred this tutorial in Hive, rather than the previous one using Pig. Using the same dataset in each example made the comparison clearer.

Pig makes sense for sequential steps, such as an ETL job. Hive seemed better suited for tasks comparable to ones in which we’d write stored procedures within a more traditional database server.

Another difference came with debugging.

  • The Pig editor bundled into the Hortonworks sandbox isn’t very sophisticated as IDEs go. No breakpoints, viewing of data, etc. Perhaps there’s a way to accomplish this, but (thankfully) it isn’t covered in such an early stage tutorial. There’s a button to upload a UDF jar, so I’ve got to research how one develops that jar outside of the Pig script editor.
  • The Hive tutorial makes it easier to view progress at each step, since you can think of each step as an independent SQL (actually HiveQL) statement. If the programming task were far more complex, I could see myself structuring the Pig scripts in a way that might be easier to debug than Hive.
  • Hive seemed good for an ad-hoc query and Pig for a complex procedural task.
  • The next tutorial combines Pig and Hive. I’ll see how that shapes my perceptions.

Hortonworks Sandbox Pig tutorial

I just completed the Hortonworks Pig tutorial. Seemed very straight forward, yet I ran into one problem.

The PIG script as specified was:

batting = load ‘Batting.csv’ using PigStorage(‘,’);
runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
dump join_data;

Yet it generated an error. I wasn’t able to understand the logs well enough (yet!) to debug it, so fell back to Google’ing it and found this.

Best I can understand, the input data has column headers yet the script assumes no column headers. So the fix is to filter out any row with non-numeric data.

batting = load ‘Batting.csv’ using PigStorage(‘,’);
runs_raw = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
runs = filter runs_raw by runs > 0;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
dump join_data;

I suppose that there’s also a way to filter out the first row but my Pig isn’t anywhere near good enough for that.

Other than that, Pig seems interesting. Sort of a procedural programming language version of a subset of what the next tutorial shows us in Hive.

I spent some time today using the Hortonworks Hadoop sandbox

I downloaded the Hortonworks sandbox today. I’m using the version that runs as a virtual machine under Oracle VirtualBox. The sandbox can run in as little as 2GB RAM, but requires 4GB in order to enable Ambari and HBase. Good thing that I have 8GB in my laptop.

The “Hello World” tutorial provided me with hands on:

  • Uploading a file into HCatalog
  • Typing queries into Beeswax, which is a GUI into Hive
  • Running a more complex query by writing a short script in Pig

There are a lot more tutorials. I’ll update this blog post after I finish each tutorial.


Apache Hive: 5 facts

  1. Hive is a SQL-like layer on top of Hadoop
  2. Use it when you have some sort of structure to your data.
  3. You can use JDBC and ODBC drivers to interface with your traditional systems. However, it’s not high performance.
  4. Originally built by (and still used by) Facebook to bring traditional database concepts into Hadoop in order to perform analytics. Also used by Netflix to run daily summaries.
  5. Pig is sometimes compared to Hive, in that they are both “languages” that are layered on top of Hadoop. However, Pig is more analogous to a procedural language to write applications, while Hive is targeted at traditional DB programmers moving over to Hadoop.


Big Data as a Service provider has free developer account

Founders of Qubole built some of the big data technology at Facebook (scaled to 25 petabytes). Their new company has a hosted Hadoop infrastructure. Interesting small and free accounts take the IT configuration out of learning Hadoop.


Two part article of Hello World for Hadoop


Big Data Tutorials

Email indexing using Cloudera Search

This article from Cloudera offers up use cases (such as customer sentiment) and a tutorial for using Apache Flume for near-real-time indexing (as emails arrive on your mail server) or MapReduce (actually MapReduceIndexerTool) for batch indexing of email archives. The two methods can be combined if you decide to do real-time, but later decide to add another MIME header field into the index.

Cloudera Search is based on Apache Solr (which contains components like Apache Lucene, SolrCloud, Apache Tika, and Solr Cell).

The email (including the MIME header) is parsed (with the help of Cloudera Morphlines), then uses Flume to push the messages into HDFS, as Solr intercepts and indexes the contents of the email fields.

Searching and viewing the results can be done using the Solr GUI or Hue’s search application.


Big Data University has free downloadable courses

Selecting the right hardware for a Hadoop cluster

I’m summarizing this article. For specifics (such as how to configure split machines across racks to better configure the network switches) see the article. None of this content is operating system or hardware vendor specific, but generally the discussions assume Linux.

Goal is to minimize data movement and process on the same machine that stores the data. Therefore each machine in the cluster needs appropriate CPU and disk. Problem is that when building the cluster the nature of the queries and the resulting bottlenecks may not yet be known. If a business is building its first Hadoop cluster, it may not yet fully understand the types of business problems that will eventually be solved by it. That’s in contrast to a business deploying it’s Nth Oracle server.

Types of bottlenecks:

  1. IO: reading from disk (or a network location)
    • indexing
    • data import/export
    • data transformation
  2. CPU: processing the Map query
    • clustering
    • text mining
    • natural language processing

Other issues, since a cluster could eventually scale to hundreds or thousands off machines

  1. Power
  2. Cooling

The Cloudera Manager can provide realtime statistics about how a currently running MapReduce job impacts the CPU, disk, and network load.

Roles of the components within a Hadoop cluster:

  1. Name Node (and Standby Name Node): coordinating data storage on the cluster
  2. Job Tracker: coordinating data processing
  3. Task Tracker
  4. Data Node

Data Node and Task Tracker

  • The vast majority of the machines in a cluster will only peform the roles of Data Node and Task Tracker, which should not be run on the same nodes as Name and Job.
  • Other components (such as HBase) should only be run on the Data Nodes if they operate on data. You want to keep data local as much as possible. HBase needs about 16 GB Heap to avoid garbarge collection timeouts. Impala will consume up to 80% of available RAM.
  • Assumed to be lower performance machines than the Name Node and Job Tracker

Name Node and Job Tracker

  • Standby Name Node should (obviously) not be on the same machine as the Name Node.
  • Name Node (and Standby Name Node) and Job Tracker should be enterprise class machines (redundant power supplies, enterprise class raid’ed disks)
  • Name Node should have RAM in proportion to number of data blocks in the cluster. 1GB RAM for every 1 million blocks in HDFS. With 100 Data Node cluster, 64 GB RAM is fine. Since the machine’s tasks will be disk intensive, you’ll want enough RAM to minimize virtual memory swapping to disk.
  • 4 – 6 TB of disk, not raid’ed (JBOD configuration)
  • 2 CPUs (at least quad code). Recommend more CPUs and/or cores as opposed to faster CPU speed, since in a large cluster the higher speed will draw more power and generate more heat, yet not scale as well as if there were simply more CPUs or better yet nodes.

Cloudera has defined four standard configurations

  1. Light Processing (I’m not sure what the use case is for this. Prototype? Sandbox?)
  2. Balanced Compute (recommeded for your 1st cluster, since it’s not likely you’ll properly identify which configuration is best suited for your use case)
  3. Storage Heavy
  4. Compute Heavy