Normally we’d like to think of Hadoop running on hundreds of racks of commodity hardware, but that doesn’t mean that we should forget all of the reasons why we love virtualization.
This case study explains the how and why, and provides benchmarks from an experiment running Hadoop on VMware. Of course the experiment was successful; the study was published by VMware.
The moral of the story is that just because Hadoop can run on commodity hardware doesn’t mean that it has to, or that it’s the best way to deploy.
I haven’t tried this myself (I have an Arduino, not a Raspberry Pi), and even if the install succeeds I’m not sure what the runtime could accomplish, but this guy has published a short list of instructions for installing Hadoop on a Raspberry Pi.
Hadoop is generally assumed to run on clusters of generic commodity hardware. Intel has just released a customized/optimized distribution that it claims is up to 30x faster when run on the Xeon E7 v2 family of processors, which is hardly generic or commodity.
By definition, a SAN is about consolidating data and Hadoop is about distributing data. Can they co-exist? Not according to this article.
If you take data out of a Hadoop node and put it on a SAN, you’re reducing performance. You want data to transfer to the CPU at bus speed, not network speed. And maybe a heavy Hadoop load could saturate your network.
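To put the bus-versus-network point in rough numbers, here’s a back-of-the-envelope sketch. All of the throughput figures are my assumptions (typical spinning disks and a 10 GbE link), not from the article:

```python
# Illustrative comparison of local-disk vs. network bandwidth.
# All figures are assumed/typical, not measured.
DISKS_PER_NODE = 12     # assumed JBOD spindle count per worker node
DISK_MB_S = 100         # assumed sequential throughput per spindle, MB/s
NIC_MB_S = 10_000 // 8  # one 10 GbE link is roughly 1250 MB/s

# Aggregate bandwidth from the node's own disks, over its local bus
local_mb_s = DISKS_PER_NODE * DISK_MB_S

# A single node's local disks roughly match its entire 10 GbE link, so a
# rack of nodes all streaming blocks from a shared SAN would be fighting
# over far less network bandwidth than their local buses could deliver.
print(local_mb_s, NIC_MB_S)
```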
Nodes within a cluster do NOT need to be identical (CPUs, RAM, disk storage). The scheduler takes this into account. For a prototype, or the early release of an Agile system, it’s possible to just toss in whatever you have and then evolve to an environment that’s easier to support. But the key is that Hadoop couldn’t care less.
We generally think of hardware for Hadoop being generic and commodity, but HP claims to have specialized (fastest, pre-optimized, pre-tested) hardware for its own reference architecture. How can it be pre-optimized if they have no knowledge of the target dataset?
Posted in hardware
Good (and very short) article about configuring YARN, specifically allocating CPUs and RAM for the OS, JVMs, and MapReduce.
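As a concrete (hypothetical) example of that allocation: a worker with 64 GB of RAM and 16 cores might reserve roughly 8 GB and 2 cores for the OS and Hadoop daemons, leaving the rest for YARN containers in yarn-site.xml. The specific values here are assumptions for illustration:

```xml
<!-- Hypothetical figures for a 64 GB / 16-core worker -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>57344</value> <!-- 56 GB for containers; ~8 GB left for OS/daemons -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>14</value> <!-- 14 of 16 cores for containers -->
</property>
```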
Cisco has created two reference architectures for Cloudera’s CDH “high performance” and “high capacity” configurations. I suppose that there’s no need for a reference architecture for “light processing” or “balanced”.
I’m summarizing this article. For specifics (such as how to split machines across racks to better configure the network switches) see the article. None of this content is operating-system or hardware-vendor specific, but the discussion generally assumes Linux.
Goal is to minimize data movement by processing on the same machine that stores the data. Therefore each machine in the cluster needs appropriate CPU and disk. The problem is that when building the cluster, the nature of the queries and the resulting bottlenecks may not yet be known. If a business is building its first Hadoop cluster, it may not yet fully understand the types of business problems that will eventually be solved by it. That’s in contrast to a business deploying its Nth Oracle server.
Types of bottlenecks:
- IO: reading from disk (or a network location)
  - data import/export
  - data transformation
- CPU: processing the Map query
  - text mining
  - natural language processing
Other issues arise because a cluster could eventually scale to hundreds or thousands of machines.
The Cloudera Manager can provide realtime statistics about how a currently running MapReduce job impacts the CPU, disk, and network load.
Roles of the components within a Hadoop cluster:
- Name Node (and Standby Name Node): coordinating data storage on the cluster
- Job Tracker: coordinating data processing
- Task Tracker: running Map and Reduce tasks on individual nodes
- Data Node: storing HDFS blocks
Data Node and Task Tracker
- The vast majority of the machines in a cluster will only perform the roles of Data Node and Task Tracker, which should not be run on the same nodes as Name and Job.
- Other components (such as HBase) should only be run on the Data Nodes if they operate on data. You want to keep data local as much as possible. HBase needs about 16 GB of heap to avoid garbage collection timeouts. Impala will consume up to 80% of available RAM.
- These are assumed to be lower-performance machines than the Name Node and Job Tracker.
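Those memory figures translate into settings along these lines. This is a sketch, not a recipe: the heap value depends on the node, and the 70% Impala cap is an assumed choice to leave headroom for the DataNode daemon:

```shell
# hbase-env.sh: ~16 GB RegionServer heap to avoid GC timeouts (value in MB)
export HBASE_HEAPSIZE=16384

# impalad defaults to claiming ~80% of physical RAM; capping it lower
# (70% is an assumed value here) leaves room for co-located daemons:
# impalad -mem_limit=70% ...
```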
Name Node and Job Tracker
- Standby Name Node should (obviously) not be on the same machine as the Name Node.
- Name Node (and Standby Name Node) and Job Tracker should be enterprise-class machines (redundant power supplies, enterprise-class RAIDed disks).
- Name Node should have RAM in proportion to the number of data blocks in the cluster: 1 GB of RAM for every 1 million blocks in HDFS. For a 100 Data Node cluster, 64 GB of RAM is fine. Since the machine’s tasks will be disk intensive, you’ll want enough RAM to minimize virtual-memory swapping to disk.
- 4–6 TB of disk, not RAIDed (JBOD configuration)
- 2 CPUs (at least quad-core). Recommend more CPUs and/or cores as opposed to faster CPU speed, since in a large cluster the higher clock speed will draw more power and generate more heat, yet not scale as well as simply adding more CPUs or, better yet, nodes.
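The RAM rule of thumb above is easy to sanity-check. A minimal sketch, where the 2 GB OS/daemon overhead is my assumption on top of the article’s 1 GB per million blocks:

```python
def namenode_ram_gb(total_blocks, overhead_gb=2):
    # ~1 GB of Name Node RAM per 1 million HDFS blocks (rule of thumb
    # from the article), plus an assumed allowance for the OS and daemons.
    return total_blocks / 1_000_000 + overhead_gb

# A cluster holding 50 million blocks needs on the order of 52 GB,
# comfortably under the 64 GB suggested for a 100 Data Node cluster.
print(namenode_ram_gb(50_000_000))
```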
Cloudera has defined four standard configurations:
- Light Processing (I’m not sure what the use case is for this. Prototype? Sandbox?)
- Balanced Compute (recommended for your 1st cluster, since it’s not likely you’ll correctly identify which configuration best suits your use case)
- Storage Heavy
- Compute Heavy
Posted in cloudera, DataNode, hardware, HBase, Impala, JobTracker, Linux, MapReduce, NameNode, TaskTracking, tutorial, UNIX
HDFS is fault tolerant. Each file is broken up into blocks, and each block must be written to more than one server. The number of servers is configurable, but three is the common configuration. Just as with RAID, this provides fault tolerance and increases retrieval performance.
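The replication factor lives in hdfs-site.xml as a cluster-wide default (it can also be overridden per file). A sketch of the common three-copy setting:

```xml
<property>
  <name>dfs.replication</name>
  <value>3</value> <!-- each block is stored on three Data Nodes -->
</property>
```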
When a block is read, its checksum indicates whether the block is valid or corrupted. If corrupted, then depending on the scope of the corruption, the block may be rewritten, or the server may be taken out of the cluster and its blocks spread across other existing servers. If the cluster is running within an elastic cloud, then either the server is healed or a new server is added.
Unlike high-end SAN hardware, which is architected to avoid failure, HDFS assumes that its low-end equipment will fail, so self-healing is built into its operating model.