eBay worked with HortonWorks and ScaledRisk to improve Mean Time to Recovery (MTTR). Not only did this require faster recovery time, but also faster detection of failures.
The types of failures considered included the following, but only Node/Region server failures were included in the tests. The HBase tables contained 900 million rows.
- Node/Region server failed while writing
- Node/Region server failed while reading
- Rack failure
- Whole cluster failure
- Machine reboot (due to CPU temperature)
- NIC speed steps down to 100Mb/s from gigabit speeds
The tests had favorable results, with improvements submitted (some implemented, some proposed) into Apache HBase and HDFS.