I live-tweeted yesterday’s New Features in Hadoop 2.0 session at the Toronto Hadoop User Group. I think it’s a pretty good summary. If nothing else, it helped me absorb the information.
Spoiler: The iPad raffle was won by yours truly.
Hadoop 2.0
- At the Hadoop 2.0 session ‪#TorontoHUG
- Got here just before they figured out how to get the projector to emerge. It’s shy and hides in the ceiling. ‪#ToHUG‬
- Also, just after pizza got here. Maybe I shouldn’t have grabbed that emergency bagel. ‪#ToHUG‬
- Hadoop 0.20 -> Hadoop 0.20.2 -> Hadoop 2.0 ‪#ToHUG‬
- Hive, which is SQL on Hadoop w/ JDBC drivers, now has Binary, Timestamp data types and bitmap indexes ‪#ToHUG‬
- Pig gets Javascript UDFs and can be embedded in JS or Python ‪#ToHUG‬
- Cool, Sqoop getting an IBM DB2 connector is a bullet point in a Cloudera presentation ‪#ToHUG‬
MapReduce v2
- MapReduce v2=YARN=Yet Another Resource Negotiator #ToHUG‬
- MRv2 can support other processing frameworks. Eg graph processing, Santa Fe Institute simulations, etc ‪#ToHUG‬
- MRv2 is not about old API/new API. Unrelated ‪#ToHUG‬
- MRv2 good for research but not ready for production, even by startup standards ‪#ToHUG‬
- MRv2 allows Hamster MPI on Hadoop, Hama bulk synchronous processing, Giraph graph processing — none are MapReduce ‪#ToHUG‬
- In MRv1, JobTracker ran on Master, TaskTrackers ran on child nodes ‪#ToHUG‬
- In MRv1, JobTracker managed resouurces, scheduled, monitored
- Hadoop 2.0 can run either MRv1 or MRv2, but v2 not recommended for production ‪#ToHUG‬
- MRv2 has 1 Resource Manager, many Node Managers instead ‪#ToHUG‬
- But unlike JT, RM not a single point of failure because it now delegates App Managers to Node Managers per job ‪#ToHUG‬
- Resource management still central, but job management now decentralized ‪#ToHUG‬
- App Manager is like a library injected by RM into Node Managers ‪#ToHUG‬
- RM can die with low risk of losing jobs ‪#ToHUG‬
- App Master manages app lifecycle, negotiates resource containers with Resource Manager, monitors tasks on other nodes ‪#ToHUG‬
- In principle, no issue with running MRv2 on either HDFS or GPFS ‪#ToHUG‬
HDFS Federation
- The old Secondary NameNode is a terrible misnomer — not a backup NN ‪#ToHUG‬
- NameNode keeps track of all the data on all the DataNodes ‪#ToHUG‬
- HDFS Federation now allows for multiple NameNodes ‪#ToHUG‬
- In federation, each NN manages a namespace volume ‪#ToHUG‬
- HDFS Federation is not High Availability, is not Disaster Recovery ‪#ToHUG‬
- NNs in Federation do not communicate ‪#ToHUG‬
- Federation improves scalability, perf, isolation ‪#ToHUG‬
- A fed namespace volume consists of metadata, block pool (corresponding to files)‪#ToHUG‬
- All Data Nodes are used by all the federated NNs ‪#ToHUG‬
HDFS High Availability
- NameNode High Availability is a new feature different from NN Fed ‪#ToHUG‬
- Two NNs: one active, one standy. Standby takes over on failure. Fencing to prevent split brain by killing old one on takeover. ‪#ToHUG‬
- Non-HA NN can fail via crash or planned maintenance. ‪#ToHUG‬
- Clients and DNs only talk to active NN. Standy maintains a copy of active’s state. Purpose of NN unchanged. ‪#ToHUG‬
- Active NN writes state to shared filesystem — NFS. ‪#ToHUG‬
- HDFS fences to kill splitbrain via SSH or shell scripts. ‪#ToHUG‬
- NFS is the new single point of failure — yay! But NFS has proven HA solutions as well. ‪#ToHUG‬
- Failover can be auto or manual. Use hdfs haadmiin -failover nn01 nn02 command ‪#ToHUG‬
- HA is in Hadoop 2.0 regardless of MRv1 or v2. ‪#ToHUG‬
- NameNodes runs on your beefiest machine. Upwards of 16gb of ram typical. Limit is JVM memory management, Java garbage collector. ‪#ToHUG‬
- OMGWTFBBQ I just won the iPad3 raffle ‪#ToHUG‬
- The IBM guy got it. “I swear they don’t pay me very much”. Heh. ‪#embarrassed‬ ‪#woohoo‬ ‪#ToHUG‬
HBase
- HBase is a distributed, versioned, column-oriented, denormalized database ‪#ToHUG‬
- Horizontally scalable for fast random r/w. HBase is backend for FB Messages. Either source or sink for Hadoop jobs. ‪#ToHUG‬
- HBase is good for locality when used with Hadoop because data is stored near where it’s processed. ‪#ToHUG‬
- HB table consists of regions which consist of 1+ column family. Regions are the storage unit. Really good for sparse dbs. ‪#ToHUG‬
- Sparse=lots of empty fields=columns mostly empty=varying numbers of columns. ‪#ToHUG‬
- @ianhakes Ha! How about I drop the iPad off in your old office (aka mine)? 😉 ‪#ToHUG‬
- 1 column family is stored as 1 HFile. ‪#ToHUG‬
- HBase CRUD=Put Get Scan Delete ‪#ToHUG‬
- Google BigTable uses com-google, com-google-images, com-ibm, etc as keys for efficient scantables. Similar to what you want in Hbase ‪#ToHUG‬
- HMaster talks to HRegionServers contains HLog and HRegion contains MemStore contains HFile talks to DFS Client talks to DataNodes ‪#ToHUG‬
- ZooKeeper(s) stores config, logs, determines HMaster for HMaster, clients ‪#ToHUG‬
- HBase compaction=force write to disk. Relates to locality. ‪#ToHUG‬
- If you’re smart, don’t integrate HBase and Hive. Hive is map-reduce SQL job, HBase is a database. Hive best for regular HDFS data ‪#ToHUG‬
- HBase replication assumes column family exists in both clusters. No config required in child cluster. ‪#ToHUG‬
- BTW, I’m subbing adjective “child” for adj “slave” as a stylistic preference. ‪#ToHUG‬
- ZooKeeper quorum doesn’t have to be the same for HB replication. ZK quorum is an odd number of ZKs that agree on something. ‪#ToHUG‬
- HBase Coprocessor Observers=database triggers ‪#ToHUG‬
- HBase Co-pro Endpoints=stored procedures. In Java. Custom RPC protocol. Invoked by client on row or rowset. ‪#ToHUG‬
- HBase security has auth per table, per column family, per column qualifier. Stored in_acl_ table. ‪#ToHUG‬
One thought on “New Features in Hadoop 2.0 summary”