New Features in Hadoop 2.0 summary

I live-tweeted yesterday’s New Features in Hadoop 2.0 session at the Toronto Hadoop User Group. I think it’s a pretty good summary. If nothing else, it helped me absorb the information.

Spoiler: The iPad raffle was won by yours truly.

Hadoop 2.0

At the Hadoop 2.0 session â€ª#TorontoHUG
Got here just before they figured out how to get the projector to emerge. It’s shy and hides in the ceiling. â€ª#ToHUGâ€¬
Also, just after pizza got here. Maybe I shouldn’t have grabbed that emergency bagel. â€ª#ToHUGâ€¬
Hadoop 0.20 -> Hadoop 0.20.2 -> Hadoop 2.0 â€ª#ToHUGâ€¬
Hive, which is SQL on Hadoop w/ JDBC drivers, now has Binary, Timestamp data types and bitmap indexes â€ª#ToHUGâ€¬
Pig gets Javascript UDFs and can be embedded in JS or Python â€ª#ToHUGâ€¬
Cool, Sqoop getting an IBM DB2 connector is a bullet point in a Cloudera presentation â€ª#ToHUGâ€¬

MapReduce v2

MapReduce v2=YARN=Yet Another Resource Negotiator #ToHUGâ€¬
MRv2 can support other processing frameworks. Eg graph processing, Santa Fe Institute simulations, etc â€ª#ToHUGâ€¬
MRv2 is not about old API/new API. Unrelated â€ª#ToHUGâ€¬
MRv2 good for research but not ready for production, even by startup standards â€ª#ToHUGâ€¬
MRv2 allows Hamster MPI on Hadoop, Hama bulk synchronous processing, Giraph graph processing — none are MapReduce â€ª#ToHUGâ€¬
In MRv1, JobTracker ran on Master, TaskTrackers ran on child nodes â€ª#ToHUGâ€¬
In MRv1, JobTracker managed resouurces, scheduled, monitored
Hadoop 2.0 can run either MRv1 or MRv2, but v2 not recommended for production â€ª#ToHUGâ€¬
MRv2 has 1 Resource Manager, many Node Managers instead â€ª#ToHUGâ€¬
But unlike JT, RM not a single point of failure because it now delegates App Managers to Node Managers per job â€ª#ToHUGâ€¬
Resource management still central, but job management now decentralized â€ª#ToHUGâ€¬
App Manager is like a library injected by RM into Node Managers â€ª#ToHUGâ€¬
RM can die with low risk of losing jobs â€ª#ToHUGâ€¬
App Master manages app lifecycle, negotiates resource containers with Resource Manager, monitors tasks on other nodes â€ª#ToHUGâ€¬
In principle, no issue with running MRv2 on either HDFS or GPFS â€ª#ToHUGâ€¬

HDFS Federation

The old Secondary NameNode is a terrible misnomer — not a backup NN â€ª#ToHUGâ€¬
NameNode keeps track of all the data on all the DataNodes â€ª#ToHUGâ€¬
HDFS Federation now allows for multiple NameNodes â€ª#ToHUGâ€¬
In federation, each NN manages a namespace volume â€ª#ToHUGâ€¬
HDFS Federation is not High Availability, is not Disaster Recovery â€ª#ToHUGâ€¬
NNs in Federation do not communicate â€ª#ToHUGâ€¬
Federation improves scalability, perf, isolation â€ª#ToHUGâ€¬
A fed namespace volume consists of metadata, block pool (corresponding to files)â€ª#ToHUGâ€¬
All Data Nodes are used by all the federated NNs â€ª#ToHUGâ€¬

HDFS High Availability

NameNode High Availability is a new feature different from NN Fed â€ª#ToHUGâ€¬
Two NNs: one active, one standy. Standby takes over on failure. Fencing to prevent split brain by killing old one on takeover. â€ª#ToHUGâ€¬
Non-HA NN can fail via crash or planned maintenance. â€ª#ToHUGâ€¬
Clients and DNs only talk to active NN. Standy maintains a copy of active’s state. Purpose of NN unchanged. â€ª#ToHUGâ€¬
Active NN writes state to shared filesystem — NFS. â€ª#ToHUGâ€¬
HDFS fences to kill splitbrain via SSH or shell scripts. â€ª#ToHUGâ€¬
NFS is the new single point of failure — yay! But NFS has proven HA solutions as well. â€ª#ToHUGâ€¬
Failover can be auto or manual. Use hdfs haadmiin -failover nn01 nn02 command â€ª#ToHUGâ€¬
HA is in Hadoop 2.0 regardless of MRv1 or v2. â€ª#ToHUGâ€¬
NameNodes runs on your beefiest machine. Upwards of 16gb of ram typical. Limit is JVM memory management, Java garbage collector. â€ª#ToHUGâ€¬
OMGWTFBBQ I just won the iPad3 raffle â€ª#ToHUGâ€¬
The IBM guy got it. “I swear they don’t pay me very much”. Heh. â€ª#embarrassedâ€¬ â€ª#woohooâ€¬ â€ª#ToHUGâ€¬

HBase

HBase is a distributed, versioned, column-oriented, denormalized database â€ª#ToHUGâ€¬
Horizontally scalable for fast random r/w. HBase is backend for FB Messages. Either source or sink for Hadoop jobs. â€ª#ToHUGâ€¬
HBase is good for locality when used with Hadoop because data is stored near where it’s processed. â€ª#ToHUGâ€¬
HB table consists of regions which consist of 1+ column family. Regions are the storage unit. Really good for sparse dbs. â€ª#ToHUGâ€¬
Sparse=lots of empty fields=columns mostly empty=varying numbers of columns. â€ª#ToHUGâ€¬
@ianhakes Ha! How about I drop the iPad off in your old office (aka mine)? 😉 â€ª#ToHUGâ€¬
1 column family is stored as 1 HFile. â€ª#ToHUGâ€¬
HBase CRUD=Put Get Scan Delete â€ª#ToHUGâ€¬
Google BigTable uses com-google, com-google-images, com-ibm, etc as keys for efficient scantables. Similar to what you want in Hbase â€ª#ToHUGâ€¬
HMaster talks to HRegionServers contains HLog and HRegion contains MemStore contains HFile talks to DFS Client talks to DataNodes â€ª#ToHUGâ€¬
ZooKeeper(s) stores config, logs, determines HMaster for HMaster, clients â€ª#ToHUGâ€¬
HBase compaction=force write to disk. Relates to locality. â€ª#ToHUGâ€¬
If you’re smart, don’t integrate HBase and Hive. Hive is map-reduce SQL job, HBase is a database. Hive best for regular HDFS data â€ª#ToHUGâ€¬
HBase replication assumes column family exists in both clusters. No config required in child cluster. â€ª#ToHUGâ€¬
BTW, I’m subbing adjective “child” for adj “slave” as a stylistic preference. â€ª#ToHUGâ€¬
ZooKeeper quorum doesn’t have to be the same for HB replication. ZK quorum is an odd number of ZKs that agree on something. â€ª#ToHUGâ€¬
HBase Coprocessor Observers=database triggers â€ª#ToHUGâ€¬
HBase Co-pro Endpoints=stored procedures. In Java. Custom RPC protocol. Invoked by client on row or rowset. â€ª#ToHUGâ€¬
HBase security has auth per table, per column family, per column qualifier. Stored in_acl_ table. â€ª#ToHUGâ€¬

Comments

One response to “New Features in Hadoop 2.0 summary”

July 18, 2012

New Features in Hadoop 2.0 summary « Big Data Analytics

[…] on lpetr.org ì´ê²ƒì´ ì¢‹ì•„ìš”:ì¢‹ì•„í•˜ê¸°Be the first to like […]

New Features in Hadoop 2.0 summary

Comments

One response to “New Features in Hadoop 2.0 summary”

Leave a Reply Cancel reply

More posts

Migrations

Joining Upgrade

Join or union ranges from multiple tabs in Google Sheets

@here considered harmful