My team will be supporting the Spark hackathon in Boston on May 28-30.
This weekend’s a good chance to teach yourself Spark.
“Apache Spark is an open source processing engine built around speed, ease of use, and analytics. If you have large amounts of data that requires low latency processing that a typical Map Reduce program cannot provide, Spark is the alternative. Spark performs at speeds up to 100 times faster than Map Reduce for iterative algorithms or interactive data mining.”
Happy Memorial Day weekend!
My friends over at Big Data University just launched a refresh of their Apache Spark course. Spark is an engine for processing and mining large amounts of data quickly.
The course takes advantage of a Docker image for Spark. Docker provides a way to run Spark on a typical laptop without getting a beefy server.
Spark’s on my personal to-learn list, so I just pencilled in a slot to take the course myself on my calendar.
I ran into this error while installing the IBM Java RPM on Red Hat Enterprise Linux:
error: unpacking of archive failed on file /opt/ibm/java-x86_64-60: cpio: chown failed - Invalid argument
The issue is due to /opt/ibm being an NFS mount on the system. There are known issues with running the chown command on NFS 4. One workaround is to specify NFS protocol version 3:
mount -t nfs -o vers=3 127.0.0.1:/share /mnt
In my case, the NFS mount was specified in /etc/fstab, so I modified the relevant line there to say:
127.0.0.1:/opt/ibm /opt/ibm nfs rw,vers=3 0 0
I live-tweeted yesterday’s New Features in Hadoop 2.0 session at the Toronto Hadoop User Group. I think it’s a pretty good summary. If nothing else, it helped me absorb the information.
Spoiler: The iPad raffle was won by yours truly.
- At the Hadoop 2.0 session â€ª#TorontoHUG
- Got here just before they figured out how to get the projector to emerge. It’s shy and hides in the ceiling. â€ª#ToHUGâ€¬
- Also, just after pizza got here. Maybe I shouldn’t have grabbed that emergency bagel. â€ª#ToHUGâ€¬
- Hadoop 0.20 -> Hadoop 0.20.2 -> Hadoop 2.0 â€ª#ToHUGâ€¬
- Hive, which is SQL on Hadoop w/ JDBC drivers, now has Binary, Timestamp data types and bitmap indexes â€ª#ToHUGâ€¬
- Cool, Sqoop getting an IBM DB2 connector is a bullet point in a Cloudera presentation â€ª#ToHUGâ€¬
- MapReduce v2=YARN=Yet Another Resource Negotiator #ToHUGâ€¬
- MRv2 can support other processing frameworks. Eg graph processing, Santa Fe Institute simulations, etc â€ª#ToHUGâ€¬
- MRv2 is not about old API/new API. Unrelated â€ª#ToHUGâ€¬
- MRv2 good for research but not ready for production, even by startup standards â€ª#ToHUGâ€¬
- MRv2 allows Hamster MPI on Hadoop, Hama bulk synchronous processing, Giraph graph processing — none are MapReduce â€ª#ToHUGâ€¬
- In MRv1, JobTracker ran on Master, TaskTrackers ran on child nodes â€ª#ToHUGâ€¬
- In MRv1, JobTracker managed resouurces, scheduled, monitored
- Hadoop 2.0 can run either MRv1 or MRv2, but v2 not recommended for production â€ª#ToHUGâ€¬
- MRv2 has 1 Resource Manager, many Node Managers instead â€ª#ToHUGâ€¬
- But unlike JT, RM not a single point of failure because it now delegates App Managers to Node Managers per job â€ª#ToHUGâ€¬
- Resource management still central, but job management now decentralized â€ª#ToHUGâ€¬
- App Manager is like a library injected by RM into Node Managers â€ª#ToHUGâ€¬
- RM can die with low risk of losing jobs â€ª#ToHUGâ€¬
- App Master manages app lifecycle, negotiates resource containers with Resource Manager, monitors tasks on other nodes â€ª#ToHUGâ€¬
- In principle, no issue with running MRv2 on either HDFS or GPFS â€ª#ToHUGâ€¬
- The old Secondary NameNode is a terrible misnomer — not a backup NN â€ª#ToHUGâ€¬
- NameNode keeps track of all the data on all the DataNodes â€ª#ToHUGâ€¬
- HDFS Federation now allows for multiple NameNodes â€ª#ToHUGâ€¬
- In federation, each NN manages a namespace volume â€ª#ToHUGâ€¬
- HDFS Federation is not High Availability, is not Disaster Recovery â€ª#ToHUGâ€¬
- NNs in Federation do not communicate â€ª#ToHUGâ€¬
- Federation improves scalability, perf, isolation â€ª#ToHUGâ€¬
- A fed namespace volume consists of metadata, block pool (corresponding to files)â€ª#ToHUGâ€¬
- All Data Nodes are used by all the federated NNs â€ª#ToHUGâ€¬
HDFS High Availability
- NameNode High Availability is a new feature different from NN Fed â€ª#ToHUGâ€¬
- Two NNs: one active, one standy. Standby takes over on failure. Fencing to prevent split brain by killing old one on takeover. â€ª#ToHUGâ€¬
- Non-HA NN can fail via crash or planned maintenance. â€ª#ToHUGâ€¬
- Clients and DNs only talk to active NN. Standy maintains a copy of active’s state. Purpose of NN unchanged. â€ª#ToHUGâ€¬
- Active NN writes state to shared filesystem — NFS. â€ª#ToHUGâ€¬
- HDFS fences to kill splitbrain via SSH or shell scripts. â€ª#ToHUGâ€¬
- NFS is the new single point of failure — yay! But NFS has proven HA solutions as well. â€ª#ToHUGâ€¬
- Failover can be auto or manual. Use hdfs haadmiin -failover nn01 nn02 command â€ª#ToHUGâ€¬
- HA is in Hadoop 2.0 regardless of MRv1 or v2. â€ª#ToHUGâ€¬
- NameNodes runs on your beefiest machine. Upwards of 16gb of ram typical. Limit is JVM memory management, Java garbage collector. â€ª#ToHUGâ€¬
- OMGWTFBBQ I just won the iPad3 raffle â€ª#ToHUGâ€¬
- The IBM guy got it. “I swear they don’t pay me very much”. Heh. â€ª#embarrassedâ€¬ â€ª#woohooâ€¬ â€ª#ToHUGâ€¬
- HBase is a distributed, versioned, column-oriented, denormalized database â€ª#ToHUGâ€¬
- Horizontally scalable for fast random r/w. HBase is backend for FB Messages. Either source or sink for Hadoop jobs. â€ª#ToHUGâ€¬
- HBase is good for locality when used with Hadoop because data is stored near where it’s processed. â€ª#ToHUGâ€¬
- HB table consists of regions which consist of 1+ column family. Regions are the storage unit. Really good for sparse dbs. â€ª#ToHUGâ€¬
- Sparse=lots of empty fields=columns mostly empty=varying numbers of columns. â€ª#ToHUGâ€¬
- @ianhakes Ha! How about I drop the iPad off in your old office (aka mine)? ;-) â€ª#ToHUGâ€¬
- 1 column family is stored as 1 HFile. â€ª#ToHUGâ€¬
- HBase CRUD=Put Get Scan Delete â€ª#ToHUGâ€¬
- Google BigTable uses com-google, com-google-images, com-ibm, etc as keys for efficient scantables. Similar to what you want in Hbase â€ª#ToHUGâ€¬
- HMaster talks to HRegionServers contains HLog and HRegion contains MemStore contains HFile talks to DFS Client talks to DataNodes â€ª#ToHUGâ€¬
- ZooKeeper(s) stores config, logs, determines HMaster for HMaster, clients â€ª#ToHUGâ€¬
- HBase compaction=force write to disk. Relates to locality. â€ª#ToHUGâ€¬
- If you’re smart, don’t integrate HBase and Hive. Hive is map-reduce SQL job, HBase is a database. Hive best for regular HDFS data â€ª#ToHUGâ€¬
- HBase replication assumes column family exists in both clusters. No config required in child cluster. â€ª#ToHUGâ€¬
- BTW, I’m subbing adjective “child” for adj “slave” as a stylistic preference. â€ª#ToHUGâ€¬
- ZooKeeper quorum doesn’t have to be the same for HB replication. ZK quorum is an odd number of ZKs that agree on something. â€ª#ToHUGâ€¬
- HBase Coprocessor Observers=database triggers â€ª#ToHUGâ€¬
- HBase Co-pro Endpoints=stored procedures. In Java. Custom RPC protocol. Invoked by client on row or rowset. â€ª#ToHUGâ€¬
- HBase security has auth per table, per column family, per column qualifier. Stored in_acl_ table. â€ª#ToHUGâ€¬
I’ll be attending the New Features in Hadoop 2.0 – HA, FNN, HBase Coprocessors info session tonight. The session is being organized by the Toronto Hadoop User Group. It’s at 7pm near King East and Parliament in Toronto.
Feel free to say hello if you see me there.
I’ll be chairing theÂ Crunching Big Data in the Cloud with Hadoop and BigInsightsÂ workshop at CASCON 2011 in Toronto on Wednesday, November 9th. @BSteinfe and @MariusButuc will be joining me as co-chairs.
The workshop will be an all day hands-on introduction to Hadoop, HDFS, MapReduce, Hive, and JAQL. The plan is to have ready Hadoop clusters running in the cloud for the various exercises.
Hadoop is a parallelized data processing framework. It lends itself very nicely to running in cloud environments like Amazon EC2 and IBM SCE, as the core concept is to split sophisticated queries across clusters of commodity hardware. On a basic level it’s an implementation of MapReduce in Java, but a great many tools in its eco system make it easy to formulate and execute queries on the fly.
The material will have some things in common with the free Hadoop Fundamentals course you can take onÂ Big Data University today, though naturally adapted for the CASCON themes and with added hands-on instruction.