Category Archives: Big Data

Machine learning with Mahout and Hadoop session

Tonight I attended a session about machine learning with Mahout at BNotions. The session was organized through the Toronto Hadoop User Group.
Quick Notes
  • BNotions uses Hadoop and Mahout for their Vu mobile app. Vu is a smart news reader that recommends articles based on article similarity to things you like as well as user similarity to you.
  •  Graph theory and graph processing algos are helpful for this work.
  •  Likes, dislikes, reads, skips are the most important input for their machine learning. Also relevant: user preference for breadth of topics vs depth; recency; natural language processing to extract topic keyword and organize topics by similarity.
  •  Redis is used for transient storage. It has some useful ops above just key-value. They use S3 as a data warehouse, but it could just as easily be HDFS.
  •  They use Amazon EMR as the Hadoop cluster. EMR constrains technology choice. For example, harder to use HDFS, hence Redis instead. They are evaluating HBase as an alternative — performance differences not relevant for use case.
  •  They don’t currently adjust for article length as factor in recommendations.
  •  They use a third party API for NLP, not Hadoop specidically. Only once per article, so not a bottleneck yet. Not happy with NLP quality, though.
  •  Cascalog/JCascalog to query the Hadoop data using Scala.
  •  Scalability is limited by cost, not capability. May switch from EMR to dedicated cluster,  etc as cost grows.
  •  Data science 10%, engineering 90%. Stock algos for rapid application development, tweak after. Deployment (my own specialty!) can be painful.
  •  Service-oriented architecture (SOA) helps with deployment. Simplifies components, but adds a devops layer. Jenkins is used to automate builds.

cpio: chown failed – Invalid argument

I ran into this error while installing the IBM Java RPM on Red Hat Enterprise Linux:

error: unpacking of archive failed on file /opt/ibm/java-x86_64-60: cpio: chown failed - Invalid argument

The issue is due to /opt/ibm being an NFS mount on the system. There are known issues with running the chown command on NFS 4. One workaround is to specify NFS protocol version 3:

In my case, the NFS mount was specified in /etc/fstab, so I modified the relevant line there to say:

New Features in Hadoop 2.0 summary

I live-tweeted yesterday’s New Features in Hadoop 2.0 session at the Toronto Hadoop User Group. I think it’s a pretty good summary. If nothing else, it helped me absorb the information.

Spoiler: The iPad raffle was won by yours truly.

Hadoop 2.0

  • At the Hadoop 2.0 session ‪#TorontoHUG
  • Got here just before they figured out how to get the projector to emerge. It’s shy and hides in the ceiling. ‪#ToHUG‬
  • Also, just after pizza got here. Maybe I shouldn’t have grabbed that emergency bagel. ‪#ToHUG‬
  • Hadoop 0.20 -> Hadoop 0.20.2 -> Hadoop 2.0 ‪#ToHUG‬
  • Hive, which is SQL on Hadoop w/ JDBC drivers, now has Binary, Timestamp data types and bitmap indexes ‪#ToHUG‬
  • Pig gets Javascript UDFs and can be embedded in JS or Python ‪#ToHUG‬
  • Cool, Sqoop getting an IBM DB2 connector is a bullet point in a Cloudera presentation ‪#ToHUG‬

MapReduce v2

  • MapReduce v2=YARN=Yet Another Resource Negotiator #ToHUG‬
  • MRv2 can support other processing frameworks. Eg graph processing, Santa Fe Institute simulations, etc ‪#ToHUG‬
  • MRv2 is not about old API/new API. Unrelated ‪#ToHUG‬
  • MRv2 good for research but not ready for production, even by startup standards ‪#ToHUG‬
  • MRv2 allows Hamster MPI on Hadoop, Hama bulk synchronous processing, Giraph graph processing — none are MapReduce ‪#ToHUG‬
  • In MRv1, JobTracker ran on Master, TaskTrackers ran on child nodes ‪#ToHUG‬
  • In MRv1, JobTracker managed resouurces, scheduled, monitored
  • Hadoop 2.0 can run either MRv1 or MRv2, but v2 not recommended for production ‪#ToHUG‬
  • MRv2 has 1 Resource Manager, many Node Managers instead ‪#ToHUG‬
  • But unlike JT, RM not a single point of failure because it now delegates App Managers to Node Managers per job ‪#ToHUG‬
  • Resource management still central, but job management now decentralized ‪#ToHUG‬
  • App Manager is like a library injected by RM into Node Managers ‪#ToHUG‬
  • RM can die with low risk of losing jobs ‪#ToHUG‬
  • App Master manages app lifecycle, negotiates resource containers with Resource Manager, monitors tasks on other nodes ‪#ToHUG‬
  • In principle, no issue with running MRv2 on either HDFS or GPFS ‪#ToHUG‬

HDFS Federation

  • The old Secondary NameNode is a terrible misnomer — not a backup NN ‪#ToHUG‬
  • NameNode keeps track of all the data on all the DataNodes ‪#ToHUG‬
  • HDFS Federation now allows for multiple NameNodes ‪#ToHUG‬
  • In federation, each NN manages a namespace volume ‪#ToHUG‬
  • HDFS Federation is not High Availability, is not Disaster Recovery ‪#ToHUG‬
  • NNs in Federation do not communicate ‪#ToHUG‬
  • Federation improves scalability, perf, isolation ‪#ToHUG‬
  • A fed namespace volume consists of metadata, block pool (corresponding to files)‪#ToHUG‬
  • All Data Nodes are used by all the federated NNs ‪#ToHUG‬

HDFS High Availability

  • NameNode High Availability is a new feature different from NN Fed ‪#ToHUG‬
  • Two NNs: one active, one standy. Standby takes over on failure. Fencing to prevent split brain by killing old one on takeover. ‪#ToHUG‬
  • Non-HA NN can fail via crash or planned maintenance. ‪#ToHUG‬
  • Clients and DNs only talk to active NN. Standy maintains a copy of active’s state. Purpose of NN unchanged. ‪#ToHUG‬
  • Active NN writes state to shared filesystem — NFS. ‪#ToHUG‬
  • HDFS fences to kill splitbrain via SSH or shell scripts. ‪#ToHUG‬
  • NFS is the new single point of failure — yay! But NFS has proven HA solutions as well. ‪#ToHUG‬
  • Failover can be auto or manual. Use hdfs haadmiin -failover nn01 nn02 command ‪#ToHUG‬
  • HA is in Hadoop 2.0 regardless of MRv1 or v2. ‪#ToHUG‬
  • NameNodes runs on your beefiest machine. Upwards of 16gb of ram typical. Limit is JVM memory management, Java garbage collector. ‪#ToHUG‬
  • OMGWTFBBQ I just won the iPad3 raffle ‪#ToHUG‬
  • The IBM guy got it. “I swear they don’t pay me very much”. Heh. ‪#embarrassed‬ ‪#woohoo‬ ‪#ToHUG‬


  • HBase is a distributed, versioned, column-oriented, denormalized database ‪#ToHUG‬
  • Horizontally scalable for fast random r/w. HBase is backend for FB Messages. Either source or sink for Hadoop jobs. ‪#ToHUG‬
  • HBase is good for locality when used with Hadoop because data is stored near where it’s processed. ‪#ToHUG‬
  • HB table consists of regions which consist of 1+ column family. Regions are the storage unit. Really good for sparse dbs. ‪#ToHUG‬
  • Sparse=lots of empty fields=columns mostly empty=varying numbers of columns. ‪#ToHUG‬
  • @ianhakes Ha! How about I drop the iPad off in your old office (aka mine)? ;-) ‪#ToHUG‬
  • 1 column family is stored as 1 HFile. ‪#ToHUG‬
  • HBase CRUD=Put Get Scan Delete ‪#ToHUG‬
  • Google BigTable uses com-google, com-google-images, com-ibm, etc as keys for efficient scantables. Similar to what you want in Hbase ‪#ToHUG‬
  • HMaster talks to HRegionServers contains HLog and HRegion contains MemStore contains HFile talks to DFS Client talks to DataNodes ‪#ToHUG‬
  • ZooKeeper(s) stores config, logs, determines HMaster for HMaster, clients ‪#ToHUG‬
  • HBase compaction=force write to disk. Relates to locality. ‪#ToHUG‬
  • If you’re smart, don’t integrate HBase and Hive. Hive is map-reduce SQL job, HBase is a database. Hive best for regular HDFS data ‪#ToHUG‬
  • HBase replication assumes column family exists in both clusters. No config required in child cluster. ‪#ToHUG‬
  • BTW, I’m subbing adjective “child” for adj “slave” as a stylistic preference. ‪#ToHUG‬
  • ZooKeeper quorum doesn’t have to be the same for HB replication. ZK quorum is an odd number of ZKs that agree on something. ‪#ToHUG‬
  • HBase Coprocessor Observers=database triggers ‪#ToHUG‬
  • HBase Co-pro Endpoints=stored procedures. In Java. Custom RPC protocol. Invoked by client on row or rowset. ‪#ToHUG‬
  • HBase security has auth per table, per column family, per column qualifier. Stored in_acl_ table. ‪#ToHUG‬

Chairing a Hadoop workshop at CASCON 2011

I’ll be chairing the Crunching Big Data in the Cloud with Hadoop and BigInsights workshop at CASCON 2011 in Toronto on Wednesday, November 9th. @BSteinfe and @MariusButuc will be joining me as co-chairs.

The workshop will be an all day hands-on introduction to Hadoop, HDFS, MapReduce, Hive, and JAQL. The plan is to have ready Hadoop clusters running in the cloud for the various exercises.

Hadoop is a parallelized data processing framework. It lends itself very nicely to running in cloud environments like Amazon EC2 and IBM SCE, as the core concept is to split sophisticated queries across clusters of commodity hardware. On a basic level it’s an implementation of MapReduce in Java, but a great many tools in its eco system make it easy to formulate and execute queries on the fly.

The material will have some things in common with the free Hadoop Fundamentals course you can take on Big Data University today, though naturally adapted for the CASCON themes and with added hands-on instruction.

Next steps

IBM’s Hadoop distribution

My work for the past couple years has been to develop DB2 images and templates for various cloud platforms and to engage the DB2 community online. This is still the case, but increasingly I’m spending my time working with IBM InfoSphere BigInsights.

BigInsights is IBM’s distribution of Hadoop.

What’s Hadoop? It’s a great way to crunch through massive amounts of unstructured data like email archives, geographic stuff, economic measurements, and so on to find interesting patterns. It rests on the Map-Reduce algorithm, which is what Google uses when you search. Much Google’s success rests on Map-Reduce’s ability to scale out on commodity hardware.

(Notably, the whole Cloud Computing thing is the flip side of using massive arrays of commodity hardware. Since you have so much of it, you need a way to automate and abstract the management as much as possible. Since you’ve automated and abstracted away management, you might as well sell it as a service.)

Hadoop itself is an Apache Software Foundation project nurtured by Yahoo among others. It’s gaining an increasing number of commercial distributions including Cloudera, IBM, and now Hortonworks.

You can quickly try out BigInsights Basic on the cloud or download it to your own machine.

I really do recommend the macro that