Intro to #datascience and #spark at #ibminsight 2015

I’ll be teaching two hands-on labs at Insight 2015 in Las Vegas:

LCD-3459 Introduction to Data Science
Data science is a very popular job profile and in great demand in a wide variety of industries. You no longer need a Ph.D. in mathematics or statistics to become a data scientist. Any data professional can upgrade their skills and study data science. This lab introduces the basic concepts of data science and provides hands-on examples to help you apply these concepts.

Update: Do the Intro to Data Science Lab on Big Data University

LCD-3479 Fundamentals of Spark
Spark is one of the most important technologies for big data analytics and Spark skills are in great demand. This lab session introduces Spark fundamentals and applies the concepts using hands-on examples with a Spark cluster in cloud. You can also download a Docker image to your own laptop and run the lab projects there.

Update: Do the Fundamentals of Spark Lab on Big Data University

If you can’t make it, you can teach yourself data science at your own pace at

$10k! Spark hackathon in San Fran

I’ll be in San Francisco this weekend helping run the Apache Spark hackathon, and afterwards I’ll be at Spark Summit 2015.

If you’re curious at all about Spark, you should come out and hack with us. We’ll have some fun data sets and help you find a team.

You can take the free Spark Fundamentals course on Big Data University to brush up on your Spark skills. Spark is a framework for fast in-memory and batch analytics processing. It’s algorithmically smarter and so a lot faster than traditional Hadoop.

There’s $10k in prizes at the hackathon.

$7,500! Spark hackathon in Boston May 28-30

My team will be supporting the Spark hackathon in Boston on May 28-30.

This weekend’s a good chance to teach yourself Spark.

“Apache Spark is an open source processing engine built around speed, ease of use, and analytics. If you have large amounts of data that requires low latency processing that a typical Map Reduce program cannot provide, Spark is the alternative. Spark performs at speeds up to 100 times faster than Map Reduce for iterative algorithms or interactive data mining.”

Happy Memorial Day weekend!

Free Apache Spark course with Docker image

My friends over at Big Data University just launched a refresh of their Apache Spark course. Spark is an engine for processing and mining large amounts of data quickly.

The course takes advantage of a Docker image for Spark. Docker provides a way to run Spark on a typical laptop without getting a beefy server.

Spark’s on my personal to-learn list, so I just pencilled in a slot to take the course myself on my calendar.

Machine learning with Mahout and Hadoop session

Tonight I attended a session about machine learning with Mahout at BNotions. The session was organized through the Toronto Hadoop User Group.
Quick Notes
  • BNotions uses Hadoop and Mahout for their Vu mobile app. Vu is a smart news reader that recommends articles based on article similarity to things you like as well as user similarity to you.
  •  Graph theory and graph processing algos are helpful for this work.
  •  Likes, dislikes, reads, skips are the most important input for their machine learning. Also relevant: user preference for breadth of topics vs depth; recency; natural language processing to extract topic keyword and organize topics by similarity.
  •  Redis is used for transient storage. It has some useful ops above just key-value. They use S3 as a data warehouse, but it could just as easily be HDFS.
  •  They use Amazon EMR as the Hadoop cluster. EMR constrains technology choice. For example, harder to use HDFS, hence Redis instead. They are evaluating HBase as an alternative — performance differences not relevant for use case.
  •  They don’t currently adjust for article length as factor in recommendations.
  •  They use a third party API for NLP, not Hadoop specidically. Only once per article, so not a bottleneck yet. Not happy with NLP quality, though.
  •  Cascalog/JCascalog to query the Hadoop data using Scala.
  •  Scalability is limited by cost, not capability. May switch from EMR to dedicated cluster,  etc as cost grows.
  •  Data science 10%, engineering 90%. Stock algos for rapid application development, tweak after. Deployment (my own specialty!) can be painful.
  •  Service-oriented architecture (SOA) helps with deployment. Simplifies components, but adds a devops layer. Jenkins is used to automate builds.

cpio: chown failed – Invalid argument

I ran into this error while installing the IBM Java RPM on Red Hat Enterprise Linux:

error: unpacking of archive failed on file /opt/ibm/java-x86_64-60: cpio: chown failed - Invalid argument

The issue is due to /opt/ibm being an NFS mount on the system. There are known issues with running the chown command on NFS 4. One workaround is to specify NFS protocol version 3:

mount -t nfs -o vers=3 /mnt

In my case, the NFS mount was specified in /etc/fstab, so I modified the relevant line there to say: /opt/ibm nfs rw,vers=3 0 0

New Features in Hadoop 2.0 summary

I live-tweeted yesterday’s New Features in Hadoop 2.0 session at the Toronto Hadoop User Group. I think it’s a pretty good summary. If nothing else, it helped me absorb the information.

Spoiler: The iPad raffle was won by yours truly.

Hadoop 2.0

  • At the Hadoop 2.0 session ‪#TorontoHUG
  • Got here just before they figured out how to get the projector to emerge. It’s shy and hides in the ceiling. ‪#ToHUG‬
  • Also, just after pizza got here. Maybe I shouldn’t have grabbed that emergency bagel. ‪#ToHUG‬
  • Hadoop 0.20 -> Hadoop 0.20.2 -> Hadoop 2.0 ‪#ToHUG‬
  • Hive, which is SQL on Hadoop w/ JDBC drivers, now has Binary, Timestamp data types and bitmap indexes ‪#ToHUG‬
  • Pig gets Javascript UDFs and can be embedded in JS or Python ‪#ToHUG‬
  • Cool, Sqoop getting an IBM DB2 connector is a bullet point in a Cloudera presentation ‪#ToHUG‬

MapReduce v2

  • MapReduce v2=YARN=Yet Another Resource Negotiator #ToHUG‬
  • MRv2 can support other processing frameworks. Eg graph processing, Santa Fe Institute simulations, etc ‪#ToHUG‬
  • MRv2 is not about old API/new API. Unrelated ‪#ToHUG‬
  • MRv2 good for research but not ready for production, even by startup standards ‪#ToHUG‬
  • MRv2 allows Hamster MPI on Hadoop, Hama bulk synchronous processing, Giraph graph processing — none are MapReduce ‪#ToHUG‬
  • In MRv1, JobTracker ran on Master, TaskTrackers ran on child nodes ‪#ToHUG‬
  • In MRv1, JobTracker managed resouurces, scheduled, monitored
  • Hadoop 2.0 can run either MRv1 or MRv2, but v2 not recommended for production ‪#ToHUG‬
  • MRv2 has 1 Resource Manager, many Node Managers instead ‪#ToHUG‬
  • But unlike JT, RM not a single point of failure because it now delegates App Managers to Node Managers per job ‪#ToHUG‬
  • Resource management still central, but job management now decentralized ‪#ToHUG‬
  • App Manager is like a library injected by RM into Node Managers ‪#ToHUG‬
  • RM can die with low risk of losing jobs ‪#ToHUG‬
  • App Master manages app lifecycle, negotiates resource containers with Resource Manager, monitors tasks on other nodes ‪#ToHUG‬
  • In principle, no issue with running MRv2 on either HDFS or GPFS ‪#ToHUG‬

HDFS Federation

  • The old Secondary NameNode is a terrible misnomer — not a backup NN ‪#ToHUG‬
  • NameNode keeps track of all the data on all the DataNodes ‪#ToHUG‬
  • HDFS Federation now allows for multiple NameNodes ‪#ToHUG‬
  • In federation, each NN manages a namespace volume ‪#ToHUG‬
  • HDFS Federation is not High Availability, is not Disaster Recovery ‪#ToHUG‬
  • NNs in Federation do not communicate ‪#ToHUG‬
  • Federation improves scalability, perf, isolation ‪#ToHUG‬
  • A fed namespace volume consists of metadata, block pool (corresponding to files)‪#ToHUG‬
  • All Data Nodes are used by all the federated NNs ‪#ToHUG‬

HDFS High Availability

  • NameNode High Availability is a new feature different from NN Fed ‪#ToHUG‬
  • Two NNs: one active, one standy. Standby takes over on failure. Fencing to prevent split brain by killing old one on takeover. ‪#ToHUG‬
  • Non-HA NN can fail via crash or planned maintenance. ‪#ToHUG‬
  • Clients and DNs only talk to active NN. Standy maintains a copy of active’s state. Purpose of NN unchanged. ‪#ToHUG‬
  • Active NN writes state to shared filesystem — NFS. ‪#ToHUG‬
  • HDFS fences to kill splitbrain via SSH or shell scripts. ‪#ToHUG‬
  • NFS is the new single point of failure — yay! But NFS has proven HA solutions as well. ‪#ToHUG‬
  • Failover can be auto or manual. Use hdfs haadmiin -failover nn01 nn02 command ‪#ToHUG‬
  • HA is in Hadoop 2.0 regardless of MRv1 or v2. ‪#ToHUG‬
  • NameNodes runs on your beefiest machine. Upwards of 16gb of ram typical. Limit is JVM memory management, Java garbage collector. ‪#ToHUG‬
  • OMGWTFBBQ I just won the iPad3 raffle ‪#ToHUG‬
  • The IBM guy got it. “I swear they don’t pay me very much”. Heh. ‪#embarrassed‬ ‪#woohoo‬ ‪#ToHUG‬


  • HBase is a distributed, versioned, column-oriented, denormalized database ‪#ToHUG‬
  • Horizontally scalable for fast random r/w. HBase is backend for FB Messages. Either source or sink for Hadoop jobs. ‪#ToHUG‬
  • HBase is good for locality when used with Hadoop because data is stored near where it’s processed. ‪#ToHUG‬
  • HB table consists of regions which consist of 1+ column family. Regions are the storage unit. Really good for sparse dbs. ‪#ToHUG‬
  • Sparse=lots of empty fields=columns mostly empty=varying numbers of columns. ‪#ToHUG‬
  • @ianhakes Ha! How about I drop the iPad off in your old office (aka mine)? ;-) ‪#ToHUG‬
  • 1 column family is stored as 1 HFile. ‪#ToHUG‬
  • HBase CRUD=Put Get Scan Delete ‪#ToHUG‬
  • Google BigTable uses com-google, com-google-images, com-ibm, etc as keys for efficient scantables. Similar to what you want in Hbase ‪#ToHUG‬
  • HMaster talks to HRegionServers contains HLog and HRegion contains MemStore contains HFile talks to DFS Client talks to DataNodes ‪#ToHUG‬
  • ZooKeeper(s) stores config, logs, determines HMaster for HMaster, clients ‪#ToHUG‬
  • HBase compaction=force write to disk. Relates to locality. ‪#ToHUG‬
  • If you’re smart, don’t integrate HBase and Hive. Hive is map-reduce SQL job, HBase is a database. Hive best for regular HDFS data ‪#ToHUG‬
  • HBase replication assumes column family exists in both clusters. No config required in child cluster. ‪#ToHUG‬
  • BTW, I’m subbing adjective “child” for adj “slave” as a stylistic preference. ‪#ToHUG‬
  • ZooKeeper quorum doesn’t have to be the same for HB replication. ZK quorum is an odd number of ZKs that agree on something. ‪#ToHUG‬
  • HBase Coprocessor Observers=database triggers ‪#ToHUG‬
  • HBase Co-pro Endpoints=stored procedures. In Java. Custom RPC protocol. Invoked by client on row or rowset. ‪#ToHUG‬
  • HBase security has auth per table, per column family, per column qualifier. Stored in_acl_ table. ‪#ToHUG‬

Chairing a Hadoop workshop at CASCON 2011

I’ll be chairing the Crunching Big Data in the Cloud with Hadoop and BigInsights workshop at CASCON 2011 in Toronto on Wednesday, November 9th. @BSteinfe and @MariusButuc will be joining me as co-chairs.

The workshop will be an all day hands-on introduction to Hadoop, HDFS, MapReduce, Hive, and JAQL. The plan is to have ready Hadoop clusters running in the cloud for the various exercises.

Hadoop is a parallelized data processing framework. It lends itself very nicely to running in cloud environments like Amazon EC2 and IBM SCE, as the core concept is to split sophisticated queries across clusters of commodity hardware. On a basic level it’s an implementation of MapReduce in Java, but a great many tools in its eco system make it easy to formulate and execute queries on the fly.

The material will have some things in common with the free Hadoop Fundamentals course you can take on Big Data University today, though naturally adapted for the CASCON themes and with added hands-on instruction.

Next steps