IBM’s Hadoop distribution

My work for the past couple years has been to develop DB2 images and templates for various cloud platforms and to engage the DB2 community online. This is still the case, but increasingly I’m spending my time working with IBM InfoSphere BigInsights.

BigInsights is IBM’s distribution of Hadoop.

What’s Hadoop? It’s a great way to crunch through massive amounts of unstructured data like email archives, geographic data, economic measurements, and so on to find interesting patterns. It rests on the Map-Reduce programming model, which Google developed for jobs such as building its search index. Much of Google’s success rests on Map-Reduce’s ability to scale out on commodity hardware.

(Notably, the whole Cloud Computing thing is the flip side of using massive arrays of commodity hardware. Since you have so much of it, you need a way to automate and abstract the management as much as possible. Since you’ve automated and abstracted away management, you might as well sell it as a service.)

Hadoop itself is an Apache Software Foundation project nurtured by Yahoo among others. It’s gaining an increasing number of commercial distributions including Cloudera, IBM, and now Hortonworks.

You can quickly try out BigInsights Basic on the cloud or download it to your own machine.

I really do recommend the macro that

At the #hackreduce Hadoop workshop

[10:05] At Hack/Reduce. It’s at a nice loft-style office space downtown.

[10:10] Installing git, msysGit, TortoiseGit, and also the git package in Cygwin on the off chance one of them proves useful.

[10:30] They have Bixi usage, ocean carbon measurements, a Wikipedia dump and other sample datasets.

Hey, wasn’t there an interesting theorem in an XKCD comic recently about Wikipedia…

Wikipedia trivia: if you take any article, click on the first link in the article text not in parentheses or italics, and then repeat, you will eventually end up at “Philosophy”. – XKCD

[10:45] Some guy: I want to work on calculating the Philosophy-distance for every article on Wikipedia.

[10:50] Me too. :P

[11:23] Installing Eclipse, either Gradle or Ant or both, and for the hell of it a Windows PATH editor.

[11:38] OK, OK, downloading Ubuntu. I have VMware Workstation so I’m going to skip VirtualBox.

[11:39] On further thought, I already have 2 Ubuntu VMs on this computer that I haven’t used in a year or two. I’ll just use one of them.

[12:31] I’m team 14 running stuff on cluster 4. Working solo, since that’s the way I roll. :P

[12:35] Contrary to the organizers, I have it running on Windows with no problems. I’ll post instructions for getting it working on Windows as I go.

Running Hadoop and #HackReduce exercises on Windows

[13:26] A few notes for running it on Windows:

  • Obviously, have Cygwin installed with git, wget, OpenSSH and other packages
  • Add gradle, ant, and so on bin directories to your system PATH variable after installing them
  • Re-launch any open command prompt or cygwin windows after changing the PATH
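  • In this command:
    • java -classpath ".:build/libs/HackReduce-0.2.jar:lib/*" org.hackreduce.examples.wikipedia.RecordCounter datasets/wikipedia /tmp/wikipedia_recordcounts
  • You need to:
    • Change both the colons : in the classpath to semicolons ;
    • mkdir a tmp directory in your project folder and change the output path accordingly: ./tmp/wikipedia_recordcounts
  • Which gives you:
    • java -classpath ".;build/libs/HackReduce-0.2.jar;lib/*" org.hackreduce.examples.wikipedia.RecordCounter datasets/wikipedia ./tmp/wikipedia_recordcounts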

[13:34] Yum, free pizza

Parsing Mediawiki links using Java and Regex

[14:10] OH HAI GUYZ. I HEAR U LIEK REGEX. HEERS SUM REGEX.

A (likely incomplete) regular expression for finding out the target of an internal Wikipedia link:

  • \[\[[^\]\:\|]*\|([^\]\:]+)\]\]|\[\[([^\]\:]+)\]\]

Notes:

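  • A Wikipedia link is of the form [[target|label]] or [[target]]. The page title comes before the pipe, so for piped links the capture group in the pattern above picks up the label; move the parentheses to the first bracket expression to capture the target instead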
  • For the purposes of this exercise, we want to avoid capturing stuff like [[Category:name]] or [[Image:name]]
  • Not interested in {{templates}}

Let’s decompose the line noise:

  • \[\[ means we are looking for something that starts with a literal [[
  • [^\]\:\|] matches any single character that is not ], :, or |
  • [^\]\:\|]* means the above, repeated zero or more times
  • \| matches a literal |
  • [^\]\:]+ is similar to the above, but repeated one or more times (and this time | is allowed)
  • ([^\]\:]+) means we want to capture this substring
  • \]\] means we are looking for something that ends with a literal ]]
  • | (unescaped) means that, alternatively, we will ignore all of the above and define a second pattern to match

Oh, and since this is Java, we have to double up the backslashes inside the string literal:

  • \\[\\[[^\\]\\:\\|]*\\|([^\\]\\:]+)\\]\\]|\\[\\[([^\\]\\:]+)\\]\\]
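A minimal sketch of how the pattern above might be used from Java. The class name WikiLinkExtractor, both method names, and the TARGET variant are assumptions for illustration, not the code from the event. One thing to watch: in MediaWiki markup the page title comes before the pipe, so the original pattern returns the text after the pipe for piped links. The hypothetical TARGET pattern captures the part before the pipe instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WikiLinkExtractor {
    // The pattern from the post, backslashes doubled for a Java string literal.
    static final Pattern LINK = Pattern.compile(
        "\\[\\[[^\\]\\:\\|]*\\|([^\\]\\:]+)\\]\\]|\\[\\[([^\\]\\:]+)\\]\\]");

    // Hypothetical variant (not from the post): captures the text before the
    // pipe and skips an optional "|label" part.
    static final Pattern TARGET = Pattern.compile(
        "\\[\\[([^\\]\\:\\|]+)(?:\\|[^\\]]*)?\\]\\]");

    /** Captured text per match of LINK: group 1 for piped links, group 2 otherwise. */
    static List<String> extract(String wikitext) {
        List<String> out = new ArrayList<>();
        Matcher m = LINK.matcher(wikitext);
        while (m.find()) {
            out.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return out;
    }

    /** Captured text per match of TARGET: the part before any pipe. */
    static List<String> extractTargets(String wikitext) {
        List<String> out = new ArrayList<>();
        Matcher m = TARGET.matcher(wikitext);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "[[Philosophy]] and [[Category:Thought]] and [[Logic|reasoning]]";
        System.out.println(extract(sample));        // original pattern
        System.out.println(extractTargets(sample)); // hypothetical variant
    }
}
```

Both patterns skip [[Category:Thought]] because of the colon. They differ only on the piped link.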

[14:50] After a long while of running my job on my local machine, I’m now running it on the cluster against the full dataset. Whee!

Algorithm for calculating Philosophy distance

[15:51] Continuing to work at it. Talked the idea over with one of the mentors, who suggested this approach:

  • Run Map-Reduce once to generate a list of articles linking directly to Philosophy
    • (Ancient philosophy, Philosophy)
    • (Mathematics, Philosophy)
    • Etc.
  • Run Map-Reduce a second time to generate a list of articles that link to articles in the first set
    • (Aristotle, Ancient philosophy, Philosophy)
    • (Democritus, Ancient philosophy, Philosophy)
    • (Euclid, Mathematics, Philosophy)
    • Etc.
  • Etc.
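The rounds above can be sketched in plain Java, without Hadoop, to show the shape of the computation. This is an illustration under assumptions, not the job that ran on the cluster: the class name PhilosophyDistance, the distances method, and the firstLink map (each article mapped to the one article its first link points to, following the XKCD rule) are all hypothetical. Each pass of the while loop stands in for one Map-Reduce run.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class PhilosophyDistance {
    /**
     * firstLink maps each article to the article its first link points to
     * (an assumed input format). Returns the number of hops from each
     * reachable article to the target; unreachable articles are left out.
     */
    static Map<String, Integer> distances(Map<String, String> firstLink, String target) {
        Map<String, Integer> dist = new TreeMap<>();
        dist.put(target, 0);
        Set<String> frontier = new HashSet<>();
        frontier.add(target);
        int round = 0;
        while (!frontier.isEmpty()) {
            round++;
            Set<String> next = new HashSet<>();
            // "Map": scan every (article, link) pair and keep the articles
            // whose link lands in the set found by the previous round.
            for (Map.Entry<String, String> e : firstLink.entrySet()) {
                boolean linksToFrontier = frontier.contains(e.getValue());
                boolean alreadySeen = dist.containsKey(e.getKey());
                if (linksToFrontier && !alreadySeen) {
                    next.add(e.getKey()); // "Reduce": the set removes duplicates
                }
            }
            for (String article : next) {
                dist.put(article, round);
            }
            frontier = next;
        }
        return dist;
    }

    public static void main(String[] args) {
        Map<String, String> firstLink = new HashMap<>();
        firstLink.put("Ancient philosophy", "Philosophy");
        firstLink.put("Mathematics", "Philosophy");
        firstLink.put("Aristotle", "Ancient philosophy");
        firstLink.put("Democritus", "Ancient philosophy");
        firstLink.put("Euclid", "Mathematics");
        System.out.println(distances(firstLink, "Philosophy"));
    }
}
```

The loop stops when a round finds no new articles, so anything that never reaches Philosophy simply does not appear in the result.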

[16:31] I think I should wrap this up soon.

1-Philosophy set

[16:34] Here’s a list of all Wikipedia articles that are 1 link away from Philosophy.