At the #hackreduce Hadoop workshop

[10:05] At Hack/Reduce. It’s at a nice loft-style office space downtown.

[10:10] Installing git, msysGit, TortoiseGit, and also the git package in Cygwin, on the off chance one of them proves useful.

[10:30] They have Bixi usage, ocean carbon measurements, a Wikipedia dump and other sample datasets.

Hey, wasn’t there an interesting theorem in an XKCD comic recently about Wikipedia…

Wikipedia trivia: if you take any article, click on the first link in the article text not in parentheses or italics, and then repeat, you will eventually end up at “Philosophy”. – XKCD

[10:45] Some guy: I want to work on calculating the Philosophy-distance for every article on Wikipedia.

[10:50] Me too. :P

[11:23] Installing Eclipse, either Gradle or Ant or both, and for the hell of it a Windows PATH editor.

[11:38] OK, OK, downloading Ubuntu. I have VMware Workstation so I’m going to skip VirtualBox.

[11:39] On further thought, I already have 2 Ubuntu VMs on this computer that I haven’t used in a year or two. I’ll just use one of them.

[12:31] I’m team 14 running stuff on cluster 4. Working solo, since that’s the way I roll. :P

[12:35] Contrary to the organizers, I have it running on Windows with no problems. I’ll post instructions for getting it working on Windows as I go.

Running Hadoop and #HackReduce exercises on Windows

[13:26] A few notes for running it on Windows:

  • Obviously, have Cygwin installed with git, wget, OpenSSH and other packages
  • After installing Gradle, Ant, and so on, add their bin directories to your system PATH variable
  • Re-launch any open command prompt or Cygwin windows after changing the PATH
  • In this command:
    • java -classpath ".:build/libs/HackReduce-0.2.jar:lib/*" org.hackreduce.examples.wikipedia.RecordCounter datasets/wikipedia /tmp/wikipedia_recordcounts
  • You need to:
    • Change both the colons : to semicolons ;
    • mkdir a tmp directory in your project folder and change the path accordingly: ./tmp/wikipedia_recordcounts
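
The colon-to-semicolon change comes from the classpath separator being platform-specific. Here is a minimal Java sketch of that difference; the class and method names are made up for illustration, and only File.pathSeparator and String.join are real JDK APIs.

```java
import java.io.File;

class ClasspathSeparatorSketch {
    // Hypothetical helper: joins classpath entries with the given separator
    // (":" on Unix-like systems, ";" on Windows).
    static String joinClasspath(String separator, String... entries) {
        return String.join(separator, entries);
    }

    public static void main(String[] args) {
        String[] entries = {".", "build/libs/HackReduce-0.2.jar", "lib/*"};
        // File.pathSeparator is the separator the current platform's JVM expects.
        System.out.println("This platform: " + joinClasspath(File.pathSeparator, entries));
        // The form needed in the Windows command described above.
        System.out.println("Windows form:  " + joinClasspath(";", entries));
    }
}
```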

[13:34] Yum, free pizza

Parsing Mediawiki links using Java and Regex


A (likely incomplete) regular expression for finding out the target of an internal Wikipedia link:

  • \[\[[^\]\:\|]*\|([^\]\:]+)\]\]|\[\[([^\]\:]+)\]\]


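  • A Wikipedia link is of the form [[target|label]] or [[target]]; note that for a piped link the pattern above captures the text after the pipe, which MediaWiki treats as the display label, so the first capture group would have to move in front of the pipe to get the real target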
  • For the purposes of this exercise, we want to avoid capturing stuff like [[Category:name]] or [[Image:name]]
  • Not interested in {{templates}}

Let’s decompose the line noise:

  • \[\[ means we are looking for something that starts with [[
  • [^\]\:\|] means we are looking for a single character that is not ], :, or |
  • [^\]\:\|]* means the above, but occurring between 0 and infinity times
  • \| means the next character has to be a literal |
  • [^\]\:]+ means similar to the above (any character except ] or :), but occurring between 1 and infinity times
  • ([^\]\:]+) means we want to capture this substring
  • \]\] means we are looking for something that ends with ]]
  • | means that, alternatively, we will ignore all of the above and define a second pattern to match

Oh, and since this is Java, we have to double up the backslashes inside the string literal:

  • \\[\\[[^\\]\\:\\|]*\\|([^\\]\\:]+)\\]\\]|\\[\\[([^\\]\\:]+)\\]\\]
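
As a rough illustration of how the escaped pattern could be used with java.util.regex, here is a small sketch. The class name, method name, and sample text are invented for this example and are not taken from the workshop code. For a piped link, group 1 holds the text after the pipe; for a plain link, group 2 holds the whole link text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WikiLinkSketch {
    // The pattern from this post, with backslashes doubled for a Java string literal.
    static final Pattern LINK = Pattern.compile(
        "\\[\\[[^\\]\\:\\|]*\\|([^\\]\\:]+)\\]\\]|\\[\\[([^\\]\\:]+)\\]\\]");

    // Hypothetical helper: returns whatever the pattern captures for each link it matches.
    static List<String> extractLinks(String wikitext) {
        List<String> captured = new ArrayList<>();
        Matcher m = LINK.matcher(wikitext);
        while (m.find()) {
            // Only one of the two groups is set, depending on which alternative matched.
            captured.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return captured;
    }

    public static void main(String[] args) {
        // Made-up sample wikitext: a plain link, a namespaced link, and a piped link.
        String sample = "[[Ancient philosophy]] and [[Category:Foo]] and [[Greek language|Greek]]";
        for (String link : extractLinks(sample)) {
            System.out.println(link);
        }
    }
}
```

On the sample string this prints Ancient philosophy and Greek, and skips the Category link because of the colon.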

[14:50] After a long while of running my job on my local machine, I’m now running it on the cluster against the full dataset. Whee!

Algorithm for calculating Philosophy distance

[15:51] Continuing to work at it. Talked the idea over with one of the mentors, who suggested this approach:

  • Run Map-Reduce once to generate a list of articles linking directly to Philosophy
    • (Ancient philosophy, Philosophy)
    • (Mathematics, Philosophy)
    • Etc.
  • Run Map-Reduce a second time to generate a list of articles that link to articles in the first set
    • (Aristotle, Ancient philosophy, Philosophy)
    • (Democritus, Ancient philosophy, Philosophy)
    • (Euclid, Mathematics, Philosophy)
    • Etc.
  • Etc.
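
The rounds above can be sketched in plain, in-memory Java. This is only a stand-in for the idea, not actual Hadoop jobs and not the code I ran on the cluster: each pass of the loop plays the role of one Map-Reduce run. It assumes every article has already been reduced to a single first link, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class PhilosophyDistanceSketch {
    // Hypothetical helper: firstLink maps an article title to the title its first link points at.
    static Map<String, Integer> distances(Map<String, String> firstLink, String root) {
        Map<String, Integer> distance = new HashMap<>();
        distance.put(root, 0);
        Set<String> frontier = new HashSet<>();
        frontier.add(root);

        int round = 0;
        while (!frontier.isEmpty()) {
            round++;
            Set<String> next = new HashSet<>();
            // One pass over all (article, link) pairs stands in for one Map-Reduce run.
            for (Map.Entry<String, String> e : firstLink.entrySet()) {
                String article = e.getKey();
                // Keep articles whose link lands in the previous round's set and that are still unassigned.
                if (frontier.contains(e.getValue()) && !distance.containsKey(article)) {
                    distance.put(article, round);
                    next.add(article);
                }
            }
            frontier = next;
        }
        return distance;
    }

    public static void main(String[] args) {
        // The example pairs from the list above.
        Map<String, String> firstLink = new HashMap<>();
        firstLink.put("Ancient philosophy", "Philosophy");
        firstLink.put("Mathematics", "Philosophy");
        firstLink.put("Aristotle", "Ancient philosophy");
        firstLink.put("Democritus", "Ancient philosophy");
        firstLink.put("Euclid", "Mathematics");
        System.out.println(distances(firstLink, "Philosophy"));
    }
}
```

Articles that never reach the root simply do not appear in the result, and the loop stops as soon as a round adds nothing new.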

[16:31] I think I should wrap this up soon.

1-Philosophy set

[16:34] Here’s a list of all Wikipedia articles that are 1 link away from Philosophy.

Published by

Leons Petrazickis

I'm a full-stack developer at IBM Analytics Emerging Technologies. I do Ruby, JS, Python, Hadoop, Spark, as well as web scale devops with Chef and Docker. My opinions are my own.
