[10:05] At Hack/Reduce. It’s at a nice loft-style office space downtown.
[10:10] Installing git, msysGit, TortoiseGit, and also the git package in Cygwin on the off chance one of them proves useful.
[10:30] They have Bixi usage, ocean carbon measurements, a Wikipedia dump and other sample datasets.
Hey, wasn’t there an interesting theorem in an XKCD comic recently about Wikipedia…
Wikipedia trivia: if you take any article, click on the first link in the article text not in parentheses or italics, and then repeat, you will eventually end up at “Philosophy”. – XKCD
[10:45] Some guy: I want to work on calculating the Philosophy-distance for every article on Wikipedia.
[10:50] Me too. :P
[11:23] Installing Eclipse, either Gradle or Ant or both, and for the hell of it a Windows PATH editor.
[11:38] OK, OK, downloading Ubuntu. I have VMware Workstation so I’m going to skip VirtualBox.
[11:39] On further thought, I already have 2 Ubuntu VMs on this computer that I haven’t used in a year or two. I’ll just use one of them.
[12:31] I’m team 14 running stuff on cluster 4. Working solo, since that’s the way I roll. 😛
[12:35] Contrary to the organizers, I have it running on Windows with no problems. I’ll post instructions for getting it working on Windows as I go.
Running Hadoop and #HackReduce exercises on Windows
[13:26] A few notes for running it on Windows:
- Obviously, have Cygwin installed with git, wget, OpenSSH and other packages
- Add the bin directories of Gradle, Ant, and so on to your system PATH variable after installing them
- Re-launch any open command prompt or Cygwin windows after changing the PATH
- In this command:
  - java -classpath ".:build/libs/HackReduce-0.2.jar:lib/*" org.hackreduce.examples.wikipedia.RecordCounter datasets/wikipedia /tmp/wikipedia_recordcounts
- You need to:
  - Change both of the colons : in the classpath to semicolons ;
  - mkdir a tmp directory in your project folder and change the output path accordingly: ./tmp/wikipedia_recordcounts
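The colon-versus-semicolon business comes from the JVM's platform-specific classpath separator. Here is a minimal sketch of that idea; the class and method names (ClasspathDemo, joinClasspath) are made up for illustration, and only java.io.File.pathSeparator is a real JDK constant.

```java
import java.io.File;

class ClasspathDemo {
    // Joins classpath entries with the given separator.
    // Hypothetical helper, not part of the HackReduce project.
    static String joinClasspath(String separator, String... entries) {
        return String.join(separator, entries);
    }

    public static void main(String[] args) {
        // File.pathSeparator is ";" on Windows and ":" on Linux and macOS,
        // which is why the command needs editing when it moves between platforms.
        String cp = joinClasspath(File.pathSeparator,
                ".", "build/libs/HackReduce-0.2.jar", "lib/*");
        System.out.println("java -classpath \"" + cp + "\" ...");
    }
}
```

Running it on each platform prints the classpath string in the form that platform's java command expects.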
[13:34] Yum, free pizza
Parsing Mediawiki links using Java and Regex
[14:10] OH HAI GUYZ. I HEAR U LIEK REGEX. HEERS SUM REGEX.
A (likely incomplete) regular expression for finding out the target of an internal Wikipedia link:
- \[\[[^\]\:\|]*\|([^\]\:]+)\]\]|\[\[([^\]\:]+)\]\]
Notes:
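- A Wikipedia link is of the form [[target|label]] or [[target]]. MediaWiki puts the target before the pipe, so for piped links the first capture group in the pattern above actually grabs the label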
- For the purposes of this exercise, we want to avoid capturing stuff like [[Category:name]] or [[Image:name]]
- Not interested in {{templates}}
Let’s decompose the line noise:
- \[\[ means we are looking for something that starts with [[
- [^\]\:\|] means a single character that is not ], :, or |
- [^\]\:\|]* means the above, but occurring between 0 and infinity times
- \| means we are looking for a literal |
- [^\]\:]+ means similar to the above, but occurring between 1 and infinity times
- ([^\]\:]+) means we want to capture this substring
- \]\] means we are looking for something that ends with ]]
- | means that, alternatively, we will ignore all of the above and define a second pattern to match
Oh, and since this is Java, we have to double up the escape slashes:
- \\[\\[[^\\]\\:\\|]*\\|([^\\]\\:]+)\\]\\]|\\[\\[([^\\]\\:]+)\\]\\]
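For anyone who wants to try the pattern, here is a rough sketch of how it could be wired into java.util.regex. The class and method names (WikiLinks, captured) are made up for illustration and are not the code that ran on the cluster. Group 1 is filled for piped links (the text after the pipe) and group 2 for plain [[...]] links.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WikiLinks {
    // The pattern from this post, with every backslash doubled for a Java string literal.
    static final Pattern LINK = Pattern.compile(
            "\\[\\[[^\\]\\:\\|]*\\|([^\\]\\:]+)\\]\\]|\\[\\[([^\\]\\:]+)\\]\\]");

    // Returns whatever the pattern captures for each internal link in the wikitext.
    static List<String> captured(String wikitext) {
        List<String> out = new ArrayList<>();
        Matcher m = LINK.matcher(wikitext);
        while (m.find()) {
            // Exactly one of the two groups is non-null for any given match.
            out.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        // Links containing a colon, such as [[Category:Foo]], are skipped.
        System.out.println(captured("See [[Philosophy]], [[Category:Foo]] and [[a|b]]."));
    }
}
```

Anything with a colon inside the brackets fails both alternatives, which is how Category and Image links get filtered out.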
[14:50] After a long while of running my job on my local machine, I’m now running it on the cluster against the full dataset. Whee!
Algorithm for calculating Philosophy distance
[15:51] Continuing to work at it. Talked the idea over with one of the mentors, who suggested this approach:
- Run Map-Reduce once to generate a list of articles linking directly to Philosophy
  - (Ancient philosophy, Philosophy)
  - (Mathematics, Philosophy)
  - Etc.
- Run Map-Reduce a second time to generate a list of articles that link to articles in the first set
  - (Aristotle, Ancient philosophy, Philosophy)
  - (Democritus, Ancient philosophy, Philosophy)
  - (Euclid, Mathematics, Philosophy)
  - Etc.
- Etc.
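The round-by-round idea can be sketched in plain Java without Hadoop. This is an in-memory toy under some assumptions, not the actual Map-Reduce job: it assumes a map from each article to its first link has already been extracted, and the names PhilosophyDistance, distances and firstLink are hypothetical. Each pass of the loop plays the role of one Map-Reduce run.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class PhilosophyDistance {
    // firstLink: article title -> title of the first link in its text (assumed input).
    // Returns article title -> number of first-link hops needed to reach root.
    static Map<String, Integer> distances(Map<String, String> firstLink, String root) {
        Map<String, Integer> dist = new HashMap<>();
        dist.put(root, 0);
        Set<String> frontier = new HashSet<>();
        frontier.add(root);
        int round = 0;
        while (!frontier.isEmpty()) {
            round++;
            Set<String> next = new HashSet<>();
            // One "run": find articles that link to something found in the previous run.
            for (Map.Entry<String, String> e : firstLink.entrySet()) {
                if (frontier.contains(e.getValue()) && !dist.containsKey(e.getKey())) {
                    dist.put(e.getKey(), round);
                    next.add(e.getKey());
                }
            }
            frontier = next;
        }
        return dist;
    }

    public static void main(String[] args) {
        Map<String, String> firstLink = new HashMap<>();
        firstLink.put("Ancient philosophy", "Philosophy");
        firstLink.put("Mathematics", "Philosophy");
        firstLink.put("Aristotle", "Ancient philosophy");
        firstLink.put("Democritus", "Ancient philosophy");
        firstLink.put("Euclid", "Mathematics");
        System.out.println(distances(firstLink, "Philosophy"));
    }
}
```

The loop stops once a round finds no new articles, so articles that never reach Philosophy simply never get a distance.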
[16:31] I think I should wrap this up soon.
1-Philosophy set
[16:34] Here’s a list of all Wikipedia articles that are 1 link away from Philosophy.

