[10:05] At Hack/Reduce. It’s at a nice loft-style office space downtown.
[10:10] Installing git, msysGit, TortoiseGit, and also the git package in Cygwin on the off chance one of them proves useful.
[10:30] They have Bixi usage, ocean carbon measurements, a Wikipedia dump and other sample datasets.
Hey, wasn’t there an interesting theorem in an XKCD comic recently about Wikipedia…
Wikipedia trivia: if you take any article, click on the first link in the article text not in parentheses or italics, and then repeat, you will eventually end up at “Philosophy”. – XKCD
[10:45] Some guy: I want to work on calculating the Philosophy-distance for every article on Wikipedia.
[10:50] Me too. :P
[11:23] Installing Eclipse, either Gradle or Ant or both, and for the hell of it a Windows PATH editor.
[11:38] OK, OK, downloading Ubuntu. I have VMware Workstation so I’m going to skip VirtualBox.
[11:39] On further thought, I already have 2 Ubuntu VMs on this computer that I haven’t used in a year or two. I’ll just use one of them.
[12:31] I’m team 14 running stuff on cluster 4. Working solo, since that’s the way I roll. 😛
[12:35] Contrary to the organizers, I have it running on Windows with no problems. I’ll post instructions for getting it working on Windows as I go.
Running Hadoop and #HackReduce exercises on Windows
[13:26] A few notes for running it on Windows:
- Obviously, have Cygwin installed with git, wget, OpenSSH and other packages
- Add the bin directories of Gradle, Ant, and so on to your system PATH variable after installing them
- Re-launch any open command prompt or Cygwin windows after changing the PATH
- In this command:
  - java -classpath ".:build/libs/HackReduce-0.2.jar:lib/*" org.hackreduce.examples.wikipedia.RecordCounter datasets/wikipedia /tmp/wikipedia_recordcounts
- You need to:
  - Change both the colons : to semicolons ;
  - mkdir a tmp directory in your project folder and change the path accordingly: ./tmp/wikipedia_recordcounts
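Why the colons have to become semicolons: the JVM splits the classpath on an OS-specific separator, and on Windows that separator is ; rather than :. A tiny sketch to see it for yourself (the class and method names are made up for illustration and are not part of the HackReduce project):

```java
import java.io.File;

// Hypothetical demo class, not part of the HackReduce project.
class ClasspathSeparatorDemo {
    // Joins classpath entries with the separator the current OS expects.
    static String joinClasspath(String... entries) {
        return String.join(File.pathSeparator, entries);
    }

    public static void main(String[] args) {
        // File.pathSeparator is ":" on Linux and macOS, ";" on Windows.
        System.out.println("path separator: " + File.pathSeparator);
        System.out.println(joinClasspath(".", "build/libs/HackReduce-0.2.jar", "lib/*"));
    }
}
```

Run on Windows, the second line printed should be the classpath string with semicolons, ready to paste after -classpath.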
[13:34] Yum, free pizza
Parsing Mediawiki links using Java and Regex
[14:10] OH HAI GUYZ. I HEAR U LIEK REGEX. HEERS SUM REGEX.
A (likely incomplete) regular expression for finding out the target of an internal Wikipedia link:
- \[\[[^\]\:\|]*\|([^\]\:]+)\]\]|\[\[([^\]\:]+)\]\]
Notes:
- A Wikipedia link is of the form [[target|label]] or [[target]] (the target comes before the pipe), so for piped links the first alternative of the pattern captures the label, not the target
- For the purposes of this exercise, we want to avoid capturing stuff like [[Category:name]] or [[Image:name]]
- Not interested in {{templates}}
Let’s decompose the line noise:
- \[\[ means we are looking for something that starts with [[
- [^\]\:\|] means we are looking for a character that is not ], :, or |
- [^\]\:\|]* means the above, but occurring between 0 and infinity times
- \| means we are looking for a literal |
- [^\]\:]+ means similar to the above (though | is allowed here), but occurring between 1 and infinity times
- ([^\]\:]+) means we want to capture this substring
- \]\] means we are looking for something that ends with ]]
- | (unescaped) means that, alternatively, we will ignore all of the above and define a second pattern to match
Oh, and since this is Java, we have to double up the escape slashes:
- \\[\\[[^\\]\\:\\|]*\\|([^\\]\\:]+)\\]\\]|\\[\\[([^\\]\\:]+)\\]\\]
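For anyone who wants to try the pattern, here is a minimal sketch of how it could be plugged into java.util.regex. The class name, method name, and sample text are invented for this example; only the pattern itself comes from the notes above. Note that for a piped link it returns the text after the pipe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are made up for this sketch.
class WikiLinkExtractor {
    // The pattern from the post, with every backslash doubled for a Java string literal.
    static final Pattern LINK = Pattern.compile(
        "\\[\\[[^\\]\\:\\|]*\\|([^\\]\\:]+)\\]\\]|\\[\\[([^\\]\\:]+)\\]\\]");

    // Returns the captured text of every internal link in the wikitext, in order.
    static List<String> extract(String wikitext) {
        List<String> found = new ArrayList<>();
        Matcher m = LINK.matcher(wikitext);
        while (m.find()) {
            // Group 1 is set for piped links, group 2 for plain ones.
            found.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "[[Category:Thought]] See [[Ancient philosophy]] and [[Greek language|Greek]].";
        // prints [Ancient philosophy, Greek]
        System.out.println(extract(sample));
    }
}
```

The [[Category:Thought]] link is skipped because neither alternative allows a colon inside the brackets.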
[14:50] After a long while of running my job on my local machine, I’m now running it on the cluster against the full dataset. Whee!
Algorithm for calculating Philosophy distance
[15:51] Continuing to work at it. Talked the idea over with one of the mentors, who suggested this approach:
- Run Map-Reduce once to generate a list of articles linking directly to Philosophy
  - (Ancient philosophy, Philosophy)
  - (Mathematics, Philosophy)
  - Etc.
- Run Map-Reduce a second time to generate a list of articles that link to articles in the first set
  - (Aristotle, Ancient philosophy, Philosophy)
  - (Democritus, Ancient philosophy, Philosophy)
  - (Euclid, Mathematics, Philosophy)
  - Etc.
- Etc.
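The layering idea can be sketched in plain Java without Hadoop. This is only an in-memory stand-in for the repeated Map-Reduce passes, not the job I actually ran: the class, the method names, and the toy link table (built from the example pairs above) are all assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// In-memory stand-in for the repeated Map-Reduce passes; all names are hypothetical.
class PhilosophyDistance {
    // links maps each article to the articles it links to.
    // Returns article -> number of links needed to reach the target.
    static Map<String, Integer> distances(Map<String, List<String>> links, String target) {
        Map<String, Integer> dist = new HashMap<>();
        dist.put(target, 0);
        Set<String> frontier = new HashSet<>();
        frontier.add(target);
        int round = 0;
        while (!frontier.isEmpty()) {
            round++;
            Set<String> next = new HashSet<>();
            // One pass: scan every article and keep those linking into the previous set.
            for (Map.Entry<String, List<String>> e : links.entrySet()) {
                if (dist.containsKey(e.getKey())) continue; // reached in an earlier pass
                for (String linked : e.getValue()) {
                    if (frontier.contains(linked)) {
                        next.add(e.getKey());
                        break;
                    }
                }
            }
            for (String article : next) dist.put(article, round);
            frontier = next;
        }
        return dist;
    }

    // Toy data mirroring the example pairs above, not real Wikipedia output.
    static Map<String, List<String>> sampleLinks() {
        Map<String, List<String>> links = new HashMap<>();
        links.put("Ancient philosophy", Arrays.asList("Philosophy"));
        links.put("Mathematics", Arrays.asList("Philosophy"));
        links.put("Aristotle", Arrays.asList("Ancient philosophy"));
        links.put("Democritus", Arrays.asList("Ancient philosophy"));
        links.put("Euclid", Arrays.asList("Mathematics"));
        return links;
    }

    public static void main(String[] args) {
        // TreeMap only to get a stable, sorted printout.
        System.out.println(new TreeMap<>(distances(sampleLinks(), "Philosophy")));
    }
}
```

Each trip around the while loop plays the role of one Map-Reduce run over the dump: it only needs the set produced by the previous run, which is what makes the approach easy to chain on a cluster.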
[16:31] I think I should wrap this up soon.
1-Philosophy set
[16:34] Here’s a list of all Wikipedia articles that are 1 link away from Philosophy.