Blog

  • At the #hackreduce Hadoop workshop

    [10:05] At Hack/Reduce. It’s at a nice loft-style office space downtown.

    [10:10] Installing git, msysGit, TortoiseGit, and also the git package in Cygwin on the off chance one of them proves useful.

    [10:30] They have Bixi usage, ocean carbon measurements, a Wikipedia dump and other sample datasets.

    Hey, wasn’t there an interesting theorem in an XKCD comic recently about Wikipedia…

    Wikipedia trivia: if you take any article, click on the first link in the article text not in parentheses or italics, and then repeat, you will eventually end up at “Philosophy”. – XKCD

    [10:45] Some guy: I want to work on calculating the Philosophy-distance for every article on Wikipedia.

    [10:50] Me too. :P

    [11:23] Installing Eclipse, either Gradle or Ant or both, and for the hell of it a Windows PATH editor.

    [11:38] OK, OK, downloading Ubuntu. I have VMware Workstation so I’m going to skip VirtualBox.

    [11:39] On further thought, I already have 2 Ubuntu VMs on this computer that I haven’t used in a year or two. I’ll just use one of them.

    [12:31] I’m team 14 running stuff on cluster 4. Working solo, since that’s the way I roll. 😛

    [12:35] Contrary to the organizers, I have it running on Windows with no problems. I’ll post instructions for getting it working on Windows as I go.

    Running Hadoop and #HackReduce exercises on Windows

    [13:26] A few notes for running it on Windows:

    • Obviously, have Cygwin installed with git, wget, OpenSSH and other packages
    • Add the bin directories of Gradle, Ant, and so on to your system PATH variable after installing them
    • Re-launch any open command prompt or Cygwin windows after changing the PATH
    • In this command:
      • java -classpath ".:build/libs/HackReduce-0.2.jar:lib/*" org.hackreduce.examples.wikipedia.RecordCounter datasets/wikipedia /tmp/wikipedia_recordcounts
    • You need to:
      • Change both the colons : to semicolons ;
      • mkdir a tmp directory in your project folder and change the path accordingly: ./tmp/wikipedia_recordcounts
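
    The colon-versus-semicolon difference is the JVM’s platform path separator, which Java exposes as File.pathSeparator. A minimal sketch, assuming nothing beyond the standard library (the ClasspathJoiner class and its join method are made up for illustration; the entries are the ones from the command above):

```java
import java.io.File;

final class ClasspathJoiner {
    // Joins classpath entries with the given separator
    // (";" on Windows, ":" on Linux and Mac).
    static String join(String separator, String... entries) {
        return String.join(separator, entries);
    }

    public static void main(String[] args) {
        // File.pathSeparator is whatever the current platform expects.
        System.out.println(
            join(File.pathSeparator, ".", "build/libs/HackReduce-0.2.jar", "lib/*"));
    }
}
```

    Since java.exe is a Windows program, it expects semicolons even when you launch it from a Cygwin shell.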

    [13:34] Yum, free pizza

    Parsing Mediawiki links using Java and Regex

    [14:10] OH HAI GUYZ. I HEAR U LIEK REGEX. HEERS SUM REGEX.

    A (likely incomplete) regular expression for finding out the target of an internal Wikipedia link:

    • \[\[[^\]\:\|]*\|([^\]\:]+)\]\]|\[\[([^\]\:]+)\]\]

    Notes:

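    • A Wikipedia link is of the form [[target|label]] or [[target]], so for a piped link the first capture group above picks up the label after the |; to get the target, the group has to go around the part before the |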
    • For the purposes of this exercise, we want to avoid capturing stuff like [[Category:name]] or [[Image:name]]
    • Not interested in {{templates}}

    Let’s decompose the line noise:

    • \[\[ means we are looking for something that starts with [[
    • [^\]\:\|] means we are looking for a character that is not ], :, or |
    • [^\]\:\|]* means the above, but occurring between 0 and infinity times
    • \| means we are looking for a literal |
    • [^\]\:]+ means similar to the above, but occurring between 1 and infinity times
    • ([^\]\:]+) means we want to capture this substring
    • \]\] means we are looking for something that ends with ]]
    • | means that, alternatively, we will ignore all of the above and define a second pattern to match

    Oh, and since this is Java, we have to double up the backslashes inside the string literal:

    • \\[\\[[^\\]\\:\\|]*\\|([^\\]\\:]+)\\]\\]|\\[\\[([^\\]\\:]+)\\]\\]
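
    Here is a rough sketch of how this could look as compilable Java. It is a variant of the pattern above, not the original: MediaWiki puts the target before the pipe ([[target|label]]), so this version captures the text before the | in a single group. The class name WikiLinkTargets and the method extract are invented for the example, and the pattern is still likely incomplete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WikiLinkTargets {
    // Group 1 is the text before an optional pipe, i.e. the link target.
    // The character class rejects ':' so [[Category:...]] and [[Image:...]]
    // never match. Note the doubled backslashes in the Java string literal.
    private static final Pattern LINK =
        Pattern.compile("\\[\\[([^\\]\\[:|]+)(?:\\|[^\\]\\[]*)?\\]\\]");

    // Returns the targets of all internal links found in the wikitext, in order.
    static List<String> extract(String wikitext) {
        List<String> targets = new ArrayList<>();
        Matcher m = LINK.matcher(wikitext);
        while (m.find()) {
            targets.add(m.group(1).trim());
        }
        return targets;
    }

    public static void main(String[] args) {
        String sample =
            "[[Ancient philosophy|the ancients]] and [[Mathematics]], not [[Category:Logic]].";
        System.out.println(extract(sample)); // [Ancient philosophy, Mathematics]
    }
}
```

    Using one group with an optional label part avoids having to check two different capture groups depending on which alternative matched.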

    [14:50] After a long while of running my job on my local machine, I’m now running it on the cluster against the full dataset. Whee!

    Algorithm for calculating Philosophy distance

    [15:51] Continuing to work at it. Talked the idea over with one of the mentors, who suggested this approach:

    • Run Map-Reduce once to generate a list of articles linking directly to Philosophy
      • (Ancient philosophy, Philosophy)
      • (Mathematics, Philosophy)
      • Etc.
    • Run Map-Reduce a second time to generate a list of articles that link to articles in the first set
      • (Aristotle, Ancient philosophy, Philosophy)
      • (Democritus, Ancient philosophy, Philosophy)
      • (Euclid, Mathematics, Philosophy)
      • Etc.
    • Etc.
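
    The same layer-by-layer idea can be sketched in memory as a breadth-first search over reversed links. This is not the Map-Reduce job itself, just a small single-machine illustration under stated assumptions: the class PhilosophyDistance, the method distances, and the toy link map are all hypothetical, and it needs Java 9 or later for List.of and Map.of.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PhilosophyDistance {
    // links maps each article to the articles it links to.
    // Returns, for every article that can reach root, how many links away it is.
    static Map<String, Integer> distances(Map<String, List<String>> links, String root) {
        // Invert the graph: for each article, which articles link to it?
        Map<String, List<String>> linkedFrom = new HashMap<>();
        for (Map.Entry<String, List<String>> e : links.entrySet()) {
            for (String target : e.getValue()) {
                linkedFrom.computeIfAbsent(target, k -> new ArrayList<>()).add(e.getKey());
            }
        }

        // Expand outwards from the root one layer at a time.
        // Each layer corresponds to one Map-Reduce pass in the approach above.
        Map<String, Integer> distance = new LinkedHashMap<>();
        Deque<String> frontier = new ArrayDeque<>();
        distance.put(root, 0);
        frontier.add(root);
        while (!frontier.isEmpty()) {
            String current = frontier.remove();
            for (String source : linkedFrom.getOrDefault(current, List.of())) {
                if (!distance.containsKey(source)) {
                    distance.put(source, distance.get(current) + 1);
                    frontier.add(source);
                }
            }
        }
        return distance;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "Ancient philosophy", List.of("Philosophy"),
            "Mathematics", List.of("Philosophy"),
            "Aristotle", List.of("Ancient philosophy"),
            "Euclid", List.of("Mathematics"));
        System.out.println(distances(links, "Philosophy"));
    }
}
```

    If each article maps only to its first link, the result is the distance in the XKCD sense; feeding in all links gives the shortest distance over any link.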

    [16:31] I think I should wrap this up soon.

    1-Philosophy set

    [16:34] Here’s a list of all Wikipedia articles that are 1 link away from Philosophy.

  • DB2 images for IBM Workload Deployer

    IBM Workload Deployer 3.0 has just come out along with the DB2 images I developed. IWD is a major revision of what was formerly known as WebSphere CloudBurst Appliance. The idea behind it is that it lets you plug a box into your existing virtualization infrastructure (VMware, pSeries, etc.) and make it feel more like a proper private cloud with image templates, automation, reproducible deployments, monitoring, and so on.

    I developed the three DB2 images and, along with Dustin, the script packages and topology patterns that make them integrate nicely with WebSphere Application Server. Here’s a screenshot of what one of those patterns looks like in the IWD pattern editor:

    [Screenshot: Highly available cluster pattern on IWD]

    In addition to the images made available on the appliance, we are making five additional DB2 image templates available for download. They add Red Hat Enterprise Linux as an OS option, bump up the DB2 version to V9.7 FP4, and enhance the High Availability enablement on the AIX-based DB2 Enterprise image.

    When you hear WebSphere folks talking about “DB2 Hypervisor Edition”, these images are what they are talking about.

    I learned a lot about the arcana of Linux and AIX administration developing these, though of course learning a lot about something always highlights how much more one has to learn. Regardless, I harvested some of that for a couple of blog posts this past winter, and I hope to post more about it in the coming weeks.

    On a side note, if you are interested in more of a Database-as-a-Service rather than Infrastructure-as-a-Service approach, IWD 3.0 also comes with Workload Pattern for DB2. It abstracts things to a higher level by letting you provision databases directly, which can be a nice option to have.

  • How to diff Word documents

    It’s fairly straightforward to diff or compare different revisions of an Office document on Windows. The approach below applies to Word, Excel, and PowerPoint files, as well as to ones created by Lotus Symphony, OpenOffice.org, or LibreOffice.

    1. Download and install WinMerge. This is a free, open source utility.
    2. Download the xdocdiff plugin. Unzip it somewhere.
    3. Copy xdoc2txt.exe and zlib.dll to C:\Program Files\WinMerge
    4. Copy amb_xdocdiffPlugin.dll to C:\Program Files\WinMerge\MergePlugins
    5. Start WinMerge.
    6. Go to Plugins > List and check [x] Enable plugins.
    7. Go to Plugins and set it to [x] Automatic unpacking
    8. Close or restart WinMerge

    You should now be able to select any two documents that you want to compare, right-click on them, and choose WinMerge to get a meaningful comparison of the textual differences between them.

    If you are seeing line noise in the comparison, you need to make sure you enable the settings mentioned in steps 6 and 7 above.

  • Triggers in DB2 Express-C 9.7.4

    My team at IBM recently released DB2 Express-C 9.7.4, the latest and greatest version of our free database.

    Raul wrote up a detailed article with the technical nitty-gritty of what’s new. There are a number of improvements, but one thing that caught my eye is the enhancements to triggers.

    A trigger is something defined to fire automatically when you insert, update, or delete a row in a table. Starting with 9.7.4, you can basically inline a whole stored procedure in the trigger definition. This is nice because it lets you keep the code for all the different actions on a table together.

    Let me quote Raul’s example:

    CREATE TABLE COMPANY_STATS (NBEMP INTEGER)
    !
    
    CREATE TRIGGER HIRED
     AFTER INSERT OR DELETE OR UPDATE OF SALARY ON EMPLOYEE
     REFERENCING NEW AS N OLD AS O FOR EACH ROW
       BEGIN
             IF INSERTING
             THEN UPDATE COMPANY_STATS SET NBEMP = NBEMP + 1;
             END IF;
    
             IF DELETING
             THEN UPDATE COMPANY_STATS SET NBEMP = NBEMP - 1;
             END IF;
    
             IF (UPDATING AND (N.SALARY > 1.1 * O.SALARY))
             THEN SIGNAL SQLSTATE '75000' SET MESSAGE_TEXT='Salary increase>10%';
             END IF;
       END
    !

    Ignore that last part. All salary increases should be > 10%.

  • Reduce your stress by disabling notifiers, toasts, and every sort of popup

    A month ago, I disabled email notification in my Gmail notifier before doing a presentation and neglected to turn it back on later.

    It took me a long time to notice the lack of notifications. What I did notice was a reduced level of stress. I was able to concentrate on a single task without unimportant, off-topic notices distracting me. This is enormously important in software development.

    Let me quote from Eric S. Raymond’s classic Jargon File:

    hack mode n.

    a Zen-like state of total focus on The Problem that may be achieved when one is hacking (this is why every good hacker is part mystic). Ability to enter such concentration at will correlates strongly with wizardliness; it is one of the most important skills learned during larval stage. Sometimes amplified as deep hack mode.

    Being yanked out of hack mode (see priority interrupt) may be experienced as a physical shock, and the sensation of being in hack mode is more than a little habituating. The intensity of this experience is probably by itself sufficient explanation for the existence of hackers, and explains why many resist being promoted out of positions where they can code. See also cyberspace (sense 3).

    Some aspects of hacker etiquette will appear quite odd to an observer unaware of the high value placed on hack mode. For example, if someone appears at your door, it is perfectly okay to hold up a hand (without turning one’s eyes away from the screen) to avoid being interrupted. One may read, type, and interact with the computer for quite some time before further acknowledging the other’s presence (of course, he or she is reciprocally free to leave without a word). The understanding is that you might be in hack mode with a lot of delicate state (sense 2) in your head, and you dare not swap that context out until you have reached a good point to pause. See also juggling eggs.

    Joel Spolsky wrote something similar in 2001. The basic idea is that multitasking is inherently wasteful because a context switch between one complicated task and another complicated task has costs. The more often you switch between tasks, the more often you incur the overhead of a context switch.

    My advice to you is this: disable your email notifier, disable your Twitter notifier, disable every other sort of notifier you have. They are never as urgent as the task at hand. You’ll only be happier and more productive.

  • Find out public information about the people you email with Rapportive

    Rapportive is a great browser enhancement for Gmail. It automatically looks up email addresses and populates a sidebar with that person’s profile photo (from Google Talk or Flickr), job title (from LinkedIn), tweets, as well as links to their profiles on Facebook, Skype, etc.

    I find it especially useful when reading mailing list messages, as it lets me easily find the Twitter accounts of interesting people.

    It’s also useful for making you think critically about the information you have exposed online. By looking at your own profile, you can find out if there’s any information that you are exposing without meaning to. In my own case, I was surprised to see my ancient Flickr account from circa 2003 on it, which I’ve since made private.

    I think it ties in nicely with IBM’s recent study that 21% of email users would consider applications to complement email. Lotus is also cooking up a lot of neat things that integrate social media with email. A hat tip goes to Marius for the link.

  • Linux Basics: Navigate with ls, cd, and pwd

    The Linux command line can be a bit intimidating at first, but it gets much easier once you learn a few basic building blocks. The power of the command line lies in combining many basic commands in interesting ways.

    Open up a Linux terminal (or, if you want to follow along on Windows, Cygwin). The darkness of the abyss stares at you, but it’s really not as unfriendly as all that.

    First things first. Where am I? Type “pwd” without quotes and hit enter:

    $ pwd
    
    /home/leonsp

    You should see something like “/home/leonsp”. That’s your home directory, similar to your Documents folder on Windows. “pwd” stands for “print working directory”, and is a handy way to find out your current location.

    Let’s try going somewhere else. Enter “cd ..”, followed by “pwd”.

    $ cd ..
    $ pwd
    
    /home

    Two dots means one directory up. You just changed the directory to one directory up from /home/leonsp, which is /home.

    Let’s get back to your home directory. There are a few ways to do this. The following commands will all do the same thing:

    $ cd /home/leonsp
    $ cd ~
    $ cd

    ~ is convenient shorthand that refers to the home directory of the current user — you.

    What’s in all these directories? Let’s list the contents using the “ls” command:

    $ ls

    Not much will show up, as your home directory starts out with few files in it. It will, however, have a bunch of hidden files. Let’s list all of these:

    $ ls -a
    
    .  ..  .bash_history  .bash_profile  .bashrc  .inputrc  .lesshst  .ssh  .subversion

    By convention, Linux treats all filenames starting with a dot as hidden. They will only show in the listing when you ask to see them.

    .bash_history, .bash_profile, and .bashrc are configuration files for my shell, Bash. Which shell you have mostly affects scripting (or automation), which is a more advanced topic. Bash, ksh, and sh are similar, while csh and tcsh are a bit different.

    . and .. will show up everywhere. Single dot “.” refers to the current directory, and two dots “..” refer to the parent directory.

    It’s possible to get a long, more detailed listing:

    $ ls -la
    
    drwxr-xr-x+ 1 leonsp None     0 Feb 22 15:30 .
    drwxrwxrwt+ 1 leonsp root     0 Jan 28  2010 ..
    -rw-------  1 leonsp None 17431 Apr  5 00:16 .bash_history
    -rwxr-xr-x  1 leonsp None  1150 Jan 28  2010 .bash_profile
    -rwxr-xr-x  1 leonsp None  3754 Jan 28  2010 .bashrc
    -rwxr-xr-x  1 leonsp None  1461 Jan 28  2010 .inputrc
    -rw-------  1 leonsp None    35 Jan 10 16:48 .lesshst
    drwx------+ 1 leonsp None     0 Mar 30 13:05 .ssh
    drwxr-xr-x+ 1 leonsp None     0 Feb 22 15:30 .subversion
    -rw-r--r--  1 leonsp None     0 Aug 26  2010 blah

    We’ll get into what these columns mean later.

  • MySpace’s death spiral due to no test environment, no source control, no code review

    I just saw an extremely insightful comment by Nick Kwiatkowski, quoted in full by the High Scalability blog. Unfortunately, they buried the lede very deep, and I don’t trust the original medium of Disqus comments to have any longevity.

    Here’s the comment for posterity:

    Having been in a position where I was able to work with some of the programmers who worked at MySpace, the issue wasn’t the engine (whether it was on ColdFusion or .NET), it was the environment they choose to breed for their developers. Management would say “We need X feature NOW to remain competitive”. They would then select a group of developers to implement that feature. The biggest problem was they didn’t allow the developers to have staging or testing servers - they deployed on the production servers on the first go-around. Sometimes these developers were given 5 or 10 projects that they had to deploy in very little time. And all this with no change management or control. No versioning either.

    MySpace management never wanted to go back and review code or make it more efficient. Their solution was “more servers”. They ended up hiring a crew whose sole job was to install more servers. Meanwhile they had developers checking in buggy code and they were racking up technical debt at an alarming rate. At the time MySpace was running two major versions of their application server behind what was recommended for use. When Microsoft & New Atlanta came around, they jumped at the idea to essentially sell off their technical debt (like a mortgage to a financial firm), and have somebody else take care of their problem.

    The problem then was Microsoft was not updating their old code, they simply were adding new features on .NET. This didn’t solve their problems and left them in a situation where they still needed to fix the old stuff, all the while updating new code.

    The issue with MySpace was this : they are a classic example of when you don’t listen and you accumulate too much technical debt. Fixing old stuff should be a priority, and doing things like change management, version control, testing and development servers, etc. are all a must. This is why the bookface is able to deploy new changes with little impact — they have everything tested and proofed out before they let their actual users play with it.

    (I will take this down on request. I appreciate that quoting something whole is generally a bad idea.)

    Maybe this is just my bias as a developer, but to me this fully explains the failure of MySpace and the success of Facebook. The latter has a reputation for following good software engineering practices: separate development, testing, and production environments; source control; and so on. The former turns out to have gone out of its way to sabotage its own software.

  • Twitter reinvents the vocative case

    I recently saw an interesting idea on reddit: that the @name notation popularized by Twitter and now adopted on many internet forums as a way of addressing someone is basically a reinvention of the vocative case.

    What’s a case? Well, English has three cases for nouns: subjective, objective, and possessive.

    The subjective noun acted on the possessive noun’s objective noun.

    These roughly correspond to Latin’s nominative, accusative, and genitive, but are quite different in the fiddly details.

    Latin had several other cases, one of which was the vocative. This case was used to address people and things in what you said. Some English examples would be:

    1. O Canada, we stand on guard for thee.
    2. Yo homes, smell you later.
    3. Hey buddy, pass the salt.
    4. Mother, should I run for President?

    All of these would be in the vocative case in Latin, but to my knowledge there isn’t a formal grammar for it in English.

    By introducing a standard notation for addressing @someone, Twitter regularizes this in English. In a way, it is a formalization of the grammar and an inflection of the noun.

    I think that the ultimate test for whether this is a grammatical change or a passing fad is whether @name will make it into print outside of discussions of Twitter itself. If it does, the vocative will be reinvented.

  • What is DB2 Hypervisor Edition?

    Edit: Please see DB2 HV images for IWD.

    It’s regular DB2. There’s no DB2 HV or DB2 HE. You use the same DB2 edition whether you are using it on physical hardware or a virtual server in the cloud. In fact, DB2 is fully supported in virtually every virtualized environment.

    9 to 5, I’m the lead developer of DB2 database server images for WebSphere CloudBurst Appliance. What’s WCA? Well, it’s a purple box that takes your existing VMware ESX or ESXi or PowerVM machines and turns them into a private cloud.

    WCA has a shiny web interface that makes it easy to deploy patterns of machines running WebSphere (WAS), DB2, etc. and then redeploy the same environment for development, test, production, etc. It’s easy to configure the OSes to update themselves, to get everything to talk to each other, and so on.

    The appliance is like a vending machine loaded up with delicious IBM licenses. Want a virtual machine with a DB2 database? Boom, here you go.

    WCA is a private cloud solution. If you want to get started right now, we also have DB2 images for public clouds like Amazon EC2 and IBM Cloud, as well as templates targeting RightScale.

    In fact, you can try an alpha of the next version of DB2 in the cloud today. Once you are approved, it takes all of ten minutes to get your own server running.

    If you already have your very own WebSphere CloudBurst Appliance, you can download DB2 for WCA, get some script packages, or read a technical guide.