Blog

CASCON 2012

On November 6, 2012, I’m teaching a hands-on lab at CASCON together with Bradley Steinfeld and Marius Butuc. The lab is called Crunching Big Data with Hadoop and BigInsights in the Cloud. The lab is based on the Hadoop Fundamentals course at Big Data University.

Morning

1.0 Welcome
1.1 What is Big Data?
1.2 Lab Setup
– Setup Lab
– Setup Lab (PDF Download)
1.3 What is Hadoop?
1.4 Hadoop Architecture – HDFS
– HDFS Lab
– Lab (PDF Download)
1.5 Hadoop Architecture – MapReduce
– MapReduce Lab
– Lab (PDF Download)

Afternoon

1.6 Pig, Hive, and Jaql
– Pig, Hive, and Jaql Lab
– Lab (PDF Download)
1.8 Working with BigInsights
– Web Console Lab
– Web Console Lab (PDF Download)
1.9 Data Discovery with BigSheets

Module 1.7 covers Flume. It’s available for free on Big Data University.

November 5, 2012
ODBC and 32-bit Excel on Windows 7 x64
I do some reporting in Excel. The reporting involves loading data via ODBC from a DB2 database. Excel is pretty zippy with its pivot tables once the data is loaded, but setting up the initial connection can be tricky.

Windows 7 is the first Microsoft operating system where consumers are expected to run the 64-bit version. However, Office hasn’t caught up yet, and 32-bit is the default for Office 2010.

32-bit Excel can’t see 64-bit ODBC data sources.

Windows comes with entirely separate 64-bit and 32-bit ODBC control panels. The 64-bit ODBC control panel is the default, and the 32-bit ODBC control panel is not even listed in the main control panel. You need to invoke it via Start > Run.

To invoke the 32-bit ODBC control panel, use the following command:
```
%systemdrive%\Windows\SysWoW64\odbcad32.exe
```
Once you have the control panel open, it should be straightforward to define a new System data source to your database.
October 3, 2012
TypeScript’s doomed embrace of JavaScript

Microsoft recently announced TypeScript. From what I can tell, it’s JavaScript with optional types. The type annotation syntax is the same as in Adobe’s ActionScript and in the sadly defunct ECMAScript 4.

TypeScript also includes a new class syntax based on the one proposed in ECMAScript 6. I’m dubious about the addition of class-based features to JavaScript. JavaScript’s traditional strength is in prototypal inheritance and informal interfaces. I think there’s value in there being only one way to do things in a programming language.

As the Zen of Python puts it:

“There should be one– and preferably only one –obvious way to do it.”

And as Ruby’s designer Matsumoto describes the Principle of Least Astonishment:

“Everyone has an individual background. Someone may come from Python, someone else may come from Perl, and they may be surprised by different aspects of the language. Then they come up to me and say, ‘I was surprised by this feature of the language, so Ruby violates the principle of least surprise.’ Wait. Wait. The principle of least surprise is not for you only. The principle of least surprise means principle of least my surprise [sic]. And it means the principle of least surprise after you learn Ruby very well. For example, I was a C++ programmer before I started designing Ruby. I programmed in C++ exclusively for two or three years. And after two years of C++ programming, it still surprises me.”

I’ve written Perl, and I’ve written C++. Both of them are kitchen sink languages that allow for every possible way of doing things. Per the above, this is not a good thing. For a discussion of the problems with C++, try the C++ Frequently Questioned Answers (FQA).

I think adding classes to JavaScript would detract from the beauty of the language, or at least the beauty of JavaScript’s Good Parts. A much more useful and JavaScript-esque enhancement would be to add Go-style interfaces rather than classes.
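To make the contrast concrete, here is a minimal sketch of prototypal inheritance plus an informal, structural interface check. The `satisfies` helper and the `Describable` list are hypothetical names made up for illustration, and the check runs at runtime, whereas Go verifies interfaces at compile time.

```javascript
// Prototypal inheritance: objects inherit directly from other objects.
var animal = {
  describe: function () {
    "use strict";
    return this.name + ' says ' + this.sound();
  }
};

// Object.create makes a new object whose prototype is `animal`.
var dog = Object.create(animal);
dog.name = 'Rex';
dog.sound = function () {
  "use strict";
  return 'woof';
};

// Hypothetical helper: an "interface" is just a list of method names,
// and any object that has them all satisfies it, with no declaration needed.
function satisfies(obj, methodNames) {
  "use strict";
  return methodNames.every(function (name) {
    return typeof obj[name] === 'function';
  });
}

var Describable = ['describe', 'sound'];

console.log(dog.describe());                  // Rex says woof
console.log(satisfies(dog, Describable));     // true
console.log(satisfies(animal, Describable));  // false: animal has no sound()
```

Nothing in `dog` says it implements `Describable`. It simply has the right methods, which is the property that makes Go-style interfaces a natural fit for a prototypal language.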

TypeScript is unlikely to become any more successful than Google’s Dart, though to give Microsoft credit, TypeScript is a lot more compatible with the existing JavaScript infrastructure. The only post-JavaScript success story that I’m aware of is CoffeeScript, and CoffeeScript’s strength is that it does not seek to replace JavaScript.

(On a side note, I found out about TypeScript from Eric Lippert’s How do we ensure that method type inference terminates?)

October 2, 2012
Dehacking this blog
The first rule of security is to, of course, assume everything is compromised. If some code is compromised, everything is compromised. The correct response to a hacked WordPress is to nuke all the code.

My WordPress installation was recently compromised. There’s a limit to how far I can apply the principle because this particular WordPress is currently on shared hosting, but all code I have access to is now nuked. WordPress has been reinstalled from scratch, and all the various hanger-on sites that had accumulated in the same hosting account are now no more.

I’ve also adopted the pertinent steps from My WordPress Site Was Hacked, Hardening WordPress, and the Ultimate Security Checker plugin (guide).

Last line of defense:
```
grep base64_decode -R *
```
```
grep gzinflate -R *
```
The attack’s objective was to inject PHP code into various pages. The code was obfuscated via a double pass through those two functions. The two shell commands above will show any instances of those two functions.
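The same check can be sketched in JavaScript for cases where grep isn't available. This is only an illustration: `findMarkers` is a made-up helper, and the sample input is invented rather than the actual injected code.

```javascript
// Function names the injected code relied on (taken from the post above).
var MARKERS = ['base64_decode', 'gzinflate'];

// Return one { line, marker } entry for each marker found on each line.
// Line numbers are 1-based, like grep -n.
function findMarkers(source) {
  "use strict";
  var hits = [];
  source.split('\n').forEach(function (text, i) {
    MARKERS.forEach(function (marker) {
      if (text.indexOf(marker) !== -1) {
        hits.push({ line: i + 1, marker: marker });
      }
    });
  });
  return hits;
}

// Invented sample input, for demonstration only.
var sample = [
  '<?php',
  'echo "hello";',
  'eval(gzinflate(base64_decode($payload)));'
].join('\n');

console.log(findMarkers(sample));
```

Legitimate code also calls these functions, so every hit is a lead to inspect by hand rather than proof of compromise.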
September 25, 2012
New Features in Hadoop 2.0 summary
I live-tweeted yesterday’s New Features in Hadoop 2.0 session at the Toronto Hadoop User Group. I think it’s a pretty good summary. If nothing else, it helped me absorb the information.

Spoiler: The iPad raffle was won by yours truly.

Hadoop 2.0
- At the Hadoop 2.0 session #TorontoHUG
- Got here just before they figured out how to get the projector to emerge. It’s shy and hides in the ceiling. #ToHUG
- Also, just after pizza got here. Maybe I shouldn’t have grabbed that emergency bagel. #ToHUG
- Hadoop 0.20 -> Hadoop 0.20.2 -> Hadoop 2.0 #ToHUG
- Hive, which is SQL on Hadoop w/ JDBC drivers, now has Binary, Timestamp data types and bitmap indexes #ToHUG
- Pig gets Javascript UDFs and can be embedded in JS or Python #ToHUG
- Cool, Sqoop getting an IBM DB2 connector is a bullet point in a Cloudera presentation #ToHUG
MapReduce v2
- MapReduce v2=YARN=Yet Another Resource Negotiator #ToHUG
- MRv2 can support other processing frameworks. Eg graph processing, Santa Fe Institute simulations, etc #ToHUG
- MRv2 is not about old API/new API. Unrelated #ToHUG
- MRv2 good for research but not ready for production, even by startup standards #ToHUG
- MRv2 allows Hamster MPI on Hadoop, Hama bulk synchronous processing, Giraph graph processing — none are MapReduce #ToHUG
- In MRv1, JobTracker ran on Master, TaskTrackers ran on child nodes #ToHUG
- In MRv1, JobTracker managed resources, scheduled, monitored
- Hadoop 2.0 can run either MRv1 or MRv2, but v2 not recommended for production #ToHUG
- MRv2 has 1 Resource Manager, many Node Managers instead #ToHUG
- But unlike JT, RM not a single point of failure because it now delegates App Masters to Node Managers per job #ToHUG
- Resource management still central, but job management now decentralized #ToHUG
- App Master is like a library injected by RM into Node Managers #ToHUG
- RM can die with low risk of losing jobs #ToHUG
- App Master manages app lifecycle, negotiates resource containers with Resource Manager, monitors tasks on other nodes #ToHUG
- In principle, no issue with running MRv2 on either HDFS or GPFS #ToHUG
HDFS Federation
- The old Secondary NameNode is a terrible misnomer — not a backup NN #ToHUG
- NameNode keeps track of all the data on all the DataNodes #ToHUG
- HDFS Federation now allows for multiple NameNodes #ToHUG
- In federation, each NN manages a namespace volume #ToHUG
- HDFS Federation is not High Availability, is not Disaster Recovery #ToHUG
- NNs in Federation do not communicate #ToHUG
- Federation improves scalability, perf, isolation #ToHUG
- A fed namespace volume consists of metadata, block pool (corresponding to files) #ToHUG
- All Data Nodes are used by all the federated NNs #ToHUG
HDFS High Availability
- NameNode High Availability is a new feature different from NN Fed #ToHUG
- Two NNs: one active, one standby. Standby takes over on failure. Fencing to prevent split brain by killing old one on takeover. #ToHUG
- Non-HA NN can fail via crash or planned maintenance. #ToHUG
- Clients and DNs only talk to active NN. Standby maintains a copy of active’s state. Purpose of NN unchanged. #ToHUG
- Active NN writes state to shared filesystem — NFS. #ToHUG
- HDFS fences to kill splitbrain via SSH or shell scripts. #ToHUG
- NFS is the new single point of failure — yay! But NFS has proven HA solutions as well. #ToHUG
- Failover can be auto or manual. Use hdfs haadmin -failover nn01 nn02 command #ToHUG
- HA is in Hadoop 2.0 regardless of MRv1 or v2. #ToHUG
- NameNode runs on your beefiest machine. Upwards of 16gb of ram typical. Limit is JVM memory management, Java garbage collector. #ToHUG
- OMGWTFBBQ I just won the iPad3 raffle #ToHUG
- The IBM guy got it. “I swear they don’t pay me very much”. Heh. #embarrassed #woohoo #ToHUG
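The active/standby failover with fencing described in the tweets above can be sketched as a toy model. None of these names are Hadoop APIs. They are invented to show the order of operations, with node names borrowed from the haadmin example.

```javascript
// Toy model only: two NameNodes, one active and one standby.
function makeCluster() {
  "use strict";
  return {
    nodes: {
      nn01: { role: 'active', fenced: false },
      nn02: { role: 'standby', fenced: false }
    }
  };
}

// Fail over from one NameNode to another. The old active is fenced
// before the standby is promoted, so two nodes can never both act
// as active at the same time (split brain).
function failover(cluster, from, to) {
  "use strict";
  var oldActive = cluster.nodes[from];
  var newActive = cluster.nodes[to];
  if (oldActive.role !== 'active' || newActive.role !== 'standby') {
    throw new Error('failover needs an active source and a standby target');
  }
  oldActive.fenced = true;    // stands in for the SSH or shell-script fence
  oldActive.role = 'stopped';
  newActive.role = 'active';
  return cluster;
}

// List the names of all nodes currently acting as active.
function activeNodes(cluster) {
  "use strict";
  return Object.keys(cluster.nodes).filter(function (name) {
    return cluster.nodes[name].role === 'active';
  });
}

var cluster = makeCluster();
failover(cluster, 'nn01', 'nn02');
console.log(activeNodes(cluster)); // [ 'nn02' ]
```

The invariant worth noticing is that `activeNodes` never returns more than one name, because the fence happens before the promotion.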
HBase
- HBase is a distributed, versioned, column-oriented, denormalized database #ToHUG
- Horizontally scalable for fast random r/w. HBase is backend for FB Messages. Either source or sink for Hadoop jobs. #ToHUG
- HBase is good for locality when used with Hadoop because data is stored near where it’s processed. #ToHUG
- HB table consists of regions which consist of 1+ column family. Regions are the storage unit. Really good for sparse dbs. #ToHUG
- Sparse=lots of empty fields=columns mostly empty=varying numbers of columns. #ToHUG
- @ianhakes Ha! How about I drop the iPad off in your old office (aka mine)? 😉 #ToHUG
- 1 column family is stored as 1 HFile. #ToHUG
- HBase CRUD=Put Get Scan Delete #ToHUG
- Google BigTable uses com-google, com-google-images, com-ibm, etc as keys for efficient scantables. Similar to what you want in HBase #ToHUG
- HMaster talks to HRegionServers contains HLog and HRegion contains MemStore contains HFile talks to DFS Client talks to DataNodes #ToHUG
- ZooKeeper(s) stores config, logs, determines HMaster for HMaster, clients #ToHUG
- HBase compaction=force write to disk. Relates to locality. #ToHUG
- If you’re smart, don’t integrate HBase and Hive. Hive is map-reduce SQL job, HBase is a database. Hive best for regular HDFS data #ToHUG
- HBase replication assumes column family exists in both clusters. No config required in child cluster. #ToHUG
- BTW, I’m subbing adjective “child” for adj “slave” as a stylistic preference. #ToHUG
- ZooKeeper quorum doesn’t have to be the same for HB replication. ZK quorum is an odd number of ZKs that agree on something. #ToHUG
- HBase Coprocessor Observers=database triggers #ToHUG
- HBase Co-pro Endpoints=stored procedures. In Java. Custom RPC protocol. Invoked by client on row or rowset. #ToHUG
- HBase security has auth per table, per column family, per column qualifier. Stored in _acl_ table. #ToHUG
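To illustrate what sparse, versioned, and key-ordered mean in the tweets above, here is a toy in-memory model. It is not the HBase client API: the function names and the `meta:title` style column names are assumptions for the sketch, and Delete is left out for brevity.

```javascript
// Toy model: table -> row key -> "family:qualifier" -> list of versions.
function makeTable() {
  "use strict";
  return { rows: {} };
}

// Put: store a new version of one cell. Only cells that are written
// take up space, which is why sparse data fits this layout well.
function put(table, rowKey, column, value, timestamp) {
  "use strict";
  var row = table.rows[rowKey] || (table.rows[rowKey] = {});
  var versions = row[column] || (row[column] = []);
  versions.push({ timestamp: timestamp, value: value });
  // Keep the newest version first.
  versions.sort(function (a, b) { return b.timestamp - a.timestamp; });
}

// Get: newest value of a cell, or undefined if it was never written.
function get(table, rowKey, column) {
  "use strict";
  var row = table.rows[rowKey];
  if (!row || !row[column]) { return undefined; }
  return row[column][0].value;
}

// Scan: sorted row keys that start with the given prefix.
function scan(table, prefix) {
  "use strict";
  return Object.keys(table.rows).sort().filter(function (key) {
    return key.indexOf(prefix) === 0;
  });
}

var t = makeTable();
put(t, 'com-google', 'meta:title', 'Google', 1);
put(t, 'com-google', 'meta:title', 'Google Search', 2);
put(t, 'com-google-images', 'meta:title', 'Images', 1);
put(t, 'com-ibm', 'stats:rank', 42, 1);

console.log(get(t, 'com-google', 'meta:title')); // Google Search
console.log(get(t, 'com-ibm', 'meta:title'));    // undefined
console.log(scan(t, 'com-google'));              // [ 'com-google', 'com-google-images' ]
```

Reversed-domain row keys keep related rows next to each other in sort order, so a prefix scan touches one contiguous range.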
July 18, 2012
New Features in Hadoop 2.0 session at ToHUG tonight

I’ll be attending the New Features in Hadoop 2.0 – HA, FNN, HBase Coprocessors info session tonight. The session is being organized by the Toronto Hadoop User Group. It’s at 7pm near King East and Parliament in Toronto.

Feel free to say hello if you see me there.

Cheers,

Leons

July 17, 2012
How to access web console from Greasemonkey userscripts
Userscripts are helper JavaScript programs that you can add to your browser to automate and optimize the web pages you visit. Greasemonkey is a Firefox extension to run userscripts. The web console is a tool built into Firefox and other browsers that can be helpful during userscript development.

How can you access the web console from Greasemonkey userscripts? First of all, you should not do this in production. However, logging to the web console can be a useful tool during development.

Here’s how you do it:
```
// Put this at the top of your userscript
var console = unsafeWindow.console;
```
Bonus: How to access your userscript’s jQuery from the web console
```
// Expose userscript's jquery to the web console
unsafeWindow.$ = $;
```
Again, do not do this in production.
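One way to keep development logging from leaking into production is to route it through a switch. This is a minimal sketch under that assumption. `DEBUG` and `makeLogger` are made-up names and not part of Greasemonkey.

```javascript
// Flip to false before releasing the userscript.
var DEBUG = true;

// Returns an object with a log() method. When `enabled` is false,
// log() does nothing, so stray calls are harmless in production.
function makeLogger(enabled, target) {
  "use strict";
  return {
    log: function () {
      if (enabled && target && typeof target.log === 'function') {
        target.log.apply(target, arguments);
      }
    }
  };
}

// In a Greasemonkey userscript the target would be the page's console:
//   var logger = makeLogger(DEBUG, unsafeWindow.console);
// Outside the browser, the ordinary console works for trying it out.
var logger = makeLogger(DEBUG, console);
logger.log('userscript loaded');
```

With this in place, turning logging off is a one-line change rather than a hunt for every `console.log` call.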
May 8, 2012
Userscript for faster deletion of MediaWiki spam
A couple weeks ago I posted a userscript that makes banning MediaWiki spammers easier by setting good defaults for the user ban form. Since then, I’ve had to ban a lot of spammers, so I thought I should remove another point of friction.

For some reason, MediaWiki chooses to not provide direct deletion links on the User Contributions page, so after banning a spammer you have to click through to each piece of spam before deleting it. This may have been an acceptable user experience in Web 1.0 days, but it’s a ridiculous set of hoops to jump through today.

My goal is to eventually make banning a spammer and deleting all the spam they’ve posted a one-click process. If there’s an existing solution for this, I’d love to hear about it.

Since I needed to do some DOM manipulation, I chose to use jQuery in the userscript. This was also a lot easier than I might have expected. Userscripts really are a very solid technology.
```
// ==UserScript==
// @name           Mediawiki - Fast delete on user contributions view
// @namespace      https://userscripts.org/users/457667
// @description    Adds a delete link for every page on the user contributions view
// @include        http://example.com/index.php/Special:Contributions/*
// @require        https://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js
// ==/UserScript==

// Add page delete links to the user contributions page
function enhanceHistory() {
  "use strict";

  // Find the history link for each revision
  $('#bodyContent ul li')
    .find('a[href*="action=history"]')
    .each(function(i) {
      // Append a page delete link after each history link
      var url = this.toString().replace('action=history', 'action=delete');
      $('<span> | </span><a href="' + url + '">del</a>').insertAfter($(this));
    });
}

enhanceHistory();
```
I passed the above through JSHint to make sure there’s nothing silly in it, but I haven’t consulted the jQuery style guide so it may not conform to the usual formatting.

The page delete links still lead to a standard confirmation form, so this doesn’t violate the RESTful practice of only using links for reading content rather than changing it.

Here’s a screenshot of what it adds to the user interface:
May 7, 2012

Userscript to make banning MediaWiki spammers easier

Somehow, I’ve come to be responsible for administering two MediaWiki-powered wikis. The main burden is having to ban spammers, which sometimes sign up in batches of 20 at a time.

To help with the process, I’ve put together the following browser userscript. On Firefox, you can easily set it up using the Greasemonkey extension. Opera and Chrome have their own facilities.

The script basically makes the default values on the user ban form sane, so I can just click through without fiddling with dropdowns and checkboxes. Obviously, the ban has to be permanent. Obviously, I don’t want spammers emailing anyone.

```
// ==UserScript==
// @name           Blocker
// @namespace      https://userscripts.org/users/457667
// @include        http://example.com/index.php/Special:Block/*
// ==/UserScript==

// Set default expiry to 'infinite' or 'indefinite', depending on MediaWiki version
function makeExpiryInfinite() {
  "use strict";

  // Get the element (the id differs between MediaWiki versions)
  var elExpiry = document.getElementById('wpBlockExpiry');
  if (!elExpiry) { elExpiry = document.getElementById('mw-input-wpExpiry'); }

  // Abort if element not found
  if (!elExpiry || !elExpiry.children) { return; }

  // Find the infinite option, using a numeric index so that
  // non-element properties of the collection are never visited
  var expiryNodes = elExpiry.children;
  var index = 0;
  for (var i = 0; i < expiryNodes.length; i++) {
    var label = expiryNodes[i].label;
    if (label === 'infinite' || label === 'indefinite') {
      index = i;
    }
  }

  // Set dropdown to the infinite option
  elExpiry.selectedIndex = index;
}

// Automatically prevent user from sending e-mail
function preventEmail() {
  "use strict";

  // Find check box (the id differs between MediaWiki versions)
  var elEmailBan = document.getElementById('wpEmailBan');
  if (!elEmailBan) { elEmailBan = document.getElementById('mw-input-wpDisableEmail'); }

  // Abort if it's not there
  if (!elEmailBan) { return; }

  // Check the box
  elEmailBan.checked = true;
}

// Automatically apply a hard block when banning by IP address
function reallyBanThatIP() {
  "use strict";

  // Find check box
  var elHardBlock = document.getElementById('mw-input-wpHardBlock');

  // Abort if it's not there
  if (!elHardBlock) { return; }

  // Check the box
  elHardBlock.checked = true;
}

makeExpiryInfinite();
preventEmail();
reallyBanThatIP();
```

I’ve been hearing wonderful things about userscripts for years, but this is the first one I’ve put together for myself. It’s actually very easy to write these, assuming you know JavaScript and have tools like Firefox’s Web Console and the Web Developer extension handy.

I’m planning to enhance it a little so that it handles the slightly different form for banning anonymous users, but I’m not sure if it makes sense to submit to any official repository. It helps with running small wikis that have open memberships, so there isn’t any one site I can identify it with. Obviously, it’s not suited for Wikipedia, as they have a very different set of problems.

Edit: Updated the script with better handling for banning anonymous users by IP address.
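The script repeats one pattern several times: look for an element under one id, then fall back to another id used by a different MediaWiki version. Here is a sketch of how that could be factored out. `findFirstById` and `checkBox` are hypothetical helpers and not part of the script above. They take the document as a parameter so they can be tried outside a browser.

```javascript
// Return the first element found for any of the given ids, or null.
// `doc` is anything with a getElementById method, normally `document`.
function findFirstById(doc, ids) {
  "use strict";
  for (var i = 0; i < ids.length; i++) {
    var el = doc.getElementById(ids[i]);
    if (el) { return el; }
  }
  return null;
}

// Tick a checkbox if any of its known ids is present on the page.
// Returns true if a box was found and checked.
function checkBox(doc, ids) {
  "use strict";
  var el = findFirstById(doc, ids);
  if (el) { el.checked = true; }
  return el !== null;
}

// In the userscript this would be called as, for example:
//   checkBox(document, ['wpEmailBan', 'mw-input-wpDisableEmail']);
```

Supporting another MediaWiki version then means adding an id to a list instead of writing another fallback line.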

April 20, 2012

Setting up a fresh Windows system
I’m setting up a new primary system and I thought I’d jot down some notes.

Ninite is probably the quickest way to install all the necessary software (Chrome, 7-zip, Dropbox, iTunes, Picasa, etc). You click the checkboxes and it rolls you a custom, hands-off installer. They upsell to an auto-update service, but there are free alternatives like FileHippo Update Checker.

Speaking of Dropbox, it proved a lifesaver. If you aren’t familiar with it, it’s a service that automatically syncs (and backs up) a folder between all your machines. My last hard drive failed, but because all my personal files are on Dropbox I didn’t lose any of them. They have a free 2GB account available, and if you join they’ll toss some extra free space my way as well.

I also set up a few Firefox extensions. With extensions, the goal is always to have as few as possible, as extensions have a history of slowing Firefox down. Here are the ones I chose:
- AdBlock Plus
- HTTPS Everywhere
- LastPass
- TabSubmit
- Tree Style Tab
- Xmarks
HTTPS Everywhere is of course a great security boon. LastPass is a secure cross-browser way to manage the hundreds of passwords we all have. Xmarks is a bookmarks synchronizer which I prefer over Firefox Sync because it’s cross-browser.

I have a few goals with the new system:
- Keep the desktop empty of files
- Keep all personal files in a single location (e.g. Dropbox)
- Keep all work files in a single location (e.g. Projects)
The last one might be the trickiest, as the different Eclipse-based IDEs I’ll need to install will each try to grab a workspace for themselves.
March 15, 2012