IBD-3475A Crunch Big Data in the Cloud with IBM BigInsights and Hadoop

I’m teaching a hands-on lab at Information on Demand 2013. I will edit the post to include lab materials closer to the date.

Session: IBD-3475A Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
Time: Thu, 7/Nov, 10:00 AM – 01:00 PM
Location: Mandalay Bay South Convention Center – Shorelines B Lab [Room 15]

First step

Please request a lab environment. We will use a Hadoop environment hosted in the cloud. Each attendee will be provided with a personal environment.

Lab materials

Machine learning with Mahout and Hadoop session

Tonight I attended a session about machine learning with Mahout at BNotions. The session was organized through the Toronto Hadoop User Group.
Quick Notes
  • BNotions uses Hadoop and Mahout for their Vu mobile app. Vu is a smart news reader that recommends articles based on article similarity to things you like as well as user similarity to you.
  • Graph theory and graph processing algos are helpful for this work.
  • Likes, dislikes, reads, and skips are the most important inputs for their machine learning. Also relevant: user preference for breadth of topics vs. depth; recency; natural language processing to extract topic keywords and organize topics by similarity.
  • Redis is used for transient storage. It has some useful ops beyond plain key-value. They use S3 as a data warehouse, but it could just as easily be HDFS.
  • They use Amazon EMR as the Hadoop cluster. EMR constrains technology choices. For example, it is harder to use HDFS, hence Redis instead. They are evaluating HBase as an alternative; performance differences are not relevant for their use case.
  • They don’t currently adjust for article length as a factor in recommendations.
  • They use a third-party API for NLP, not Hadoop specifically. It runs only once per article, so it is not a bottleneck yet. They are not happy with the NLP quality, though.
  • They use Cascalog/JCascalog to query the Hadoop data using Scala.
  • Scalability is limited by cost, not capability. They may switch from EMR to a dedicated cluster, etc. as cost grows.
  • Data science 10%, engineering 90%. Stock algos for rapid application development, tweak after. Deployment (my own specialty!) can be painful.
  • Service-oriented architecture (SOA) helps with deployment. It simplifies components, but adds a devops layer. Jenkins is used to automate builds.

Have bash warn you about uninitialized variables with set -u

By default, Bash treats uninitialized variables the same way as Perl — they are blank strings. If you want them treated more like Python, you can issue the following command in your bash script:

set -u

You will then start seeing error messages like the following, and a script that hits one will stop running at that line:

./my_script.sh: line 419: FOO_BAR: unbound variable

Note that this means you can’t check for the non-existence of environment variables with a simple [[ -z "$ENVIRONMENT_VARIABLE" ]], because expanding the unset variable is itself an error. Instead, you could do something like the following:

[[ $( set | grep -c "^ENVIRONMENT_VARIABLE=" ) -lt 1 ]]
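Another option that avoids parsing the output of set is shell parameter expansion, which stays legal under set -u. The sketch below is only an illustration: ENVIRONMENT_VARIABLE is the placeholder name from above, and the value "production" is made up. It uses single brackets so it also runs in a plain POSIX shell; the same expansions work inside [[ ]].

```shell
set -u

# Start from a known state: the placeholder variable is not set.
unset ENVIRONMENT_VARIABLE

# ${VAR:-} expands to an empty string when VAR is unset or empty,
# so the -z test no longer trips the "unbound variable" error.
if [ -z "${ENVIRONMENT_VARIABLE:-}" ]; then
    state_before="unset or empty"
else
    state_before="set"
fi
echo "before: $state_before"

# Hypothetical value, purely for illustration.
ENVIRONMENT_VARIABLE="production"

# ${VAR+x} expands to "x" only when VAR is set, even if it is set to "".
# Use this form when "set but empty" must count as existing.
if [ -n "${ENVIRONMENT_VARIABLE+x}" ]; then
    state_after="set"
else
    state_after="unset"
fi
echo "after: $state_after"
```

Which form to pick depends on whether an empty value should count as "missing" for your script: ${VAR:-} treats unset and empty alike, while ${VAR+x} only distinguishes set from unset.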

 

Set PuTTY defaults, permanently

PuTTY or one of its forks is a standard tool for administering Unix and Linux machines from Windows. It provides SSH connectivity for command line access, as well as keypair management for compatible programs like WinSCP.

Unfortunately, PuTTY has some terrible defaults. For example, it limits itself to 200 lines of scrollback by default, which guarantees that you’ll lose some history in most SSH sessions.

There’s a way to fix this and other defaults.

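First, load the “Default Settings” saved session: select it in the Saved Sessions list on the Session panel and click Load.

Then, configure the defaults as you like. For example, I’m increasing my lines of scrollback (under Window) from 200 to 20,000.

Then, save the new default settings: back on the Session panel, select “Default Settings” again and click Save.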

PuTTY will now have sensible defaults whenever you’re connecting to a random server.

Hardening WordPress against the ongoing brute-force attack

There’s an ongoing brute-force attack against WordPress and Joomla sites. The attack tries to brute-force the admin password. (Reddit)

I had to harden my WordPress some time ago. Here are the guides I followed when hardening my installation:

Additional steps I’ve taken today:

Alternatives to Gmail?

Now that I’ve moved from Google Reader to Fever, I’d like to reduce my reliance on other Google services. Switching from Google search to Bing is pretty easy, but I’m on much less sure ground when it comes to replacing Gmail.

Requirements:

  • Paid service (If you aren’t paying, you are the product, not the customer)
  • Search-driven interface
  • Reasonable limits on message and mailbox size

I’ve heard of HushMail. Is there anything else worthwhile?

Edit: HushMail is a no-go. It doesn’t have a way to set up a filter or rule to automatically file incoming mail.

Migrating from Google Reader to Fever

The perfidious vandals at Google will kill Google Reader on July 1, 2013. Accordingly, it is time to wean ourselves off Google dependence and find an alternative. Perhaps this will prove to be a good thing, as Google Reader has strangled RSS innovation through its monopolist, good-enough position much like IE6 once strangled the web.

NewsBlur and The Old Reader are two services I’ve seen mentioned. Unfortunately, both are currently buckling under the load of my fellow reader-heads fleeing the sinking Google ship. (Edit: More alternatives are listed in the roundups at Kikolani and LifeHacker.)

Accordingly, I’ve just installed Fever on my shared hosting. I’m not going to recommend my hosting provider as my account is based on a grandfathered plan, but Dreamhost is popular. The more technically inclined may want to spin up an Amazon EC2 instance.

Fever is a PHP/MySQL web application. It’s very easy to install, assuming you have access to a web server. It costs a one-time $30, which is likely why it is very easy to install. It also comes with lots of really neat features that innovate beyond what Google Reader ever did, none of which I care about.

Migrate from Google Reader to Fever

  1. Log into Google Takeout.
  2. Download your Google Reader data.
  3. Unzip it. The subscriptions.xml file contains your feeds and folders in standard OPML format.
  4. Download the Fever Server Compatibility Suite.
  5. Upload it to your server and let it verify compatibility.
  6. Is it compatible? Great! PayPal over the $30.
  7. Copy the activation code from the email in your inbox into the wizard.
  8. Let the wizard install Fever for you, importing your precious subscriptions.xml.
  9. Fever will display a brisk progress bar as it quickly processes your myriad feeds.
  10. Oh, you may want to enable the unread messages count:
Enable unread messages count

Voila!

My fever feeds
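If you want to sanity-check the export before handing it to the Fever wizard, you can list the feed URLs in subscriptions.xml from the command line. This is a rough sketch, not part of the Fever install: it relies on OPML storing each feed address in an xmlUrl attribute, and the sample file it writes is made up so the snippet runs on its own. Point it at your real subscriptions.xml instead. It also assumes a grep that supports -o (GNU and BSD grep both do).

```shell
# Work in a throwaway directory with a made-up, two-feed OPML file.
workdir=$(mktemp -d)
cat > "$workdir/subscriptions.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<opml version="1.0">
  <body>
    <outline title="Tech" text="Tech">
      <outline text="Example A" type="rss" xmlUrl="http://example.com/a.xml" htmlUrl="http://example.com/a"/>
      <outline text="Example B" type="rss" xmlUrl="http://example.org/b.xml" htmlUrl="http://example.org/b"/>
    </outline>
  </body>
</opml>
EOF

# Pull out every xmlUrl="..." attribute, then strip the attribute wrapper.
feed_urls=$(grep -o 'xmlUrl="[^"]*"' "$workdir/subscriptions.xml" \
    | sed 's/^xmlUrl="//; s/"$//')

# Count the feeds; tr removes the padding some wc implementations add.
feed_count=$(printf '%s\n' "$feed_urls" | wc -l | tr -d ' ')

echo "$feed_count feeds found:"
printf '%s\n' "$feed_urls"
```

A regex is a blunt tool for XML, but for a quick count to compare against what Fever reports after the import it is usually enough.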

Fix VPN hostname resolution by flushing your DNS cache

Sometimes when my VPN connection to work goes down, certain applications that rely on intranet servers (e.g. Lotus Notes, Lotus Sametime) become unable to reconnect to their servers even after I reconnect to VPN. This is due to the operating system’s DNS lookup cache reusing the failed hostname lookup from when VPN was down rather than doing a fresh hostname lookup now that there is a fresh VPN connection.

On Windows, you can fix the issue by opening up the Command Prompt as Administrator and running the following command:

ipconfig /flushdns

mkdir -p is your friend

mkdir -p is a command second only to touch in succinct utility.

touch creates a file if it does not exist, or updates its timestamp if it does. It’s handy if you want to write to a file without checking for its existence, as otherwise you’d need to determine whether or not append is the correct mode. It’s also handy for setting flags for yourself on the filesystem.

mkdir -p creates a path if it does not exist, or does nothing if the path already exists. mkdir -p /foo/bar/baz will create /foo, /foo/bar, and /foo/bar/baz for you. Likewise, mkdir -p /usr/local/bin will not complain even though those directories already exist, whereas plain mkdir would fail with a “File exists” error.

Why would you need this? A couple reasons that came up for me tonight:

  • You cannot redirect output to a file if the file is in a directory that does not yet exist
  • You cannot create a symbolic link in a directory that does not yet exist
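Both cases can be reproduced in a few lines. The sketch below works inside a temporary directory, and the paths logs/app/output.log and links/latest.log are invented for the example.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
log_file="$workdir/logs/app/output.log"

# Without the parent directories, the redirect fails.
# The braces let us silence the shell's own error message.
if { echo "first try" > "$log_file"; } 2>/dev/null; then
    first_try="ok"
else
    first_try="failed"
fi

# mkdir -p creates every missing directory in the path.
# Running it a second time is harmless.
mkdir -p "$(dirname "$log_file")"
mkdir -p "$(dirname "$log_file")"
echo "second try" > "$log_file"

# The same applies to a symbolic link: create its directory first.
mkdir -p "$workdir/links"
ln -s "$log_file" "$workdir/links/latest.log"

echo "first try: $first_try"
cat "$workdir/links/latest.log"
```

This prints "first try: failed" followed by "second try", the contents of the log file read back through the symlink.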