http://www.personalizemedia.com/media/socmedcounter.swf
Here’s an interesting example from the InfoSphere BigInsights/Hadoop class I’m attending right now. Note that Social, Mobile, Games, and Heritage are tabs that you can switch between.
My work for the past couple years has been to develop DB2 images and templates for various cloud platforms and to engage the DB2 community online. This is still the case, but increasingly I’m spending my time working with IBM InfoSphere BigInsights.
BigInsights is IBM’s distribution of Hadoop.
What’s Hadoop? It’s a great way to crunch through massive amounts of unstructured data — email archives, geographic data, economic measurements, and so on — to find interesting patterns. It rests on the MapReduce programming model, which Google popularized for building its search index. Much of Google’s success rests on MapReduce’s ability to scale out on commodity hardware.
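To make the idea concrete, here’s a toy single-machine sketch of the MapReduce pattern — map each input record to key/value pairs, then group and reduce by key. This is illustrative Java, not actual Hadoop API code:

```java
import java.util.*;

public class WordCount {
    // Toy MapReduce: the "map" phase emits (word, 1) pairs; merging
    // into the map by key plays the role of the shuffle + reduce phases.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {                          // map phase
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);     // reduce phase
                }
            }
        }
        return counts;
    }
}
```

In real Hadoop the map and reduce steps run as separate tasks spread across the cluster, which is where the scale-out comes from.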
(Notably, the whole Cloud Computing thing is the flip side of using massive arrays of commodity hardware. Since you have so much of it, you need a way to automate and abstract the management as much as possible. Since you’ve automated and abstracted away management, you might as well sell it as a service.)
Hadoop itself is an Apache Software Foundation project, nurtured by Yahoo among others. It has a growing number of commercial distributions, including Cloudera, IBM, and now Hortonworks.
You can quickly try out BigInsights Basic on the cloud or download it to your own machine.
I really do recommend the macro that
[10:05] At Hack/Reduce. It’s at a nice loft-style office space downtown.
[10:10] Installing git, msysGit, Tortoisegit, and also the git package in cygwin on the off chance one of them proves useful.
[10:30] They have Bixi usage, ocean carbon measurements, a Wikipedia dump and other sample datasets.
Hey, wasn’t there an interesting theorem in an XKCD comic recently about Wikipedia…
Wikipedia trivia: if you take any article, click on the first link in the article text not in parentheses or italics, and then repeat, you will eventually end up at “Philosophy”. – XKCD
[10:45] Some guy: I want to work on calculating the Philosophy-distance for every article on Wikipedia.
[10:50] Me too. :P
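Given a map from each article to its first link, the Philosophy-distance is just a matter of following links until you hit Philosophy (or a dead end or a loop). A minimal sketch — the `firstLink` map is hypothetical, standing in for whatever the MapReduce job over the Wikipedia dump would produce:

```java
import java.util.*;

public class PhilosophyDistance {
    // firstLink maps an article title to the target of its first
    // non-parenthesized, non-italicized link (assumed precomputed).
    // Returns the number of hops to "Philosophy", or -1 for a
    // dead end or a cycle that never reaches it.
    static int distance(Map<String, String> firstLink, String start) {
        Set<String> seen = new HashSet<>();
        String cur = start;
        int steps = 0;
        while (!"Philosophy".equals(cur)) {
            if (cur == null || !seen.add(cur)) return -1; // dead end or loop
            cur = firstLink.get(cur);
            steps++;
        }
        return steps;
    }
}
```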
[11:23] Installing Eclipse, either Gradle or Ant or both, and for the hell of it a Windows PATH editor.
[11:38] OK, OK, downloading Ubuntu. I have VMWare Workstation so I’m going to skip VirtualBox.
[11:39] On further thought, I already have 2 Ubuntu VMs on this computer that I haven’t used in a year or two. I’ll just use one of them.
[12:31] I’m team 14 running stuff on cluster 4. Working solo, since that’s the way I roll. 😛
[12:35] Contrary to what the organizers said, I have it running on Windows with no problems. I’ll post instructions for getting it working on Windows as I go.
[13:26] A few notes for running it on Windows:
[13:34] Yum, free pizza
[14:10] OH HAI GUYZ. I HEAR U LIEK REGEX. HEERS SUM REGEX.
A (likely incomplete) regular expression for finding out the target of an internal Wikipedia link:
Notes:
Let’s decompose the line noise:
Oh, and since this is Java, we have to double up the escape slashes:
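A sketch of what such a pattern might look like — this is my reconstruction, not necessarily the exact regex used above: internal Wikipedia links have the form `[[Target]]`, `[[Target|label]]`, or `[[Target#Section]]`, so the target is everything up to the first `|`, `#`, or `]]`. With the backslashes doubled for Java string literals:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WikiLink {
    // Captures the target of an internal link in group(1).
    // Likely incomplete: it ignores interwiki prefixes, nested
    // templates, and links inside parentheses or italics.
    static final Pattern LINK =
        Pattern.compile("\\[\\[([^\\]|#]+)(?:[#|][^\\]]*)?\\]\\]");

    // Returns the target of the first internal link, or null if none.
    static String firstTarget(String wikitext) {
        Matcher m = LINK.matcher(wikitext);
        return m.find() ? m.group(1) : null;
    }
}
```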
[14:50] After a long while of running my job on my local machine, I’m now running it on the cluster against the full dataset. Whee!
[15:51] Continuing to work at it. Talked the idea over with one of the mentors, who suggested this approach:
[16:31] I think I should wrap this up soon.
[16:34] Here’s a list of all Wikipedia articles that are 1 link away from Philosophy.