Datathon For Diabetes in Boston

This weekend Brandon and I are at the Datathon for Diabetes in Boston. It starts tonight at 5 and goes all day Saturday. The goal is to use publicly available data to generate an insightful and innovative analysis of diabetes in the United States and abroad.

FitBit Charge HR prize at Datathon for Diabetes

We’re sponsoring a prize for the team that makes the best use of Data Scientist Workbench in their solution. Novo Nordisk and Deloitte are also sponsoring a prize each.

Our prize consists of a FitBit Charge HR for each member of the winning team.
I think it’s worthwhile to learn Spark and apply it to the problem of diabetes. Spark is an open source framework that lets you run your data analysis in parallel on multiple machines, both for speed and to handle large amounts of data.
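
To make the "in parallel" idea concrete, here is a minimal sketch in plain Python. It is not Spark and does not use Spark's API. It only mimics the pattern of splitting data into partitions, computing a partial result for each one independently, and combining the partials. The function names and the sample readings are made up for illustration, and threads on one machine stand in for the many machines a cluster would use.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_parts):
    """Split records into n_parts chunks by taking every n-th element."""
    return [records[i::n_parts] for i in range(n_parts)]


def partial_stats(chunk):
    """Work done independently on each chunk: a (sum, count) pair."""
    return (sum(chunk), len(chunk))


def parallel_mean(records, n_parts=4):
    """Compute a mean by combining per-chunk partial results."""
    chunks = partition(records, n_parts)
    # Each chunk is processed by its own worker; a cluster framework
    # would send each chunk to a different machine instead.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(partial_stats, chunks))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Made-up values, for illustration only (not real patient data).
readings = [5.1, 6.3, 7.8, 5.9, 6.0, 9.2, 4.8, 7.1]
mean = parallel_mean(readings)
```

The useful property is that `partial_stats` never needs to see the whole data set, which is what lets a framework like Spark spread the work out.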

Data Scientist Workbench has Spark ready to use with Python, Scala, and R in Jupyter, Zeppelin, and R Studio IDE.

If you run into trouble at the datathon, come up and ask me any question you like. I’ll be there for the duration as a mentor. As always, if you run into a Data Scientist Workbench issue, you should also open a support ticket.

Open a Data Scientist Workbench support ticket for any issues at Datathon for Diabetes

Other events

May 11-12 is Datapalooza Beijing and May 19 is Datapalooza Denver. Also, Big Data University is now posting events on its Facebook page.

Datathon for Diabetes, Boston

Ottawa: Data Day 3.0 at Carleton

On Tuesday March 29, I’ll be demoing Data Scientist Workbench (DSWB) at Data Day 3.0 for the Carleton University Institute of Data Science.

I’m in Ottawa the weekend before, so feel free to ping me and connect. I’m on Twitter as @leonsp.

Data Scientist Workbench hosts open source data science tools for you for free. The tools include Jupyter and Zeppelin notebooks for developing and documenting your algorithms, R Studio IDE for focusing on your R code, and OpenRefine for cleaning your data.

Data Day 3.0 takes place at Carleton University

The event is organized by the Carleton University Institute of Data Science in Ottawa. It runs from 8am to 3:30pm in the River building. You can find more details on their event page.


Spark Summit East 2016

Next week I’ll be demoing Data Scientist Workbench at Spark Summit East (official site) in New York. Polong Lin will be there with me. Come by the expo floor next Wednesday and Thursday and chat with us.

Data Scientist Workbench is what my team builds. It hosts open source data science tools like Jupyter, OpenRefine, R Studio IDE, Zeppelin and others for you. There’s exciting stuff in the changelog every week.

I signed up in time to get into a training session at Spark Summit East, so I’ll be spending my Tuesday working with the Wikipedia data sets. In today’s industry jargon, I’m more of a data engineer than a data scientist, so I’m hoping my Spark skills are up to the level needed for the advanced course.

This week I’m at Datapalooza Seattle, which is a good opportunity to brush up and expand those same Spark skills. In fact, we just posted the Day 1 challenge for Datapalooza. If you’re following along at home, fire up your Data Scientist Workbench, open a Jupyter notebook, and give it a try.

Spark Summit East


Datapalooza Seattle on Feb 9-11

On February 9 through 11, I’ll be mentoring hackers and budding data scientists at Galvanize during Datapalooza Seattle. It should be a great conference covering topics like machine learning, natural language processing, and data engineering infrastructure.

Last year’s Datapalooza in San Francisco was a fantastic event with lots of in-depth sessions. I was impressed with the range of material on data science and data engineering. The upcoming Datapalooza Seattle looks equally fascinating.

My team at work runs Data Scientist Workbench, which is a free hosted suite of open source tools including Jupyter, Zeppelin, R Studio IDE, and OpenRefine. We also organize free data science education through Big Data University.

I’m expecting Antonio Cangiano, Polong Lin, and Leon Katsnelson to be at Datapalooza with me as fellow mentors.

Let me know if you’re in Seattle at the same time and we’ll connect.

Datapalooza Seattle

Adobe password breach as the world’s greatest crossword puzzle

Adobe was recently breached and 150,000,000 user accounts were stolen. Adobe was following one of the worst practices of password storage: reversible encryption, rather than hashing with a salt using a good, slow algorithm like bcrypt. A very, very old throwaway password of mine was among those leaked.
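
For contrast, here is a minimal sketch of salted, deliberately slow hashing. It uses PBKDF2 from Python's standard library as a stand-in for bcrypt, which is a separate package. The iteration count, salt length, function names, and sample password are assumptions for illustration, not a recommendation tuned for any real system.

```python
import hashlib
import hmac
import os

# Assumed work factor for this sketch; a real deployment would tune it.
ITERATIONS = 200_000


def hash_password(password, salt=None):
    """Return (salt, digest), using a fresh random salt unless one is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password, salt, expected):
    """Re-derive the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


# Two users who pick the same password still get different stored values,
# because each one gets a different random salt.
salt_a, hash_a = hash_password("example-password")
salt_b, hash_b = hash_password("example-password")
```

Because the stored values differ per user, an attacker cannot group accounts that share a password, and there is no key that decrypts everything at once.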

XKCD has referred to this breach as The Greatest Crossword Puzzle in the History of the World!

It was bound to happen eventually. This data theft will enable almost limitless []-style password reuse attacks in the coming weeks. There's only one group that comes out of this looking smart: Everyone who pirated Photoshop.

With the help of LastPass’ Has Adobe Leaked My Password, let me illustrate why:

The following hints have been used by other people that share your password. This information could be used to determine your password as well.

  • Life, Universe, Everything
  • life?
  • DA
  • h2g2
  • hitchiker’s guide to the galaxy
  • yes
  • meaningoflife
  • theusual
  • everything
  • hitchhiker
  • dolphins
  • gta
  • a4
  • answer
  • meaning?
  • life
  • the answer
  • the question of life
  • meaning of life
  • the usual
  • life..
  • life the universe and everything
  • a2lae
  • the ultimate
  • Hitchhiker
  • What’s the answer?
  • hitchhikers?
  • Life the Uni and Every
  • life meaning and flower
  • common
  • douglas adams
  • a?
  • maiden
  • lotr no #
  • Adams question
  • Hitchhiker’s Guide
  • answer?
  • question
  • Life Meaning
  • adams
  • life universe everything
  • the number
  • towel
  • typical
  • The Usual
  • How many roads must a man walk down?
  • Life, the universe, and everything
  • What is the meaning of life, the universe and all?

Would you care to guess what password the naive, young me used for Adobe?

Next steps

IBD-3475A Crunch Big Data in the Cloud with IBM BigInsights and Hadoop

I’m teaching a hands-on lab at Information on Demand 2013. I will edit the post to include lab materials closer to the date.

Session: IBD-3475A Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
Time: Thu, 7/Nov, 10:00 AM – 01:00 PM
Location: Mandalay Bay South Convention Center – Shorelines B Lab [Room 15]

First step

Please request a lab environment. We will use a Hadoop environment hosted in the cloud. Each attendee will be provided with a personal environment.

Lab materials

Posting more frequently

I’ve been really enjoying Rafe Colburn’s technical blog since he made his pledge to post more frequently. It makes a lot of sense for a technical blog to also have linkblogging with brief commentary within the same stream of content. I would argue that the appeal of sites like Reddit and Hacker News relates to people doing the same en masse.

Naturally, I’ve also been doing some techie linkblogging on my Twitter account.


On November 6, 2012, I’m teaching a hands-on lab at CASCON together with Bradley Steinfeld and Marius Butuc. The lab is called Crunching Big Data with Hadoop and BigInsights in the Cloud. The lab is based on the Hadoop Fundamentals course at Big Data University.


1.0 Welcome
1.1 What is Big Data?
1.2 Lab Setup
Setup Lab
Setup Lab (PDF Download)
1.3 What is Hadoop?
1.4 Hadoop Architecture – HDFS
Lab (PDF Download)
1.5 Hadoop Architecture – MapReduce
MapReduce Lab
Lab (PDF Download)


1.6 Pig, Hive, and Jaql
Pig, Hive, and Jaql Lab
Lab (PDF Download)
1.8 Working with BigInsights
Web Console Lab
Web Console Lab (PDF Download)
Data Discovery with BigSheets

Module 1.7 covers Flume. It’s available for free on Big Data University.

Dehacking this blog

The first rule of security is to, of course, assume everything is compromised. If some code is compromised, everything is compromised. The correct response to a hacked WordPress is to nuke all the code.

My WordPress installation was recently compromised. There’s a limit to how far I can apply the principle because this particular WordPress is currently on shared hosting, but all code I have access to is now nuked. WordPress has been reinstalled from scratch, and all the various hanger-on sites that had accumulated in the same hosting account are now no more.

I’ve also adopted the pertinent steps from My WordPress Site Was Hacked, Hardening WordPress, and the Ultimate Security Checker plugin (guide).

Last line of defense:

grep -R base64_decode .
grep -R gzinflate .

The attack’s objective was to inject PHP code into various pages. The code was obfuscated via a double pass through those two functions. The two shell commands above will show any instances of those two functions.
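
The same check can be sketched in Python for anyone who would rather get the results as a list to review. This is only an illustration: the function name `find_suspicious` is made up, and it flags every file that mentions either function, so legitimate code that calls them will also show up and needs a manual look.

```python
import os

# The two PHP function names that the injected code was wrapped in.
SUSPICIOUS = (b"base64_decode", b"gzinflate")


def find_suspicious(root):
    """Walk root and return (path, [matched names]) for each file with a hit."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    data = handle.read()
            except OSError:
                # Unreadable file: skip it rather than abort the scan.
                continue
            found = [marker.decode() for marker in SUSPICIOUS if marker in data]
            if found:
                hits.append((path, found))
    return hits
```

Calling `find_suspicious(".")` from the WordPress directory covers hidden files too. Files that match both names are the first ones to inspect, since the injected code used the two functions together.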