I’ll be teaching two hands-on labs at Insight 2015 in Las Vegas:
LCD-3459 Introduction to Data Science
Data science is a very popular job profile and in great demand in a wide variety of industries. You no longer need a Ph.D. in mathematics or statistics to become a data scientist. Any data professional can upgrade their skills and study data science. This lab introduces the basic concepts of data science and provides hands-on examples to help you apply these concepts.
LCD-3479 Fundamentals of Spark
Spark is one of the most important technologies for big data analytics and Spark skills are in great demand. This lab session introduces Spark fundamentals and applies the concepts using hands-on examples with a Spark cluster in cloud. You can also download a Docker image to your own laptop and run the lab projects there.
I’ll be in San Francisco this weekend helping run the Apache Spark hackathon, and afterwards I’ll be at Spark Summit 2015.
If you’re curious at all about Spark, you should come out and hack with us. We’ll have some fun data sets and help you find a team.
You can take the free Spark Fundamentals course on Big Data University to brush up on your Spark skills. Spark is a framework for fast in-memory and batch analytics processing. It’s algorithmically smarter and so a lot faster than traditional Hadoop.
“Apache Spark is an open source processing engine built around speed, ease of use, and analytics. If you have large amounts of data that requires low latency processing that a typical Map Reduce program cannot provide, Spark is the alternative. Spark performs at speeds up to 100 times faster than Map Reduce for iterative algorithms or interactive data mining.”
Tonight I attended a session about machine learning with Mahout at BNotions. The session was organized through the Toronto Hadoop User Group.
BNotions uses Hadoop and Mahout for their Vu mobile app. Vu is a smart news reader that recommends articles based on article similarity to things you like as well as user similarity to you.
Graph theory and graph processing algos are helpful for this work.
Likes, dislikes, reads, skips are the most important input for their machine learning. Also relevant: user preference for breadth of topics vs depth; recency; natural language processing to extract topic keyword and organize topics by similarity.
Redis is used for transient storage. It has some useful ops above just key-value. They use S3 as a data warehouse, but it could just as easily be HDFS.
They use Amazon EMR as the Hadoop cluster. EMR constrains technology choice. For example, harder to use HDFS, hence Redis instead. They are evaluating HBase as an alternative — performance differences not relevant for use case.
They don’t currently adjust for article length as factor in recommendations.
They use a third party API for NLP, not Hadoop specidically. Only once per article, so not a bottleneck yet. Not happy with NLP quality, though.
Cascalog/JCascalog to query the Hadoop data using Scala.
Scalability is limited by cost, not capability. May switch from EMR to dedicated cluster, etc as cost grows.
Data science 10%, engineering 90%. Stock algos for rapid application development, tweak after. Deployment (my own specialty!) can be painful.
Service-oriented architecture (SOA) helps with deployment. Simplifies components, but adds a devops layer. Jenkins is used to automate builds.
The workshop will be an all day hands-on introduction to Hadoop, HDFS, MapReduce, Hive, and JAQL. The plan is to have ready Hadoop clusters running in the cloud for the various exercises.
Hadoop is a parallelized data processing framework. It lends itself very nicely to running in cloud environments like Amazon EC2 and IBM SCE, as the core concept is to split sophisticated queries across clusters of commodity hardware. On a basic level it’s an implementation of MapReduce in Java, but a great many tools in its eco system make it easy to formulate and execute queries on the fly.
The material will have some things in common with the free Hadoop Fundamentals course you can take onÂ Big Data University today, though naturally adapted for the CASCON themes and with added hands-on instruction.