Machine learning with Mahout and Hadoop session


Tonight I attended a session about machine learning with Mahout at BNotions. The session was organized through the Toronto Hadoop User Group.

Quick Notes

BNotions uses Hadoop and Mahout for their Vu mobile app. Vu is a smart news reader that recommends articles based on both article similarity (to things you like) and user similarity (to you).
Graph theory and graph processing algos are helpful for this work.
Likes, dislikes, reads, and skips are the most important inputs for their machine learning. Also relevant: user preference for breadth of topics vs depth; recency; natural language processing to extract topic keywords and organize topics by similarity.
Redis is used for transient storage. It has some useful operations beyond plain key-value. They use S3 as a data warehouse, but it could just as easily be HDFS.
They use Amazon EMR as the Hadoop cluster. EMR constrains technology choice. For example, it is harder to use HDFS there, hence Redis instead. They are evaluating HBase as an alternative; the performance differences are not relevant for their use case.
They don’t currently adjust for article length as a factor in recommendations.
They use a third-party API for NLP, not Hadoop specifically. It runs only once per article, so it is not a bottleneck yet. They are not happy with the NLP quality, though.
They use Cascalog/JCascalog to query the Hadoop data using Scala.
Scalability is limited by cost, not capability. They may switch from EMR to a dedicated cluster, etc., as costs grow.
Data science 10%, engineering 90%. Stock algos for rapid application development, tweak after. Deployment (my own specialty!) can be painful.
Service-oriented architecture (SOA) helps with deployment. Simplifies components, but adds a devops layer. Jenkins is used to automate builds.
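
To make the user-similarity idea from the notes a bit more concrete, here is a minimal sketch in plain Python. It is not BNotions' implementation and it does not use Mahout's API; it only illustrates the general technique of user-based collaborative filtering with cosine similarity. The signal weights, function names, and sample events are all made up for illustration.

```python
from math import sqrt

# Hypothetical weights for the four feedback signals named in the notes.
# The real weighting used by Vu was not given in the session.
SIGNAL_WEIGHTS = {"like": 1.0, "read": 0.5, "skip": -0.5, "dislike": -1.0}


def build_ratings(events):
    """Turn (user, article, signal) events into {user: {article: score}}."""
    ratings = {}
    for user, article, signal in events:
        ratings.setdefault(user, {})[article] = SIGNAL_WEIGHTS[signal]
    return ratings


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    shared = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(ratings, target, top_n=3):
    """Rank unseen articles by the similarity-weighted scores of other users."""
    seen = ratings.get(target, {})
    scores = {}
    for other, other_ratings in ratings.items():
        if other == target:
            continue
        sim = cosine(seen, other_ratings)
        if sim <= 0:
            continue  # ignore users who are dissimilar to the target
        for article, score in other_ratings.items():
            if article not in seen:
                scores[article] = scores.get(article, 0.0) + sim * score
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [article for article, score in ranked[:top_n] if score > 0]


# Made-up sample data.
events = [
    ("ann", "hadoop-intro", "like"),
    ("ann", "redis-tips", "read"),
    ("bob", "hadoop-intro", "like"),
    ("bob", "mahout-basics", "like"),
    ("bob", "celebrity-news", "skip"),
    ("cat", "hadoop-intro", "dislike"),
    ("cat", "celebrity-news", "like"),
]
ratings = build_ratings(events)
print(recommend(ratings, "ann"))  # ['mahout-basics']
```

Article similarity can be sketched the same way by transposing the data to {article: {user: score}} and comparing articles instead of users. At scale this kind of computation is what one would hand to a library such as Mahout on a Hadoop cluster rather than run in a single process.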
