Guest post: Day 1 at the O’Reilly Strata Conference

February 2, 2011
9:34 am
by Andy Kirk

This is a guest/cross-post by Jan Willem Tulp, the winner of my recent contest to win a full pass to the O’Reilly Strata conference. The conference is taking place this week and Jan has kindly offered to share a short summary on each day of the conference. You can find out more about Jan’s work via his blog and follow him on Twitter @JanWillemTulp.

Day 1 at O’Reilly Strata Conference: Data Bootcamp
Today was my first day at O’Reilly Strata Conference: a full day of tutorial sessions. The session I picked was the Data Bootcamp by Joseph Adler (LinkedIn), Hilary Mason (bit.ly), Drew Conway (New York University) and Jake Hofman (Yahoo!). The purpose of this bootcamp tutorial was to turn everybody in the room into data scientists by getting our hands dirty with some real hands-on experience.
The tutorial was kicked-off with an introduction of the speakers, and a general overview of the various aspects of working with data: getting data, cleaning data, applications of data intensive applications, and much more. Then Drew gave an interactive introduction in visualizing data using Python and R. The audience had to produce a normal-distribution of random numbers in R. And although some people managed to get along with all the examples, there were also lots of people struggling due to the fact that libraries were missing, or simply for the fact that everything was going pretty fast, at least for R and Python newbies like myself.
Next Jake gave an great introduction into image processing, and especially how you can cluster images based on similar features, color in our case. We used a K-Means clustering algorithm to cluster similar images based on color, and after that we classified images, whether they were images of landscapes or head-shots.
After the bream Hilary took over with a great presentation of working with text-data. Starting with some basic examples of extracting data from webpages using command-line commands like curl and wget, and using Python and the BeautifulSoup Python library. After that we turned to the main example: ‘hacking’ a gmail account, and try to get some valuable information out of it. Hilary showed us how to classify email using probability statistics, and then Drew took over to show us how to visualize this data and turn it into network diagrams.
Last but not least Joseph gave a talk about Big Data. This was not an interactive session. Joseph shared some of his knowledge and experience of working with big data at LinkedIn, and explained the basics of Map/Reduce, Hadoop, and why and when to start thinking about big data solutions like Hadoop.
Overall it was an interesting day, also because I’ve met really great people. For me the Data Bootcamp was especially a inspirational tutorial with lots of ideas to try out on my own. For some people tempo tempo was a little to high, especially if you’ve never programmed R or Python before. And becoming a Data Scientist in just 1 day may be an illusion anyway. At least the tutorial gave me a good head start, lots of inspiration, and great learnings of how the presenters approach working with data. So for me, this was a great and successful first day, and I’m looking forward to the next two days!
The source code and slides of the Data Bootcamp are available online at: https://github.com/drewconway/strata_bootcamp