O’Reilly Strata Conference, Santa Clara 2012 (20% reader discount)

December 9, 2011
9:02 am
by Andy Kirk

Once again it’s my pleasure to be supporting the third O’Reilly Strata Conference, which will be taking place in Santa Clara, California between February 28 and March 1 (don’t forget it’s a leap year!). Described as ‘the home of data science’, the conference brings together some of the very best developers, data scientists, data analysts, CIOs and IT professionals who are driving the data revolution.

As with previous events (in September and the launch event in February) Visualising Data readers can benefit from a 20% discount on their registration fees, using the code VIZDATA.
So why should you attend? Quite simply, there is a fantastic line-up of speakers across all topic tracks, but especially the ‘Visualization and Interface’ one, with names like Noah Iliinsky, Simon Rogers, Jock Mackinlay, Ben Goldacre, Pete Warden, Hal Varian and Max Gadney found within the excellent schedule. Here’s my proposed timetable…

Tuesday, 28th February

9:00 Designing Data Visualizations Workshop

Noah Iliinsky (Complex Diagrams)
We will discuss how to figure out what story to tell, select the right data, and pick appropriate layout and encodings. The goal is to learn how to create a visualization that conveys appropriate knowledge to a specific audience (which may include the designer).
We’ll briefly discuss tools, including pencil and paper. No prior technology or graphic design experience is necessary. An awareness of some basic user-centered design concepts will be helpful.
Understanding of your specific data or data types will help immensely. Please do bring data sets to play with.

13:30 (1) Hands-on Visualization with Tableau

Jock Mackinlay (Tableau Software), Ross Perez (Tableau Software)
Data has always been a second class citizen on the web. As images, then audio, then video made their way onto the internet, data was always left out of the party, forced into dusty Excel files and ugly HTML tables. Tableau Public is one of the tools aiming to change that by allowing anyone to create interactive charts, maps and graphs and publish them to the web, with no programming required.
In this tutorial you will learn why data is vital to the future of the web, how Tableau Public works, and gain hands-on experience with taking data from numbers to the web.
Through three different use cases, you will learn the capabilities of the Tableau Public product. The tutorial will conclude with an extended hands-on session covering the visualization process from data to publishing. Topics covered will include:

constructing a data set for best performance
formatting visualizations to match your preferred branding
designing charts for clear communication and impact

13:30 (2) The Craft of Data Journalism

Simon Rogers (Guardian)
Learn first hand from award-winning Guardian journalists how they mix data, journalism and visualization to break and tell compelling stories: all at newsroom speeds.

Wednesday, 29th February

8:45 Plenary

Welcome Edd Dumbill (O’Reilly Media, Inc.), Alistair Croll (Bitcurrent)
The Apache Hadoop Ecosystem Doug Cutting (Cloudera)
Decoding the Great American ZIP myth Abhishek Mehta (Tresata)
A Big Data Imperative: Driving Big Action Avinash Kaushik (Market Motive)
Keynote by Ben Goldacre Ben Goldacre (Bad Science)

10:40 (1) Dealing With Messy Data

Q Ethan McCallum (independent)
Welcome to data science’s dirty little secret: data is messy, and it’s your problem.
It’s bad enough that data comes from myriad sources and in a dizzying variety of formats. Malformed files, missing values, inconsistent and arcane formats, and a host of other issues all conspire to keep you away from your intended purpose: getting meaningful insight out of your data. Before you can touch any algorithms, before you feed any regressions, you’re going to have to roll up your sleeves and whip that data into shape.
Q Ethan McCallum, technology consultant and author of Parallel R (O’Reilly), will explore common pitfalls of this data munging and share solutions from his personal playbook. Most of all, he’ll show you how to do this quickly and effectively, so you can get back to the real work of analyzing your data.
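To make the kind of munging described above concrete, here is a minimal Python sketch. It is not taken from the talk: the sample rows, the list of date formats and the set of "missing" markers are all hypothetical, chosen only to illustrate inconsistent formats, missing values and malformed entries.

```python
from datetime import datetime

# Hypothetical raw rows: mixed date formats, stray whitespace, missing values.
RAW_ROWS = [
    {"date": "2012-02-28", "visits": " 120 "},
    {"date": "28/02/2012", "visits": "N/A"},
    {"date": "Feb 29, 2012", "visits": "1,045"},
    {"date": "not a date", "visits": "17"},
]

# Assumed formats and missing-value markers; real data will need its own lists.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y")
MISSING = {"", "n/a", "na", "null", "-"}


def parse_date(text):
    """Try each known format in turn; return None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def parse_count(text):
    """Strip whitespace and thousands separators; return None if missing or malformed."""
    cleaned = text.strip().replace(",", "")
    if cleaned.lower() in MISSING:
        return None
    try:
        return int(cleaned)
    except ValueError:
        return None


def clean(rows):
    """Split rows into usable records and rejects, so nothing is dropped silently."""
    good, rejected = [], []
    for row in rows:
        date, visits = parse_date(row["date"]), parse_count(row["visits"])
        if date is None or visits is None:
            rejected.append(row)
        else:
            good.append({"date": date, "visits": visits})
    return good, rejected


good, rejected = clean(RAW_ROWS)
```

Keeping the rejected rows alongside the clean ones is a deliberate choice here: it lets you inspect what failed and extend the format lists, rather than losing data without noticing.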

10:40 (2) Science of Visualization

Jock Mackinlay (Tableau Software)
Visual analysis is an iterative process for working with data that exploits the power of the human visual system. The formal core of visual analysis is the mapping of data to appropriate visual representations.
In this talk, you’ll learn:

What years of research by psychologists, statisticians and others have taught us about designing great visualizations
Fundamental principles for designing effective data views for yourself and others
How to systematically analyze data using your visual system

11:30 Effective Data Visualization

Hjalmar Gislason (DataMarket)
Data visualization is often where people realize the real value in underlying data. Good data visualization tools are therefore vital for many data projects to reach their full potential.
Many companies have realized this and are looking for the best solutions to address their data visualization needs. There are plenty of tools to choose from, but even for relatively simple charting, many have found themselves with limited options. As the requirements pile up, the options narrow further: cross-browser compatibility, server-side rendering, iOS support, interactivity, full control of branding, look and feel … and you’ll find yourself compromising, or – worse yet – building your own visualization library!
Building our data publishing platform – DataMarket.com – we’ve certainly been faced with the aforementioned challenges. In this session we’ll share our findings and approach so that others can avoid our mistakes and learn from our sometimes hard lessons.
We’ll also share what we see the future of online data visualization holding: the technologies we’re betting on and how things will become easier, visualizations more effective, code easier to maintain and applications more user friendly as these technologies mature and develop.

13:30 Building a Data Narrative: Discovering Haight Street

Jesper Andersen (Bloom Studios)
Data isn’t just for supporting decisions and creating actionable interfaces. Data can create nuance, giving new understandings that lead to further questioning rather than just actionable decisions. In particular, curiosity and creative thinking can be driven by combining different data sets and techniques to develop a narrative around a set of data sets that tells the story of a place: the emotions, history, and change embedded in the experience of the place.
In this session, we’ll see how far we can go in exploring one street in San Francisco, Haight Street, and see how well we can understand its geography, ebbs and flows, and behavior by combining as many data sources as possible. We’ll integrate basic public data from the city, street and mapping data from OpenStreetMap, real estate and rental listings data, data from social services like Foursquare, Yelp and Instagram, and analyze photographs of streets from mapping services to create a holistic view of one street and see what we can understand from this. We’ll show how you can summarize this data numerically, textually, and visually, using a number of simple techniques.
We’ll cover how traditional data analysis tools like R and NumPy can be combined with tools more often associated with robotics like OpenCV (computer-vision) to create a more complete data set. We’ll also cover how traditional data visualization techniques can be combined with mapping and augmented reality to present a more complete picture of any place, including Haight Street.

14:20 Crafting Meaningful Data Experiences

Bitsy Bentley (GfK Custom Research)
I am frequently asked for advice about using data visualization to solve communication problems that are better served through improved information architecture. A nicely formatted bar chart won’t rescue you from a poorly planned user interface. When designing meaningful data experiences it’s essential to understand the problems your users are trying to solve.
In this case, I was asked to take a look at a global data-delivery platform with a number of issues. How do we appeal to a broad cross-section of business users? How do we surface information to our clients in a useful way? How do we facilitate action, beyond information sharing? How do we measure success?
A user-centered approach allowed us to weave together a more meaningful experience for our business users and usability testing revealed helpful insights about how information sharing and data analysis flows within large organizations.
Data visualization is a powerful tool for revealing simple answers to complex questions, but context is key. User-centered design methods ensure that your audience receives the information they need in a usable and actionable way. Data visualization and user experience practices are not mutually exclusive. They work best when they work together.

16:00 (1) Netflix recommendations: beyond the 5 stars

Xavier Amatriain (Netflix)
Netflix is known for pushing the envelope of recommendation technologies. In particular, the Netflix Prize put a focus on using explicit user feedback to predict ratings. This kind of recommendation showed its value in the time when Netflix’s business was primarily mailing DVDs. Nowadays Netflix has moved into the streaming world and this has spurred numerous changes in the way people use the service. The service is now available on dozens of devices and in more than 40 countries.
Instead of spending time deciding what to add to a DVD queue to watch later, people now access the service and watch whatever appeals to them at that moment. Also, Netflix now has richer contextual information such as the time and day when people are watching content, or the device they are using.
In this talk I will describe some of the ways we use implicit and contextualized information to create a personalized experience for Netflix users.

16:00 (2) Roll Your Own Front End: A Survey of Creative Coding Frameworks

Michael Edgcumbe (Columbia University), Eric Mika (The Department of Objects)
Many options exist when choosing a framework to build a custom data explorer on top of your company’s stack. With a brief nod to out-of-the-box business intelligence solutions, the presenters will offer an overview of the creative coding frameworks that lend themselves to data visualization on and across web browsers and native apps written for Mac OS X, iOS, Windows, and Android. Evaluation of the strengths and weaknesses of libraries such as Processing, OpenFrameworks, Cinder, Polycode, Nodebox, d3.js, PhiloGL, Raphael.js, Protovis, and WebGL will be explored through visual examples and code. The audience should come away with a sense of what investments into education will return a high value product that serves unique design goals.

16:50 Sketching With Data

Fabien Girardin (Lift Lab)
Since the early days of the data deluge, Lift Lab has been helping many actors in the ‘smart city’ transform the accumulation of network data (e.g. cellular network activity, aggregated credit card transactions, real-time traffic information, user-generated content) into products or services. Due to their innovative and transversal nature, our projects generally involve a wide variety of professionals, from physicists and engineers to lawyers, decision makers and strategists.
Our innovation methods bring these different stakeholders on board with rapidly prototyped tools that promote the processing, recompilation, interpretation, and reinterpretation of insights. For instance, our experience shows that the multiple perspectives extracted from the use of exploratory data visualizations are crucial to quickly answer some basic questions and provoke many better ones. Moreover, the ability to quickly sketch an interactive system or dashboard is a way to develop a common language amongst varied and different stakeholders. It allows them to focus on tangible product or service opportunities that are hidden within their data. In this form of rapid visual business intelligence, an analysis and its visualization are not the results, but rather the supporting elements of a co-creation process to extract value from data.
We will exemplify our methods with tools that help engage a wide spectrum of professionals in the innovation path in data science. These tools are based on a flexible data platform and visual programming environment that make it possible to go beyond the limited design possibilities of industry standards. Additionally, they reduce the prototyping time necessary to sketch interactive visualizations that allow the different stakeholders of an organization to take an active part in the design of services or products.

Thursday, 1st March

8:45 Plenary

Welcome Alistair Croll (Bitcurrent), Edd Dumbill (O’Reilly Media, Inc.)
Democratization of Data Platforms Jonathan Gosier (metaLayer Inc.)
Embrace the Chaos Pete Warden (Jetpac)
Keynote by Usman Haque Usman Haque (Pachube.com)
Using Google Data for Short-term Economic Forecasting Hal Varian (Google)
Keynote by Coco Krumme Coco Krumme (MIT Media Lab)

10:40 Video Graphics – Engaging and Informing

Max Gadney (After The Flood)
Videographics achieve the two most important criteria of the visualizer.
They engage attention and they inform.
I am currently working with the BBC to define a new format – that of the ‘Video Data Graphic’. Some of these exist online to varying degrees of success but we are codifying best practice, auditing current activity and can show our work in the market context.
I will discuss how video is an information rich medium – from a survey of data resolution across media and how these videos can complement the BBC online offering as a whole.
Some subjects to cover will be – storytelling principles – what actually works in 2 minutes – scripting and storyboarding – drafting a plan – timescales, costs and resources – designing for cognition – how video needs to understand how we perceive
I’ll be showing many examples in addition to our work.
This is a high paced session, with lots to look at and an excellent mix of storytelling and information design ideas. There is an excellent balance between theory and practical advice.

11:30 Rich Sports Data and Augmented Reality

Ryan Ismert (Sportvision, Inc)
Our presentation will cover the nascent fusion of automatically-collected live Digital Records of sports Events (DREs) with Augmented Reality (AR), primarily for television broadcast.
AR has long been used in broadcast sports to show elements of the event that are otherwise difficult to see – the canonical examples are the virtual yellow “1st and 10” line for American Football and ESPN’s KZone™ strike zone graphics. Similarly, sports leagues and teams have historically collected large amounts of data on events, often expending huge amounts of manual effort to do so. Our talk will discuss the evolution of data-driven AR graphics and the systems that make them possible. We’ll focus on systems for automating the collection of huge amounts of event data/metadata, such as the race car tracking technology used by NASCAR and the MLB’s PitchFX™ ball tracking system. We provide a rubric for thinking about classes of sports event data that encompasses scoring, event and action semantics metadata, and participant motion.
We’ll briefly discuss the history of these sports data collection technologies, and then take a deeper look at how the current first generation of automated systems are being leveraged for increasingly sophisticated analyses and visualizations, often via AR, but also through virtual worlds renderings from viewpoints unavailable or impossible from broadcast cameras. The remainder of the talk will examine two case studies highlighting the interplay between rich, live sports data and augmented reality visualization.
The first case study will describe one of the first of the next-gen digital records systems to come online and track players – Sportvision’s FieldFX™ system for baseball. Although exceedingly difficult to collect, robust player motion data promises to revolutionize areas such as coaching and scouting performance analysis, fantasy sports and wagering, broadcast TV graphics and commentary, and sports medicine. We’ll show examples of some potential applications, and also cover data quality challenges in some detail, in order to examine the impact that these challenges have on the applications using the data.
The second case study will examine the rise of automated DRE collection as an answer to that nagging question about AR – ‘what sort of things do people want to see that way?’ Many of the latest wave of AR startups are banking huge amounts of venture money that the answer is in user-generated or crowd-sourced content. While this may end up being true for some consumer-focused mobile applications, our experience in the notoriously tight-fisted rights and monetization environment of sports has led directly to the requirement to create owned, curated data sources. This came about from four realizations that we think are more generally applicable to AR businesses…

Cool looking isn’t a business, even in sports.
It must be best shown in context, over video, or it won’t be shown at all.
The ability to technically execute AR is no longer a barrier to entry. Cutting edge visualization will only seem amazing for the next six seconds.
We established impossibly high quality expectations, and now the whole industry has to live with them.

13:30 Visualizing Geo Data

Jason Sundram (Where, Inc.)
In an increasingly mobile world, we are each generating tons of geo-tagged data. Photo uploads to Instagram, tweets, Foursquare check-ins, local searches, and even real-time public-transportation feeds are commonplace. The companies that gather this data make a lot of it freely available. The people who work for these companies have many opportunities to learn from this data. But in order to learn, we must first figure out what questions to ask. Visualization is a tool that helps us think of questions and begin to answer them.
There are 3 different major ways to think about geodata:

Over time
Aggregated spatially (e.g. by county)
Aggregated by density (e.g. heatmap)

Additionally, creating tools that allow users to explore data on multiple scales (i.e. zoom) is important, but adds complexity: you have to find a tile source and perhaps even render your data to tiles.
Choice of projection is key. Most of us grew up with the Mercator projection, but an equal-area projection is often a better choice.
I will take one data set and walk through visualizing it using the 3 approaches described above.
The first example will use Processing and TileMill to generate a zoomable animated map, playing back a month’s worth of data. I’ll show how to render the map to a movie for easy distribution.
The second example will use d3.js to show the same data at a county level in a choropleth map. I’ll discuss color schemes and interaction, and compare what can be done with d3.js to Fathom’s Stats of the Union project.
The last example will talk about how to make a heatmap with millions of data points.
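As a rough illustration of the third approach, aggregating by density, here is a minimal Python sketch that bins points into a regular latitude/longitude grid and counts points per cell, which is the basic step behind a heatmap. It is not taken from the talk: the sample coordinates and the 0.01 degree cell size are hypothetical, and a real heatmap over millions of points would need a faster, array-based implementation.

```python
from collections import Counter
import math

# Hypothetical geo-tagged points as (latitude, longitude) pairs.
POINTS = [
    (37.7701, -122.4469),
    (37.7703, -122.4465),
    (37.7712, -122.4371),
    (37.3541, -121.9552),
]


def bin_points(points, cell_deg=0.01):
    """Count points per grid cell; a cell is indexed by (row, column) in units of cell_deg."""
    counts = Counter()
    for lat, lon in points:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        counts[cell] += 1
    return counts


density = bin_points(POINTS)

# The busiest cell and its count, which a heatmap would draw in the hottest color.
hottest_cell, hottest_count = density.most_common(1)[0]
```

Note that cells defined in degrees are not equal in area: they shrink in width towards the poles, which is one reason the choice of projection matters when comparing densities across latitudes.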

14:20 (1) Beautiful Vectors: Emerging Geospatial technologies in the browser

Mano Marks (Google, Inc.), Chris Broadfoot (Google)
Beautiful, useful and scalable techniques for analysing and displaying spatial information are key to unlocking important trends in geospatial and geotemporal data. Recent developments in HTML 5 enable rendering of complex visualisations within the browser, facilitating fast, dynamic user interfaces built around web maps. Client-side visualization allows developers to forgo expensive server-side tasks to render visualisations. These new interfaces have enabled a new class of application, empowering any user to explore large, enterprise-scale spatial data without requiring specialised geographic information technology software. This session will examine existing enterprise-scale, server-side visualization technologies and demonstrate how cutting edge technologies can supplement and replace them while enabling additional capabilities.

14:20 (2) It’s Not “Junk” [Data] Anymore

Kaitlin Thaney (Digital Science), Mark Hahnel (FigShare), Ben Goldacre (Bad Science)
In a research environment, under the current operating system, most data and figures collected or generated during your work are lost, intentionally tossed aside or classified as “junk”, or at worst trapped in silos or locked behind embargo periods. This stifles and limits scientific research at its core, making it much more difficult to validate experiments, reproduce experiments or even stumble upon new breakthroughs that may be buried in your null results.
Changing this reality not only takes the right tools and technology to store, sift and publish data, but also a shift in the way we think of and value data as a scientific contribution in the research process. In the digital age, we’re not bound by the physical limitations of analog media such as the traditional scientific journal or research paper, nor should our data be locked into understandings based off that medium.
This session will look at the socio-cultural context of data science in the research environment, specifically at the importance of publishing negative results through tools like FigShare – an open data project that fosters data publication, not only for supplementary information tied to publication, but all of the back end information needed to reproduce and validate the work, as well as the negative results. We’ll hear about the broader cultural shift needed in how we incentivise better practices in the lab and how companies like Digital Science are working to use technology to push those levers to address the social issue. The session will also include a look at the real-world implications in clinical research and medicine from Ben Goldacre, an epidemiologist who has been looking at not only the ethical consequences but issues in efficacy and validation.

16:00 From Big Data to Big Insights

Robbie Allen (Automated Insights)
With recent advances in linguistic algorithms, data processing capabilities and the availability of large structured data sets, it is now possible for software to create long form narratives that rival humans in quality and depth. This means content development can take advantage of many of the positive attributes of software, namely, continuous improvement, collaborative development and significant computational processing.
Robbie Allen, the CEO of Automated Insights, and his team have done this to great effect by automatically creating over 100,000 articles covering College Basketball, College Football, NBA, MLB, NFL in a 10 month period. Automated Insights is now branching out beyond sports into finance, real-estate, government, and healthcare.
In this talk, Robbie will share the lessons his company has learned about the viability of automated content and where the technology is headed. It all started with short sentences of uniform content and has expanded to the point where software can generate several paragraphs of unique prose highlighting the important aspects of an event or story.

16:50 Exploring the Stories Behind the Data

Cheryl Phillips (The Seattle Times)
A story on the U.S. Census will tell the broad themes behind the data and use people to exemplify those themes. But every reader also wants to know the answers to more specific questions: How did my community change? What happened where I live, in my neighborhood? And being able to provide those answers through an interactive visualization is what story-telling through the data is all about. A story or report on a subject by its very nature summarizes the underlying data. But readers may have questions specific to a time, date or place. Visualizing the data and providing effective, targeted ways to drill deeper is key to giving the reader more than just the story. The visualization can enhance and deepen the experience. Cheryl Phillips will discuss data visualization strategies to do just that, providing examples from The Seattle Times and other journalism organizations.