I’m at the Strata Conference this week. It’s the first conference I’ve attended in years where I don’t have to speak or do anything except listen and learn.
Yesterday, after fortunately meeting up with David Wiley, I spent the day in a “data bootcamp”. Slides, code, and data sources are available on GitHub. This was an engaging session where audience members were encouraged to code along with the presenter. Most attendees hadn’t downloaded the needed tools (Python, Python libraries (e.g. SciPy, NumPy), R). The room was long and narrow, and periodic requests from speakers along the lines of “can you see this on the screen” were not helpful to anyone in the back half of the room. I did find the discussion of k-means clustering and k-nearest-neighbor analysis of individual pixels interesting, but I don’t think it will be on my horizon of interest anytime soon!
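For anyone who hasn’t run into k-means on pixels before, here is a minimal sketch of the idea in Python with NumPy. To be clear, this is not the bootcamp’s code (that is in the GitHub materials). The function name `kmeans_pixels`, the synthetic “image”, and the farthest-first starting centers (swapped in for the usual random start so the result is repeatable) are all my own assumptions for illustration.

```python
import numpy as np

def kmeans_pixels(pixels, k, n_iter=20):
    """Group an (n, 3) array of RGB pixels into k clusters (hypothetical helper)."""
    pixels = np.asarray(pixels, dtype=float)
    # Assumption: farthest-first initialization instead of a random start.
    centers = [pixels[0]]
    for _ in range(1, k):
        # Distance from each pixel to its nearest already-chosen center.
        nearest = np.min([np.linalg.norm(pixels - c, axis=1) for c in centers], axis=0)
        centers.append(pixels[nearest.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Distance from every pixel to every center, shape (n, k).
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean color of the pixels assigned to it.
        for j in range(k):
            members = pixels[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels, centers

# Made-up data: 50 dark pixels followed by 50 bright pixels.
rng = np.random.default_rng(1)
dark = rng.integers(20, 41, size=(50, 3))
bright = rng.integers(210, 231, size=(50, 3))
pixels = np.vstack([dark, bright])

labels, centers = kmeans_pixels(pixels, k=2)
print(labels)
print(centers.round(1))
```

On a real image you would reshape the height × width × 3 array to (n, 3) first, then reshape the labels back to height × width to see the segments.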
Hilary Mason and Drew Conway provided a great overview of how to extract and visualize email data (slides and code included in the GitHub link). I was less interested in the technical details of the process…and more interested in the simplicity of the process and the functionality of the open source tools used in performing the analysis.
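To give a feel for that simplicity, here is a rough sketch of the extraction step using nothing but Python’s standard library. It is not Hilary and Drew’s code; the sample messages, the addresses, and the `sender_counts` helper are made up for illustration, and a real analysis would read from an actual mailbox and go on to plot the counts.

```python
from collections import Counter
from email import message_from_string
from email.utils import parseaddr

# Made-up messages standing in for a real mailbox.
RAW_MESSAGES = [
    "From: Alice <alice@example.com>\nTo: bob@example.com\nSubject: Hi\n\nHello Bob",
    "From: alice@example.com\nTo: carol@example.com\nSubject: Notes\n\nSee attached",
    "From: Bob <bob@example.com>\nTo: alice@example.com\nSubject: Re: Hi\n\nHello Alice",
]

def sender_counts(raw_messages):
    """Count messages per sender address (hypothetical helper)."""
    counts = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        # parseaddr splits "Alice <alice@example.com>" into (name, address).
        _, address = parseaddr(msg.get("From", ""))
        if address:
            counts[address.lower()] += 1
    return counts

counts = sender_counts(RAW_MESSAGES)
for address, n in counts.most_common():
    print(address, n)
```

A table of counts like this is the sort of thing that can then be fed into a plotting library or into R for visualization.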
My main disappointment with the workshop (aside from a few concerns about conference organization – the session started late, there were audio problems, and the screens were poorly placed for the room layout) was that it didn’t fully address the topics in the outline. In particular, the discussion of “how to build a data science team” was missing. Overall, though, a great session!
“We don’t need no education” – no longer true
The focus of data analysis for Strata is very much on the business sector. I’ve only heard a few mentions of government data, none on healthcare (until the $3 million healthcare data challenge this morning – more on that in a later post), and none related to education. This is likely due to first-year conference syndrome where the conversation is more abstract and focused on establishing some foundational language and clarification of important concepts. O’Reilly announced a Strata Conference in New York in September – hopefully they’ll have increased the emphasis on healthcare and educational data analysis. I’m obviously a bit biased due to running the Learning and Knowledge Analytics open course and conference (only a few weeks to register!).
As I listened to different presentations and participated in hallway conversations, it quickly became clear that data science is not going to have the same boot-strapping feel that hackers and self-taught coders have enjoyed. A 15-year-old kid, learning programming in her/his spare time, could write a program and contribute to open source programs without any formal education. I don’t think that’s going to be the case in data science, primarily because it sits in several domains: math, statistics, programming, networks/complexity theory, distributed data storage/analysis, etc. At best, data science requires a team, or a fairly well-educated statistician with a well-developed hacking streak.