In early February, I attended the Strata Conference in Santa Clara. I’ve had a bit of trouble carving out time to write concluding thoughts about the event (mainly due to last-minute activities related to our Learning and Knowledge Analytics Conference). Overall, it was a good conference, and one that serves as an important marker of “big data comes of age” in terms of media attention, startups, and recognition of its potential impact in the business sector.
It’s difficult to escape the growing attention being paid to data and analytics. In the business sector, the data/analytics combination falls under the marketable term of “business intelligence”. A complex world with an almost incomprehensible amount of data requires a different approach to sensemaking and management than what has been effective in the past (P. W. Anderson’s phrase “more is different” is rather appropriate). Texts aimed at the mass market (Competing on Analytics, The Numerati, Super Crunchers), as well as more academic texts (Handbook of Educational Data Mining), are starting to capture the importance of data analysis as a framework for decision making. Health insurers combine social media and existing data (credit ratings, location, employment) to assess risk more accurately and cost-effectively. Universities are attempting to put a price on professors to determine where resources should be allocated. Through social network analysis (email, LinkedIn, Facebook), organizations are able to gain a better understanding of the social and collaboration fabric that influences how work gets done.
Now that it’s explicit, we can analyze it
The somewhat unanticipated side effect of the growth of the internet, social media, and mobile technologies has been the astonishing amount of our personal lives made explicit. Now, as the internet of things takes hold, sensor data is set to exceed data humans produce (see also HP’s CeNSE project).
Data is unreasonably effective in providing new insights that aren’t necessarily discoverable through traditional scientific methods. Now that the physics of data have been altered due to increased speed of processing, scaling of technology and data systems, and contribution of sensors to data, it’s time to acknowledge a dramatic shift in all aspects of business, government, health care, and education: data is the new value point, the new basis for decision making, the new foundation of competitiveness, and the new critical resource which threads through all aspects of any organization.
What we are witnessing here is the formation of a discipline and approach to data analysis that will reverberate through government, healthcare, business, and education over the next several decades. The internet and the web (in its web 2.0 and social media configurations) have resulted in new modes of connecting with others and generating and sharing information. Big data offers a new means for research, business, decision making, and gaining insights into complex phenomena. O’Reilly has stated that “data is the new Intel inside”. A nice quote, but I think it’s more: data is the new computer revolution.
Does that sound a bit overstated? I don’t think so. I’ve been interested in the growth of data (peripherally) since early 2000. Since 2005, and this is reflected in the open online courses I’ve run with Stephen Downes, Dave Cormier, and others, the shortcomings of current educational models have become increasingly evident. New modes of sensemaking – through social and technological systems – are required. Much like the identification of the SARS virus in 2003 required a global network of information sharing and exchange, complex problems are not solvable, or even understandable, through linear systems. Interestingly, the focal point of understanding a concept has shifted. For example, most research projects begin with a hypothesis, followed by data collection and analysis. In contrast, an astonishing amount of data on the economy, government, healthcare, global trade, educational attainment, and so on, is now available for analysis. Questions are no longer exclusively asked in advance of data collection. Existing data, collected over decades (or in Google’s case, recently digitized from previous centuries), awaits a clever algorithm or a provocative question. Remember Wikipedia Scanner from 2007? All that editing data, just waiting for someone to find a way to analyze it and reveal embarrassing, even unethical, edits.
Ok, so data is the new foundation of business. And society. And our economy. On to the Strata Conference. The new role of data in society is why O’Reilly’s Strata Conference was (and is – a New York version has been announced for September 2011) an important event. O’Reilly has made recordings and slides available. They are well worth taking a few hours to review.
A quick look at the sponsor pavilion provides a sense of where organizations are trying to make money from big data. At this stage, the focus is on data storage and manipulation, with some emphasis on extracting “intelligence” from the data. The presence of numerous startups as well as sessions on “how to make money from big data” reflected the entrepreneurial bent of the conference.
On the third day of the conference, the $3 million Heritage Health Prize was announced. The message is clear: big data=big money. A clever algorithm or new insights drawn from existing data can have a huge financial impact on an organization or society. Data analysis provides new ways of uncovering inefficiencies or providing interventions (in the case of health care) before patients require admission to a hospital.
It’s also interesting to note who was at Strata – Amazon, EMC, Wolfram, numerous startups (does Cloudera still count as a startup?), Infochimps, and IBM. Even more interesting to note is who wasn’t there: Yahoo, Oracle, Google, and Microsoft (Google/Microsoft had representation in the program and Microsoft was in the vendor area, but both seemed diminutive compared with the prominence of Amazon and others). Google’s lack of presence is especially interesting given that MapReduce and the Google File System inspired Hadoop.
The taxonomy of big data value
Based on what I saw at the conference (and read in literature coming up to the event), I’ve put together a taxonomy of themes and value points in relation to big data:
Database & data cluster providers: EnterpriseDB, Greenplum, Hadoop, Amazon (hosted, on-demand services)
Content: curated, computed, indexed/aggregated: Wolfram Alpha, Thomson Reuters
Data markets: Windows Azure Marketplace, DataMarket
Intelligence: Splunk, R, Tableau, IBM
Visualization: Gephi, tools within LinkedIn, Facebook
As indicated in the image below, each element builds on previous ones.
According to Jake Porway’s Strata Roundup post, the data science field is starting to mature, as indicated by distinctions and clarifications between concepts of big data vs. data science, using vs. showing data, and so on. Fields, when young, are ambiguous and terms carry multiple meanings. Once we start to pull apart concepts and draw focused distinctions, we can begin to see the field in greater depth.
Drew Conway states: “I think the future of big data, for me, is going to be, where you can take data sets that are large, but are also meaningful, together, and find ways to put them together to find [new] meaning.” Most of the focus at Strata Conference was on databases, clusters, and techniques. A few examples of different data initiatives were provided (Guardian Data), but the discussion mainly focused on tools and early methods. The intelligence (or “new meaning” that Conway mentions) is still a bit in the future. While some interesting visualization tools already exist, skills and technologies for effective visualization will require more time to develop. Visualization is a discipline in itself – communicating and presenting images to convey meaning requires the mindset of an artist. The importance of visualization was at least partly addressed during Strata.
Big job opportunities
A few of the individuals presenting at the conference – notably Drew Conway and Hilary Mason – exhibited the skills of end-to-end data science. I think they are an anomaly. Most data initiatives in corporations, government, health care, and higher education will be team-based. It’s the (very) rare individual who has the skills of Drew and Hilary: Python programming, R, Hadoop, visualization, and so on. As methods and analysis techniques become more developed, data analysis will experience the same specialization that computer science experienced in the mid-90s (remember when organizations used to have “the computer guy” to go to for all computer/software/network/application related problems? A single source doesn’t work as complexity intensifies).
So what does a data science team look like? These teams need five main roles:
- Stakeholder – funder, manager, researcher, or policy maker
- Data scientist – essentially the data science team leader: needs a broad understanding of data analysis and the types of questions that can be asked of data, techniques for cleaning and using data, the ability to recognize deficiencies in structured/unstructured data, and proficiency with data clusters
- Programmer – writes the code to access, clean, and filter data.
- Statistician – perhaps I’m projecting my weakness, but any reasonably sized data team will require a statistician to make sense of the patterns evident in large data sets
- Visualizer – designing a great data team, architecting a strong data analysis model, pulling, filtering, and analyzing data is a great start. Presenting the data in a manner that helps others understand what it means is critical. A skilled data visualizer (not the best term, I know) is critical to a successful data team.
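To make the hand-offs between these roles a bit more concrete, here is a minimal sketch in Python of how the programmer, statistician, and visualizer pieces might fit together. Everything in it is an assumption for illustration: the sample data (weekly logins per student), the function names, and the text bar chart are hypothetical and don't come from anything presented at Strata. A real team would be working with far larger data sets and far better tools.

```python
import csv
import io
import statistics

# Hypothetical raw data: weekly logins per student, including messy rows.
RAW_CSV = """student,logins
a01,12
a02,
a03,7
a04,not recorded
a05,30
"""


def load_rows(text):
    """Programmer: access the raw data and parse it into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Programmer: keep only rows whose login count is a whole number."""
    cleaned = []
    for row in rows:
        value = (row["logins"] or "").strip()
        if value.isdigit():
            cleaned.append((row["student"], int(value)))
    return cleaned


def summarize(records):
    """Statistician: describe the pattern in the cleaned data."""
    counts = [n for _, n in records]
    return {
        "n": len(counts),
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
    }


def render(records):
    """Visualizer: present the data so others can read it at a glance."""
    return "\n".join(f"{student} {'#' * n}" for student, n in records)


if __name__ == "__main__":
    records = clean(load_rows(RAW_CSV))
    print(summarize(records))
    print(render(records))
```

The stakeholder and data scientist don't show up as functions here. Their work is deciding which question is worth asking of the data and whether the cleaning rules are defensible, and that is exactly the part that code can't do.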
Implications for education
My main disappointment in the Strata Conference was the lack of attention paid to education. Only one session in the conference had an educational focus in the description…but the session didn’t discuss the potential impact of analytics in K-12 and higher education. This void was particularly pronounced due to the attention the US is currently giving to improving education and educational attainment. For example, ~15 of the 49 finalists in EDUCAUSE’s Next Generation Learning Challenges Wave I are projects focused on analytics or adaptivity of learning content based on analytics. Learning analytics can help to open up the black box of education and provide valuable insight into what’s working (or not) and ways to adapt and personalize content and the learning process. If you’re interested in additional resources, the syllabus from the course Learning and Knowledge Analytics is available (we wrapped the course up on Friday).
Learning management system and higher education ERP vendors are turning their attention to analytics:
- iStrategy’s focus on “business intelligence for higher education”
- Blackboard Analytics: “Transforming data into actionable information. Enabling informed decision making and improved performance.”
- Sungard’s purchase of Purdue’s Signals
I’ve posted additional resources on diigo, tagged stratconf.