In a TED talk video, Altimeter industry analyst Susan Etlinger discusses the ethical implications of big data, and how we can use it to extract real insights in a way that builds trust. iSGTW recently followed up with Etlinger about her pursuit of a framework to help traditional analytics methodologies account for the variation and complexity of big data.
What's your definition of big data?
I tend to use Gartner's definition, which begins with “The Three Vs”: volume, velocity and variety. Big data has all of these attributes, although in the past several years variety has been most challenging. One example of this is the variety in social data: a post can contain language, images, video, metadata, or a button (retweet or share) that also carries a range of interpretation. Of course, there are many, many other kinds of big data: transactions, log files, data from sensors, and so on.
What do you see as the promise/potential of big data?
I would hope, and I stress hope, that big data can help us identify patterns and discover insights that we would never have seen otherwise. This has particular relevance in genomics, because genes are of course so complex, and the expression of these genes is even more so.
So when we look at diseases, conditions, and neurological issues, understanding these patterns can (hypothetically) help us find more targeted cures. With breast cancer, for example, big data analysis has uncovered insights about which subgroups of breast cancer have the best survival rate (learn more here). IBM is also doing some very interesting work with Watson, related to designing the hospital of the future (watch TED@IBM).
Who does big data affect? What are some of its larger implications?
Big data can affect most of us, really. If we use a smartphone, are photographed, interact via social media, buy or sell securities, or have a surgical procedure, we are creating big data. In academia, big data offers a new data set for understanding consumer and patient concerns. How do people talk about smoking? Addiction? Knee pain? Cancer? Depression?
The work being done by the Health Media Collaboratory at the University of Illinois at Chicago, US, is an excellent example of how public health researchers can use big data to better understand consumer attitudes about smoking, smoking-related advertising, and other issues.
Can you describe examples of misinterpreted big data?
There are countless ways to interpret and misinterpret big data. For example, if you survey 10,000 people on a topic, you can have a pretty strong level of confidence in your results. If you pull 10,000 tweets on a specific topic, you may have issues related to sourcing (where you got the data), sampling (how representative it is), meaning (how it's interpreted), filtering (spam versus legitimate posts), enrichment (inferences about demographics and location), and aggregation (what it means in relation to other data), among other things.
When you pull 10,000 tweets on the same topic on two successive days, you'll see differences in the results. And that doesn't even account for analytical and logical flaws. The outcome could be a misdiagnosis or a misallocation of research funds. The most important thing we have to do is accept that big data brings with it a huge amount of uncertainty. To mitigate that uncertainty, we have to reassess the methodologies we use to interpret it.
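The contrast Etlinger draws between a designed survey and a convenience sample of tweets can be made concrete with the standard margin-of-error formula for a sampled proportion. This sketch (the function name and defaults are illustrative, not from the interview) shows why 10,000 randomly sampled survey responses support a strong confidence claim, roughly plus or minus one percentage point, while the same arithmetic simply does not apply to 10,000 tweets, which are not a random sample of any well-defined population.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion estimated
    from n independent random responses (worst case at p = 0.5)."""
    return z * math.sqrt(p * (1 - p) / n)

# A random-sample survey of 10,000 people: roughly +/- 1 percentage point.
print(f"{margin_of_error(10_000):.4f}")  # ~0.0098
```

The formula assumes independent draws from the population of interest; the sourcing, sampling, and filtering issues listed above are precisely the ways a tweet pull violates that assumption.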
You point to the humanities as a necessary context for interpreting data. Why?
So much of unstructured big data comes from human expression, so we also need disciplines such as linguistics, ethics, rhetoric, sociology, and anthropology. If we're looking at complex, volatile, and messy data, interpretation and analysis become dependent on our ability to understand human means of expression. That is a hard, hard challenge, especially in multiple languages and across multiple cultures.
The key, ultimately, is context, and, without getting too metaphysical, what constitutes context is contextual. In the end, I think it comes down to clarifying the desired research question and making it as specific as possible:
- Is there a link between X and Y?
- Can we detect whether it is correlative or causal?
- What is the likely impact of X on Y population?
- With what confidence level are we able to determine this?
In this sense, big data analysis is, as Philip Stark at UC Berkeley has said, really another name for applied statistics.
Does what constitutes a valid interpretation shift from domain to domain?
Oh, absolutely. And this is where we need to think about use cases. How open-ended is the inquiry? How structured is the data? Do we have a hypothesis? Are we looking at a limited or broad number of variables?
When it comes to language, for example, I believe that the simpler the question, the clearer the signal. So: are people talking about something? In what language? With what sentiment? How confident are we about it? Would multiple raters have the same or substantially similar results? Have we accounted for language and cultural differences?
The challenge here is that we can't always know what we don't know. In this way, the study of data will always be a little bit art and a little bit science.
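The question above about whether multiple raters would produce substantially similar results has a standard quantitative form: chance-corrected agreement statistics. Cohen's kappa is one common choice (the interview does not name a specific statistic; it is used here as an illustrative sketch, and the example labels are invented):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters over the same
    items, corrected for the agreement expected by chance."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Proportion of items on which the raters actually agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected if each rater labeled at random, keeping
    # their own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    return (observed - expected) / (1 - expected)

# Two hypothetical raters labeling the sentiment of six posts.
a = ["pos", "pos", "neg", "neu", "pos", "neg"]
b = ["pos", "neg", "neg", "neu", "pos", "neg"]
print(round(cohens_kappa(a, b), 3))  # 0.739
```

Kappa of 1.0 means perfect agreement; values near 0 mean the raters agree no more often than chance, which is one concrete way to answer "how confident are we about it?" for sentiment labels.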
For more on the necessary contexts and ethical challenges of interpreting big data, read Etlinger's latest research report.