Introduction
This post has been lingering in the back of my mind for quite a while. Developments are so fast-paced at the moment that it’s hard to keep up with them, let alone write a post on the topic.
This post is heavily inspired by Cory Doctorow’s talk at the 28th Chaos Communication Congress aptly titled “The Coming War on General Purpose Computation” (the transcript of which can be found here.)
At the core of my argument lies the analogous application of Doctorow’s argument to the computation of data, that is: data is not easily constricted in purpose without fundamentally crippling that which we understand as computation. It essentially reverses the vector of “The War on General Purpose Computation,” aiming not at our use of copyrighted material and the industry’s attempts to censor that use, but rather at how we’re unable to effectively restrict governments and industries (and even each other) in their use of “our” data.
And while similar arguments have been made by critics of “Big Data,” they themselves massively underestimate the magnitude of changes to come. It is my conviction that Big Data, although very useful in laying out the fundamental shifts, is a concept too small in scope to appropriately describe the world we’re building.
Everything is Data, Data is Everything
In the United States, Congress is currently deliberating on a privacy bill which tries to balance legitimate business needs with regard to data harvested on the internet against users’ legitimate and, especially important in the US, reasonable expectations of privacy.
The types of data talked about in the public debate are revealing. The New York Times recently made headlines with a story detailing Target’s GuestID system, which allowed for comprehensive profiling and predictive targeting of its customers. Google is coming under increasing scrutiny over its new privacy policy, which abandons the walls between the user data collected on its many properties. Those walls were meant to prevent exactly this kind of detailed and comprehensive profiling of the users of its services.
Adding to that, an app called “Girls Around Me,” which extracted the location check-ins of female users of Foursquare and Facebook, combined them with publicly accessible profile information, and packaged them into an app targeted essentially at the male “pick up” clientele, sparked a controversy over its “creepy factor.”
And this is just what’s happening now. Data is being collected, harvested, hoarded and analysed at seemingly every corner. As last year’s visualisation of German politician Malte Spitz’s cell phone location data for the newspaper “Die Zeit” has shown, this data alone can paint a comprehensive picture of a person’s movements. Similarly, legislation requiring British telecommunications companies to retain almost all aspects of their customers’ online communication is currently under discussion in the House of Commons.
Add to that a multitude of additional data sources, from the location data of your shared photographs to how well you slept last night. And we’re just getting started.
The “Quantified Self” is just about to break into the mainstream, with products like the Jawbone UP, the Nike FuelBand (and its predecessor/companion, the Nike+), the Withings scale and a plethora of other sensor-equipped hardware currently aiming to conquer first the fitness market, and then the everyday.
And it doesn’t stop there. We’re analysing our attention data: how many phone calls we make and when, how many emails we write, and how productive we are. And if we zoom out, we see this happening on a large scale, too. The Air Quality Egg, a crowdsourced approach to measuring air quality, reached its Kickstarter funding goal in a mere three days. It joins other initiatives like Safecast.org, a volunteer organisation that measures radioactivity levels, and a New York-based project that measures sewer levels and makes this information publicly available in an effort to avoid sewage spill-over into the river water.
The scope and size of the data being collected right now are unimaginable and unprecedented. And they’re growing.
Data-Driven Modelling
Add to that the fundamental shift we are witnessing right now in the way in which data gets collected, stored and analysed.
Collecting, storing and analysing data used to be so expensive and time-consuming that the only reasonable way to go about it was to do it only when it was necessary. You wanted to have a model and a hypothesis about how the thing you were looking into worked.
You would build a model with a hypothesis in mind, gather the data, analyse it, and then either confirm or refute the hypothesis.
We’re now in an age where collecting data is the new normal, where storing and processing have come down in price so much that in a lot of cases simply adding storage is now cheaper than deleting data.
This reverses how we view data and interact with it. The sensible approach is to collect the data now and see what you can find in it later. And a lot of the time, you can even let machines do that job for you. Machine learning is already at the point where it can extrapolate fundamental laws of physics from just watching a pendulum for two hours.
This means that I don’t necessarily have to have an idea of what I want to look for in the data. And it means that I don’t necessarily have to expect to find anything in the data — most likely someone will. This means that any sort of data is potentially valuable, even if the value is only recognised and realised much later, as we’re building models out of the data we have, not looking for data to fit our models.
Recontextualisation
This collected data is not easily constricted in purpose. The data you gather now can be used for a multitude of purposes and in a multitude of contexts, none of which may even have existed at the moment the data was gathered.
What this development implies is that the value I’m extracting from the data, the insights I glean, need have nothing to do with the circumstance and context in which the data was collected. Think, for instance, of the San Francisco parking system that is currently under construction. On the face of it, it is an elaborate sensor system which tries to identify empty parking spaces and communicate them to drivers looking for parking, while at the same time using flexible parking rates to maximise the income potential for the municipality. This sounds like a reasonable and innocent use of technology. But then you realise that you have profiles of movement for many of the participants in the system.
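To make the parking example concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the record layout, the idea that each record carries some identifier (say, a payment account), and all names and values are invented and say nothing about how the actual San Francisco system works. The point is only that the same records serve both the intended purpose and the repurposed one.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical parking records: (payer_id, block, arrived_at).
# The schema and every value are invented for illustration only.
RECORDS = [
    ("payer-A", "Mission St 2400", datetime(2012, 4, 2, 8, 55)),
    ("payer-B", "Market St 900", datetime(2012, 4, 2, 9, 10)),
    ("payer-A", "Valencia St 500", datetime(2012, 4, 2, 18, 30)),
    ("payer-A", "Mission St 2400", datetime(2012, 4, 3, 8, 50)),
]


def blocks_in_use(records):
    """Intended use: which blocks have seen parking activity."""
    return {block for _, block, _ in records}


def movement_profile(records):
    """Repurposed use: the same records, grouped per payer in time order."""
    profile = defaultdict(list)
    for payer_id, block, arrived_at in sorted(records, key=lambda r: r[2]):
        profile[payer_id].append((arrived_at, block))
    return dict(profile)
```

Nothing about the records changes between the two functions. Only the question asked of them does, and the second question was never part of the reason for collecting them.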
And if you look at a recent story in the UK, where it has been proposed that drivers who have not paid their road tax be identified at gas stations by the omnipresent CCTV and be refused fuel, you see how easily this data can be bent in its use.
It is indeed not the data itself which we are concerned with. It is the recontextualisation of this data that worries us. It’s not check-ins on Foursquare which are objectionable, it’s the repurposing of this data for use in potentially crime-inviting packaging, be it in the form of “Girls Around Me,” or the previous “Please Rob Me.”
It is this recontextualisation of data which gets privacy advocates up in arms, and which is not easily constricted. There’s no technical system which allows us to say: you can use this data for these purposes, but not for those, just as there is no easy way to stop me from copying a sound file.
We can’t control where the data we share goes, let alone the data about us that we don’t even know exists. We should work towards rules that at least let us know what’s being done with it.