De-identification is a technique, often applied to large data sets, that removes identifying information so that the benefits of correlative analysis can still be gained while reducing the privacy risks of holding so much data on individuals. Much of the literature on Big Data privacy focuses on de-identification and the possibility of re-identification. Several prominent examples of re-identification exist, including the re-identification of Massachusetts Governor William Weld from anonymized health information, the correlation of anonymous Netflix viewers with public IMDb reviewers, and the re-identification of individuals from AOL search data. Since these highly publicized incidents, a lot of work has gone into improving de-identification techniques in order to understand and minimize the chance of re-identification.
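To make the re-identification risk concrete, here is a minimal Python sketch of a linkage attack. It is an illustration under stated assumptions, not a description of how any of the incidents above was actually carried out. The records, the field names (`zip`, `birth_date`, `sex`, `diagnosis`) and the helper functions `quasi_key`, `k_anonymity` and `link` are all hypothetical. The idea is that a data set with names removed can still be joined to a public list on the attributes the two share, and any record that is unique on those attributes can be tied back to a name.

```python
from collections import Counter, defaultdict

# Assumption: these three fields act as quasi-identifiers, i.e. attributes
# that are not names but can still single a person out when combined.
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def quasi_key(record):
    """Return the tuple of quasi-identifier values for one record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def k_anonymity(records):
    """Return the size of the smallest group sharing one quasi-identifier key.

    A result of 1 means at least one record is unique on these fields.
    """
    counts = Counter(quasi_key(r) for r in records)
    return min(counts.values())


def link(deidentified, public):
    """Pair names with de-identified records that match unambiguously.

    A match counts only when the key occurs exactly once in each data set.
    """
    names_by_key = defaultdict(list)
    for person in public:
        names_by_key[quasi_key(person)].append(person["name"])

    key_counts = Counter(quasi_key(r) for r in deidentified)

    matches = []
    for record in deidentified:
        key = quasi_key(record)
        if key_counts[key] == 1 and len(names_by_key[key]) == 1:
            matches.append((names_by_key[key][0], record))
    return matches


# Entirely fictional sample data.
deidentified = [
    {"zip": "02138", "birth_date": "1950-01-01", "sex": "M", "diagnosis": "A"},
    {"zip": "02139", "birth_date": "1962-03-04", "sex": "F", "diagnosis": "B"},
    {"zip": "02139", "birth_date": "1962-03-04", "sex": "F", "diagnosis": "C"},
]
public = [
    {"name": "Person X", "zip": "02138", "birth_date": "1950-01-01", "sex": "M"},
    {"name": "Person Y", "zip": "02139", "birth_date": "1962-03-04", "sex": "F"},
]

print("k =", k_anonymity(deidentified))
for name, record in link(deidentified, public):
    print(name, "->", record["diagnosis"])
```

In this toy data, Person X is unique on the three fields and is linked to a diagnosis, while Person Y shares a key with two records and so cannot be pinned to either one. Generalizing or suppressing fields until every group has at least k members is one common way to reduce this risk.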
At the recent IAPP Privacy Academy in Seattle, Washington, I sat in on a session entitled Taming Big Data, in which a good part of the focus was de-identification and the possibility of re-identification. However, this focus, even within the privacy community, on identifiability as the sole source of privacy violations ignores an entire class of potential harm. Not all privacy violations target the individual; some affect society in ways in which the individual does not wish to participate.
The promise of Big Data in teasing out previously unknown correlations is huge. Sometimes the data going in has no relevance to persons (weather information, for instance), but sometimes it does. Personal data can be used in aggregated form to predict patterns of behavior or to suggest how social actors will respond to various stimuli. For instance, Big Data could help a grocery store fine-tune prices to maximize profit. The store may find that certain neighborhood characteristics (propensity to like sports, average height, etc.) correlate well with the price point of certain items. It can then lower or raise the price of those items in stores in neighborhoods with those characteristics. Society is finicky when it comes to price discrimination. We accept that passengers on a plane may pay wildly different prices for transport from one location to the next. We also accept price discrimination (via coupons) in favor of people who have more time (to clip coupons) than money to spend on similar goods. Similarly, businesses may discriminate based on group membership: seniors, students, military. While businesses appear to be offering socially beneficial discounts to these groups, the truth is that they have to offer the discounts to attract the groups' business, because their members are generally less willing to pay full price for the service.
Not all price discrimination is socially approved. In fact, some may view price discrimination as a form of fundamental unfairness and inequality. Therefore the use of my information to price discriminate (even if it positively affects me or doesn’t affect me at all) may be deemed a privacy violation despite the fact that the information was aggregated and de-identified. De-identification as a control doesn’t address the risk that the aggregate data may be used in a way I deem socially unacceptable, be it price discrimination, assessing credit risk or policing. Even if my individual contribution is de minimis, the collective contribution of all the people’s data may have an effect on society that I don’t want to participate in.
There is certainly a public policy argument to be made about whether the use to which the data will be put outweighs the privacy violation, but it cannot be a balance between an individual’s privacy violation and the social good. In that circumstance, the individual will always lose. The comparison must be fair, and to make it fair, one must weigh the data use against the collective harm of interfering with the decision making of the population whose data is to be used. Does the social benefit outweigh the social harm from ignoring and disrespecting people’s decisions about their information? For this particular privacy harm, de-identification is irrelevant.