Re-identification is the process of linking records in an anonymized dataset back to the individuals they describe, typically by combining them with other public datasets. All you need is a laptop, an internet connection and access to public datasets, and you can start digging for personally identifiable information (PII) hidden in the data. It looks simple; in practice it is difficult but not impossible, as researchers from the Whitehead Institute recently showed. They were able to re-identify 50 individuals who had submitted personal DNA information in genomic studies such as the 1000 Genomes Project.
Like surnames, the Y chromosome is passed on from father to son. Using this link, the researchers started analysing public databases that house Y-STR data together with surnames. They linked these public datasets to the data collected by the Center for the Study of Human Polymorphisms (CEPH) and identified 50 men and women in data that had been de-identified. With more and more public datasets becoming available, could the re-identification of individuals pose a real threat to the use of Big Data and open datasets?
Re-identification could lead to privacy issues and to information becoming publicly available that should never have been released. The re-identification of Massachusetts Governor William Weld, who collapsed on stage while receiving an honorary doctorate from Bentley College, caused a stir. In 1997, using a dataset released by the Massachusetts Group Insurance Commission to improve healthcare and control costs, MIT graduate student Latanya Sweeney was able to re-identify Weld using some simple tactics and a voter list. Eventually this study led to the development of the de-identification provisions in the American Health Insurance Portability and Accountability Act (HIPAA).
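The kind of linkage described above can be illustrated with a small sketch. It assumes that both the de-identified dataset and the voter list share three fields (ZIP code, birth date and sex); the field names, the function `link_records` and every record below are fabricated for illustration and are not taken from the actual study.

```python
from collections import defaultdict

# Fields assumed to appear in both datasets (an assumption for this sketch).
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")

# Fabricated "de-identified" health records: no names, but quasi-identifiers remain.
deidentified_records = [
    {"zip": "02138", "birth_date": "1950-01-15", "sex": "M", "diagnosis": "condition A"},
    {"zip": "02139", "birth_date": "1962-07-04", "sex": "F", "diagnosis": "condition B"},
    {"zip": "02140", "birth_date": "1975-03-09", "sex": "F", "diagnosis": "condition C"},
]

# Fabricated public voter list: names together with the same quasi-identifiers.
voter_list = [
    {"name": "Person One", "zip": "02138", "birth_date": "1950-01-15", "sex": "M"},
    {"name": "Person Two", "zip": "02139", "birth_date": "1962-07-04", "sex": "F"},
    {"name": "Person Three", "zip": "02139", "birth_date": "1962-07-04", "sex": "F"},
    {"name": "Person Four", "zip": "02141", "birth_date": "1980-11-30", "sex": "M"},
]


def quasi_key(record):
    """Build the tuple of quasi-identifier values used to join the two datasets."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link_records(deidentified, voters):
    """Return (name, diagnosis) pairs for records that match exactly one voter."""
    names_by_key = defaultdict(list)
    for voter in voters:
        names_by_key[quasi_key(voter)].append(voter["name"])

    matches = []
    for record in deidentified:
        candidates = names_by_key.get(quasi_key(record), [])
        # Only a unique match counts as a re-identification in this sketch.
        if len(candidates) == 1:
            matches.append((candidates[0], record["diagnosis"]))
    return matches


matches = link_records(deidentified_records, voter_list)
```

In this toy data only the first record is linked to a single name. The second record matches two voters and therefore stays ambiguous, and the third matches nobody. This is why the uniqueness of a combination of fields, rather than any single field, drives the risk.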
Re-identification can have serious consequences when, for example, private health information is recovered that could lead to discrimination, embarrassment or even identity theft. One could also imagine medical records influencing a child custody battle. That is why HIPAA lists 18 specific identifiers that must be removed prior to data release. Unfortunately, this does not stop people from trying to re-identify individuals in large datasets.
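As a rough sketch of what removing identifiers before release can look like in practice, the snippet below drops a handful of fields and reduces a birth date to its year. The field names and the function `strip_identifiers` are hypothetical, the set covers only a few of the 18 identifier categories, and the code is an illustration rather than a compliant implementation.

```python
# Hypothetical field names standing in for a few of the 18 identifier categories.
DIRECT_IDENTIFIER_FIELDS = {"name", "phone", "email", "ssn", "medical_record_number"}


def strip_identifiers(record):
    """Return a copy of the record without direct identifiers and with the date coarsened."""
    cleaned = {
        field: value
        for field, value in record.items()
        if field not in DIRECT_IDENTIFIER_FIELDS
    }
    # Keep only the year of an ISO-formatted birth date (assumed format: YYYY-MM-DD).
    if "birth_date" in cleaned:
        cleaned["birth_year"] = cleaned.pop("birth_date")[:4]
    return cleaned


# Fabricated example record.
patient = {
    "name": "Person One",
    "ssn": "000-00-0000",
    "birth_date": "1950-01-15",
    "sex": "M",
    "diagnosis": "condition A",
}
released = strip_identifiers(patient)
```

The released record still carries fields such as sex and birth year, which is why removing listed identifiers lowers the risk of re-identification without eliminating it.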
A well-known example is the re-identification of a Netflix dataset by Arvind Narayanan. The study used a public dataset released as part of a contest that Netflix organized to improve its movie recommendation engine. Narayanan and his team were able to re-identify users in the anonymous database, and the study led to a privacy lawsuit against Netflix, which consequently cancelled a second contest in 2010. There are more examples of researchers re-identifying individuals in large datasets, and as long as it is done by researchers with good intentions it seems acceptable. But imagine what happens when hackers with bad intentions start doing the same.
A solution could be to require organisations to perform a threat analysis on a dataset before releasing it to the public, checking for datasets already available online that could be used to re-identify the people in it. However, not many organisations do this and, as Narayanan explains, it is a tricky business because datasets released in the future could still undermine anonymity. To solve this problem for contests such as the Netflix contest, Narayanan describes two rules that could help: 1) use a small, fabricated set of data in the first round so that contenders can develop their code and algorithms, and 2) have the finalists sign an NDA before releasing the full dataset to them.
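One possible ingredient of such a threat analysis is a check on how many records are unique, or nearly unique, on the fields that are likely to appear in other public datasets. The sketch below swaps in a simple k-anonymity style count for that purpose; it is not a method described by Narayanan, and the field names, the threshold `k` and the sample records are assumptions made for illustration.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )


def smallest_group_size(records, quasi_identifiers):
    """Return the size of the rarest combination, or 0 for an empty dataset."""
    sizes = group_sizes(records, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


def risky_records(records, quasi_identifiers, k):
    """Return the records whose quasi-identifier combination occurs fewer than k times."""
    sizes = group_sizes(records, quasi_identifiers)
    return [
        record
        for record in records
        if sizes[tuple(record[field] for field in quasi_identifiers)] < k
    ]


# Fabricated records with already coarsened ZIP codes and birth years.
sample = [
    {"zip": "021**", "birth_year": 1950, "sex": "M"},
    {"zip": "021**", "birth_year": 1950, "sex": "M"},
    {"zip": "021**", "birth_year": 1962, "sex": "F"},
]
fields = ("zip", "birth_year", "sex")

rarest = smallest_group_size(sample, fields)
flagged = risky_records(sample, fields, k=2)
```

Here the third record is the only one with its combination of values, so it is flagged for `k=2`. A check like this only reflects the data available today, which matches the caveat that datasets released later can still change the picture.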
On the other hand, how likely is it, and how much effort does it take, to re-identify individuals in such massive amounts of data? Dr. Latanya Sweeney reported in 2007 that 0.04% (4 in 10,000) of individuals in the USA who appear in datasets anonymized according to HIPAA standards can be re-identified. To put this in perspective: the risk is about four times the lifetime odds of being struck by lightning (1 in 10,000), and both are small.
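The two figures quoted above can be put side by side with a few lines of arithmetic. The dataset size of one million records is an assumption chosen purely for illustration.

```python
from fractions import Fraction

# Figures quoted in the text.
reidentification_risk = Fraction(4, 10_000)  # 4 in 10,000
lightning_odds = Fraction(1, 10_000)         # 1 in 10,000

# The same risk expressed as a percentage.
risk_percent = float(reidentification_risk * 100)

# How the two odds compare.
ratio = reidentification_risk / lightning_odds

# Expected number of re-identifiable people in a hypothetical dataset of one million records.
hypothetical_dataset_size = 1_000_000
expected_reidentified = reidentification_risk * hypothetical_dataset_size

print(f"risk: {risk_percent}%")
print(f"ratio to lightning odds: {ratio}")
print(f"expected re-identifiable records: {expected_reidentified}")
```

A small individual risk can still translate into hundreds of affected people once a dataset is large enough, which is one way to read the numbers.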
If that is the case, perhaps we should not worry too much about it, as long as the precautions defined by HIPAA are taken, and perhaps we should see it as a risk that is part of life. If we do not want to accept this risk, should we abandon the use of public datasets completely? As Daniel Barth-Jones (an epidemiologist and statistician at Columbia University) explains, important social benefits of de-identified public datasets, as well as business, commercial and educational benefits and innovation opportunities, would be lost if we stopped using and analysing de-identified data.
Apart from the small risk of being re-identified, it is also rather difficult to determine the characteristics of individuals in public datasets. As Barth-Jones writes in a 2011 study, “each attack must be customized to the particular de-identified database and to the population as it existed at the time of data-collection”. In addition, Paul Ohm, associate professor of law at the Colorado Law School, assures us that trustworthy re-identifications always require labour-intensive efforts. Re-identification is time-consuming, requires serious data management and statistics skills, and simply lacks the easy transmission and transferability seen in computer viruses.
Of course, this does not mean that we can stop paying serious attention to re-identification risks. Technology is moving forward, including re-identification techniques, and as we leave more data traces online it will become easier to re-identify individuals if appropriate measures are not taken. Measures such as the one recently imposed by European regulators, forcing Facebook to shut down its facial recognition feature, will be necessary, as companies will keep testing the limits of privacy law and hackers will always do their best to find incriminating information. Therefore, we should constantly reassess and strengthen de-identification and re-identification management techniques as technology improves, to ensure that public datasets can continue to be used to drive innovation.