IBM Chief Scientist Jeff Jonas has been working for the past several years on a project code-named G2 and later titled Sensemaking. It is a big data analytics project that seeks to take all of the disparate data that enterprises get from various sources and make sense of it so that relevant action can be taken. Jeff’s analysis shows that companies often have information somewhere but don’t make proper use of it. For instance, one retailer they looked at discovered that two out of every thousand new hires at its stores had previously been arrested for stealing from those same stores.
Jeff likens data and observations to puzzle pieces, and the grouping of those pieces to the way we organize information by putting it in context. Just as when building a puzzle, the first few pieces come together fairly quickly, then progress slows down, but as more and more pieces are laid on the table, the picture becomes clearer and it becomes easier to fit each piece into its correct spot. Finally, with each piece of the puzzle, the Sensemaking engine determines who needs to know about that information. Rather than an analyst querying the data, the information finds the analyst who needs it. Data finds data and information finds relevance. There is an excellent video of Jeff explaining the project and how it works available here. This is a very high-level view of Jeff’s work, and I didn’t want to spend too much time on it because we’re not here to talk about IBM’s product. We are here to talk about privacy, and Jeff made a conscious effort to design privacy into his Sensemaking engine.
He added seven features to his system in the name of privacy. All of them were built in such that IBM couldn’t sell the system without them. Three of them were mandatory and could not be turned off by the purchasing customer.
- False negative favoring (mandatory) – A false negative is something that is true but goes undetected: when the system declines to assert something that is in fact true, it produces a false negative. Why is this important in privacy? Think about the concept of innocent until proven guilty. Unless we can be sure of our assertions, the prudent course is not to make them. We don’t want to be sending people to jail or denying them admission because of something we assume but that may or may not be true.
- Self-correcting false positive (mandatory) – Closely related but not the same, this feature says that if we have a positive match on data and new information arrives that negates that match (in other words it was a false positive), then we need to correct it. That correction happens in real time, not days, weeks or months later.
- Data tethering – Many systems, especially in the big data world, update in batch on a periodic basis. In IBM’s Sensemaking system, however, data from the source system is tethered to the Sensemaking engine. This means that when changes appear in the source system, those changes automatically propagate and assumptions based on that data are altered (in real time).
- Full attribution – Understanding where data is coming from is imperative for privacy. If you’re going to be making decisions about people based on data, you need to be able to attribute that data to a source in case that source turns out to be faulty. The Sensemaking system shows exactly where all of the data came from, so that assertions are not made arbitrarily.
- Tamper resistant logs – People have a tendency to snoop. Numerous incidents have occurred in which law enforcement officers used official databases to track down ex-lovers, and tax authorities have looked at the tax returns of celebrities. In order to identify these breaches, it is important to have tamper resistant logs of all activity. Not only will the logs help catch breaches, but knowledge of the logs’ existence has a chilling effect that deters breaches in the first place.
- Transfer accounting – Knowing where you send information is important. Should anything occur downstream, you can identify what you sent, to whom and for what purpose. This is common in credit reporting agencies but not always found elsewhere.
- Anonymized analytics – Jeff built in the ability to transfer anonymous data downstream. This allows statistical calculations on data without revealing personally identifiable information.
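To make one of these features a little more concrete, here is a minimal sketch of a tamper resistant log built as a hash chain, where each entry's hash covers the previous entry's hash. This is only an illustration of the general idea and is not how IBM's Sensemaking engine implements it; the class name `HashChainedLog`, the record fields, and the sample users are all hypothetical.

```python
import hashlib
import json

# Placeholder "previous hash" for the very first entry (an assumption of this sketch).
GENESIS = "0" * 64


def _digest(prev_hash, record):
    """Hash a record together with the hash of the entry before it."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + payload).hexdigest()


class HashChainedLog:
    """Hypothetical append-only activity log; not IBM's implementation."""

    def __init__(self):
        self.entries = []

    def append(self, user, action, target):
        # Link the new entry to the hash of the most recent one.
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS
        record = {"user": user, "action": action, "target": target}
        self.entries.append(
            {"record": record, "prev": prev_hash, "hash": _digest(prev_hash, record)}
        )

    def verify(self):
        """Recompute every hash; any edited or removed entry breaks the chain."""
        prev_hash = GENESIS
        for entry in self.entries:
            if entry["prev"] != prev_hash:
                return False
            if entry["hash"] != _digest(prev_hash, entry["record"]):
                return False
            prev_hash = entry["hash"]
        return True


log = HashChainedLog()
log.append("officer_17", "lookup", "person_4821")
log.append("clerk_02", "view", "tax_return_991")
chain_ok_before = log.verify()

# Someone quietly rewrites what the first lookup was about.
log.entries[0]["record"]["target"] = "person_0000"
chain_ok_after = log.verify()
```

In this sketch the log checks out before the edit and fails verification afterwards. Note that this makes tampering detectable rather than impossible: someone able to rewrite every entry could recompute the whole chain, so a real deployment would presumably also keep the latest hash somewhere the snooper cannot reach.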
For more on IBM Sensemaking and the quest to build in privacy, see his paper at the Privacy by Design website. He also has additional information on each of the privacy features on his blog: jeffonas.typepad.com
On Tuesday, November 19th, I presented a talk on privacy at the 2013 Intel Security Conference. During this talk, I used Jeff’s Sensemaking engine as an example of how to build privacy into your engineering project.