Saturday, 24 November 2012

Maxims for Privacy

A collection of maxims relating to privacy might be interesting, as we seem to have some common principles that fit well with this developing field. Anyway, here are some that come to mind from my work:

Don't collect what you don't use

If you have no use for the information you're collecting from the user, or can attribute no value to it in terms of the product being presented, then don't collect it at all.

If it looks like PII, it probably is PII, so treat it as PII

PII doesn't just mean obvious identifiers but can extend to other kinds of data, eg: device identifiers, session identifiers, tracking data etc. If you can map a set of data to a single unique person, or even to a sufficiently small group of persons, then it probably is PII.

Don't shock the user

You collected what??!?!!  You shared what without my permission?!?!  How did you learn that about me??!?! I didn't agree to that??!?!?  T&C's...I didn't understand them, let alone read them...and you still did 'x' !! Enough said...

Location data isn't just GPS co-ordinates

Maybe this could be rewritten as "Never underestimate the power of geolocation functions". However, it does come as a surprise that many types of data can be transformed into locations, even if very granular...who needs a GPS receiver anyway?!  Pictures often contain vast amounts of EXIF data including locations, and mobile phone cell IDs can be mapped to precise locations and triangulated, for example...
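As a small illustration of how little work that transformation takes: EXIF GPS tags store latitude and longitude as degrees, minutes and seconds plus a hemisphere letter. The sketch below converts such values to decimal degrees. The function name and the sample values are made up for illustration, and reading the tags out of an actual image file is left out.

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees/minutes/seconds plus a hemisphere letter to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative by convention.
    return -value if ref in ("S", "W") else value


# Hypothetical values, as they might be read from a photo's GPS tags.
latitude = dms_to_decimal(60, 10, 15.0, "N")
longitude = dms_to_decimal(24, 56, 15.0, "E")
print(round(latitude, 4), round(longitude, 4))
```

Once the values are in decimal form they can be dropped straight into any mapping or reverse-geocoding tool, which is the point of the maxim.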

Good security does not mean good privacy, but good privacy doesn't come without good security

If your system isn't protecting the communications and data then it doesn't matter what you do to obfuscate or anonymise the data. Actually getting the balance between these sorted requires some serious engineering competence - security and privacy engineers are few and far between.

All information can be transformed and cross-referenced into whatever you need.

Need to think about how that one is written but whatever data you have can be transformed (eg: extracting usernames and other parameter data from URLs, meta-data extractions etc) and cross-referenced...
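To make the URL case concrete, here is a minimal sketch using only Python's standard `urllib.parse` module. The URL, the path layout and the parameter names are all invented for the example; real logs will look different, but the extraction step is about this simple.

```python
from urllib.parse import urlsplit, parse_qs


def extract_url_data(url):
    """Pull the host, path segments and query parameters out of a URL."""
    parts = urlsplit(url)
    return {
        "host": parts.hostname,
        # Path segments often carry usernames or resource identifiers.
        "path_segments": [s for s in parts.path.split("/") if s],
        # parse_qs maps each parameter name to a list of its values.
        "params": parse_qs(parts.query),
    }


# Hypothetical logged URL, purely for illustration.
info = extract_url_data(
    "https://example.com/users/alice/photos?lat=60.17&lon=24.94&session=abc123"
)
print(info["path_segments"])
print(info["params"])
```

A single logged URL here gives up a likely username, a location and a session identifier, each of which can then be cross-referenced with other data.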

This one is more of an apophthegm, and a reminder that we need to serve the users not ourselves:

Security through Obscurity, Privacy through PowerPoint and Policies...

See Jim Adler's talk at #pii2012 on the Emergent Privacy-Industrial Complex - a term coined by Jeff Jarvis in an article called Privacy, Inc: Scare and Sell. But seriously we forget that ultimately privacy gets implemented through good, sound software engineering in C, Java, Python etc. Getting privacy as an inherent part of system engineering practice (not just processes) is probably the most critical thing to get all those policies, laws and good intentions implemented. As Schneier likes to use the term "security theater", I think we can often use "privacy theater" to reflect many of the things we are doing, see: Ed Felten's posting on this on the Freedom to Tinker site.

Thursday, 22 November 2012

Information Privacy: Art or Science?

I was handed a powerpoint deck today containing notes for a training course on privacy. One thing that struck me was the statement on one of the slides, in fact it was the only statement on that slide:


This troubles me greatly and the interpretation of this probably goes a long way into explaining some things about the way information privacy is perceived and implemented.

What do we mean by art, and does this mean that privacy is not a science?

Hypothesis 1: Privacy is an art

If you've ever read great code it is artistic in nature. You can appreciate the amount of understanding and knowledge that has gone into writing that code. Not just at the act of writing, or the layout and indentation, but in the design of the algorithms, the separation of concerns, the holistic bigger picture of the architecture. Great code requires less debugging, performs well, stays in scope, and if it ever does require modification, it is easy to do. Great programmers are scientists - they understand the value to the code, they avoid technical debt, they understand the theory (maybe only implicitly) and the science and discipline behind their work and in that respect they are the true artists of their trade.

For example, Microsoft spent a lot of effort in improving the quality of its code with efforts such as those described in the still excellent book Code Complete by Steve McConnell. This book taught programmers great techniques to improve the quality of their code. McConnell obviously knew what worked and what didn't from a highly technical perspective based on a sound, scientific understanding of how code works, how code is written, how design is made and so on.

I don't think information privacy is an art in the above sense.

Hypothesis 2: Privacy is an "art".

In the sense that you're doing privacy well in much the same way as a visitor to an art gallery knows "great art". Everyone has their own interpretation and religious wars spring forth over whether something is art or not.

Indeed here is the problem, and in this respect I do agree that privacy is art. Art can be anything from the formal underpinnings of ballet to the drunken swagger of a Friday night reveler - who is to say that the latter is not art? Compare ballet with forms of modern and contemporary dance: ballet is almost universally considered "art" while some forms of contemporary dance are not - see our drunken reveler at the local disco...this is dance, but is it art?

Indeed sometimes the way we practice privacy is very much like the drunken reveler but telling everyone at the same time that "this is art!"

What elevates ballet, or the great coder, to become art is that they both have formal, scientific underpinnings. Indeed I believe that great software engineering and ballet have many similarities and here we can also see the difference between a professional dancer and a drunken reveler on the dance floor: one has formal training in the principles and science of movement, one does not.

Indeed if we look at the sister to privacy: security, we can be very sure that we do not want to practice security of our information systems in an unstructured, informal, unscientific manner. We want purveyors of the art - artists - of security to look after our systems: those that know and intuitively feel what security is.

There are many efforts to better underpin information privacy, but rarely do these come through in the software engineering process in any meaningful manner unless explicitly required or audited for. Even then we are far from a formal, methodical process by which privacy becomes an inherent property of the systems we are building. When we achieve this as a matter of the daily course of our work then, and only then, privacy will become an art practiced by artists.

Tuesday, 20 November 2012

Understanding PII

The most important question in data management related to privacy is whether a given set of data contains personally identifiable information (PII). While PII is fairly well defined (maybe not with a strict formal definition), the question of what constitutes PII and how to handle it often remains. In this article I intend to describe some notions on the linkability and traceability of identifiers and their presence in a data set.

It is worth mentioning that the definition of PII is currently being addressed by the EU Article 29 Working Party which will eventually bring additional clarification at a legal level beyond that existing in the current privacy directives. Ultimately however, the implementation of privacy is in the hands of software and system architects, designers and programmers.

I'll start with a (personal) definition: data is personally identifiable if, given any record of data, there are fields that contain data that can be linked with a unique person, or with a sufficiently small group sharing a common, unique attribute or characteristic.

We are required then to look at a number of issues:
  • what identifiers or identifying data is present
  • how linkable the data is to a unique person
  • how traceable a unique person is through the data set
Much is based upon the presence of identifiers in a data set and the linkability of those identifiers. For example, user account identifiers and social security numbers are very obviously in a one-to-one correspondence with a unique person and thus are trivially linkable to a unique user. Identifiers such as IMEI or IP addresses do not have this one-to-one correspondence. There are some caveats, such as when a user account or a device is shared or used occasionally (with permission perhaps) by more than one person.

The notion of a unique person does not necessarily mean that an actual human being can be identified but rather can be traced, unambiguously through the data. For example, an account identifier (eg: username) identifies a unique person but not a unique human being - this is an important, if subtle distinction.

We can construct a simple model of identifiers and how these identifiers relate to each other and to "real world" concepts - some discussion on this was made in earlier articles. These models of identifiers and their relationships and semantics are critical to a correct understanding of privacy and analysis of data. Remarkably this kind of modelling is very rarely made and very rarely understood or appreciated - at least until the time comes to cross-reference data and these models are missing (cf: semantic isolation).
Example Semantic Model of Identifier Relationships

Within these models, as seen earlier, we can map a "semantic continuum" from identifiers that are highly linkable to unique and even identified persons to those which are not.

Further along this continuum we have identifiers such as various forms of IMEI, telephone number, IP addresses and other device identifiers. Care must be taken here: while these identifiers are not highly linkable to a unique person, devices and especially mobile devices are typically used by a single unique person, leading to a high degree of inferred linkability.

Device addresses such as IP addresses have come under considerable scrutiny regarding whether they do identify a person. Cases where the IP address of a router has been used to identify whether someone has been downloading copyrighted or other illegal material are problematic for the reasons described earlier regarding linkability. In this specific case of IP addresses, network address translation, proxies and obfuscation/hiding mechanisms such as Tor complicate and minimise linkability.

As we progress further along we reach the application and session identifiers. Certainly application identifiers, if they link to applications and not individually deployed instances of applications, are not PII unless the application has a very limited user base: an example of a sufficiently small group with common characteristics. For example, an identifier such as "SlighlyMiffedBirdsV1.0" used across a large number of deployments is very different from "SlighlyMiffedBirdsV1.0_xyz" where xyz is issued uniquely to each download of that application. Another very good example of this kind of personalisation of low linkability identifiers is the user agent string used to identify web browsers.

Session identifiers ostensibly do not link to a person but can reveal a unique person's behaviour. On their own they do not constitute PII. However session identifiers are invariably used in combination with other identifiers which does increase the linkability significantly. Session identifiers are highly traceable in that sessions are often very short with respect to other identifiers - capture enough sessions and one can fingerprint or infer common behaviour of individual persons.

When evaluating PII, identifiers are often taken in isolation and the analysis made there. This is one of the main problems in evaluating PII: identifiers rarely exist in isolation, and a combination of identifiers together can reveal a unique identity.
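A toy sketch of that effect: the function below measures what fraction of records are unique for a chosen set of fields. The records, the field names (`ip_prefix`, `lang`) and the function name are all invented for illustration; only the application identifier string comes from the example above.

```python
from collections import Counter


def unique_fraction(records, fields):
    """Fraction of records whose values for `fields` occur exactly once in the data set."""
    keys = [tuple(record[f] for f in fields) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


# Hypothetical data set: no single field singles anybody out.
records = [
    {"app": "SlighlyMiffedBirdsV1.0", "ip_prefix": "10.0.1", "lang": "fi"},
    {"app": "SlighlyMiffedBirdsV1.0", "ip_prefix": "10.0.1", "lang": "en"},
    {"app": "SlighlyMiffedBirdsV1.0", "ip_prefix": "10.0.2", "lang": "fi"},
    {"app": "SlighlyMiffedBirdsV1.0", "ip_prefix": "10.0.2", "lang": "en"},
]

for fields in (["app"], ["ip_prefix"], ["lang"], ["ip_prefix", "lang"]):
    print(fields, unique_fraction(records, fields))
```

Taken one at a time, none of the fields isolates a record; taken together, `ip_prefix` and `lang` make every record in this toy set unique.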

Just dealing with identifiers alone, and even deciding what identifiers are in use, provides the core of deciding whether a given data set is PII. However, while identifiers provide the linkability, they do not provide traceability, which is given through the temporal components that are often, if not invariably, present in data sets. We will deal with other dimensions of data later where we look deeper into temporal, location and content data. Furthermore we will also look at how data can be transformed into other types of data to improve traceability and linkability.

Monday, 19 November 2012

Do Not Track and Beyond (W3C Workshop)

There's a workshop on Do Not Track organised by the W3C:

W3C Workshop: Do Not Track and Beyond  

26-27 November 2012


Out of the April 2011 W3C workshop on Web Tracking and User Privacy, W3C chartered its Tracking Protection Working Group, which commenced work in September. The Working Group has produced drafts of Do Not Track specifications, concurrent with various implementations in browsers and Web sites and along side heightened press and policymaker attention. Meanwhile, public concern over online privacy — be it tracking, online social networking or identity theft — remains. 

A large number of very interesting papers to be presented including one of mine based on earlier articles I have written here on a dystopian post-DNT future and the update to that.

Ian Oliver (2012)  An Advertisers Paradise: An Adventure in a Dystopian Post-“Do Not Track World”? W3C Workshop: Do Not Track and Beyond, 26-27 November 2012.


Tuesday, 13 November 2012

Measuring Privacy against Effort to Break Security

As part of my job I've needed to look at metrics and measurement of privacy. Typically I've focussed on information entropy versus, say, number of records (define "record") or other measurements such as amount of data which do not take into consideration the amount of information, that is, the content of the data being revealed.
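To show why record counts alone say little, here is a minimal sketch of Shannon entropy computed over the values of a single field. The function name and the sample columns are invented, and this treats the empirical value frequencies as the "model", which is a simplifying assumption.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Two hypothetical columns with the same number of records.
same_city = ["Helsinki", "Helsinki", "Helsinki", "Helsinki"]
mixed_city = ["Helsinki", "Espoo", "Vantaa", "Turku"]

print(shannon_entropy(same_city), shannon_entropy(mixed_city))
```

Both columns hold four records, but the first reveals nothing that distinguishes one record from another while the second carries two bits per record.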

So this led to an interesting discussion* with some of my colleagues where we looked at a graph like this.

The y-axis is a measure of information content (ostensibly information entropy with respect to some model) and the x-axis a measure of the amount of force required to obtain that information. For any given hacking technique we can delineate a region on the x-axis which corresponds to the amount of sophistication or effort placed into that attack. The terms effort and force here come from physics, and I think we even have some ideas on how the dimensions of these map to the security world, or actually what these dimensions might be.

So for a given attack 'x', for example an SQL injection attack against some system to reveal some information 'M', we require a certain amount of effort just for the attack to reveal something. If we make a very sophisticated attack then we potentially reveal more. This is expressed as the width of the red bar in the above graph.

One conclusion here is that security people try to push the attack further to the right and even widen it, while privacy people try to lower and flatten the curve, especially through the attack segment.

Now it can be argued that even with a simple attack, over time the amount of information increases, which brings us to a second graph which takes this into consideration:

Ignoring the bad powerpoint+visio 3D rendering, we've just added a time scale (z-axis, future towards back). We can now capture, or at least visualise, the statement above that even an unsophisticated attack over time can reveal a lot of information. Then there's a trade-off between a quick sophisticated attack versus a long, unsophisticated attempt.

Of course a lot of this depends upon having good metrics and good measurement in the first place, and that is where we have real difficulties, though there is some pretty interesting literature [1,2] on the subject and in the case of privacy some very interesting calculations that can be performed over the data such as k-anonymity and l-diversity.
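For readers who haven't met those two calculations, here is a rough sketch of both. It uses the simplest "distinct values" form of l-diversity, which is an assumption on my part since there are several variants, and the records, field names and function names are invented for illustration.

```python
from collections import defaultdict


def _group(records, quasi_identifiers):
    """Group records by their values for the quasi-identifier fields."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[q] for q in quasi_identifiers)].append(record)
    return groups


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    return min(len(g) for g in _group(records, quasi_identifiers).values())


def l_diversity(records, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found within any such group."""
    return min(
        len({record[sensitive] for record in g})
        for g in _group(records, quasi_identifiers).values()
    )


# Hypothetical, already generalised data set.
records = [
    {"zip": "00100", "age": "30-39", "diagnosis": "flu"},
    {"zip": "00100", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "00200", "age": "40-49", "diagnosis": "flu"},
    {"zip": "00200", "age": "40-49", "diagnosis": "flu"},
]

k = k_anonymity(records, ["zip", "age"])
l = l_diversity(records, ["zip", "age"], "diagnosis")
print(k, l)
```

Every record here hides in a group of two, yet the second group shares a single diagnosis, so knowing someone is in that group reveals the sensitive value anyway. That gap is what l-diversity is meant to catch.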

I have a suspicion that we should start looking at privacy and security metrics from the dimensional analysis point of view and somewhat reverse engineer what the actual units and thus measurements are going to be. Something to consider here is that the amount of effort or force of an attack is not necessarily related to the amount of computing power, for example, brute forcing an attack on a hash function is not as forcible as a well planned hoax email and a little social engineering.

If anyone has ideas on this please let me know.


[1] Michele Bezzi (2010) An information theoretic approach for privacy metrics. Transactions on Data Privacy 3, pp:199-215
[2] Reijo M. Savola (2010) Towards a Risk-Driven Methodology for Privacy Metrics Development. IEEE International Conference on Social Computing/IEEE International Conference on Privacy, Security, Risk and Trust.

*for "discussion" read 'animated and heated arguments, a fury of writing on whiteboards, excursions to dig out academic papers, mathematics, coffee etc' - all great stuff :-)

Tuesday, 6 November 2012

Inherent Privacy

I've long had a suspicion that when building information systems and specifically when reviewing and auditing such systems that the techniques that need to be used and developed are effectively the very same tools and techniques that are used in the construction of safety-critical and fault tolerant systems.

As privacy is fast becoming the number one issue (if it isn't already) with regards to consumers' data, the amount of effort in consumer advocacy and legal aspects is outstripping the ability of the technical community to keep up with the required architectures, platforms, design, implementation and techniques for achieving this. Indeed there is some kind of arms race going on here and I'm not sure it really is in the benefit of the consumer.

For example, the Do Not Track technical community has come under criticism for not delivering a complete solution [1,2]. I don't think this really is 100% the fault of the W3C or the good people developing these standards but rather a result of
  1. a lack of understanding of information systems (even in the theoretical sense) and 
  2. a lack of applicable and relevant tools and techniques for the software engineering community who at the end of the day are the ones who will end up writing the code that implements the legal directives in some form or other. But we digress.
Performing a little research we come across the term "Inherent Safety" (see [3]), defined:
Inherent safety is a concept particularly used in the chemical and process industries. An inherently safe process has a low level of danger even if things go wrong. It is used in contrast to safe systems where a high degree of hazard is controlled by protective systems. It should not be confused with intrinsic safety which is a particular technology for electrical systems in potentially flammable atmospheres. As perfect safety cannot be achieved, common practice is to talk about inherently safer design. “An inherently safer design is one that avoids hazards instead of controlling them, particularly by reducing the amount of hazardous material and the number of hazardous operations in the plant” [3].

Taking this as a starting place I decided to have a go at rewriting the principles in the privacy context as below - taking the extensions as proposed in [4] into consideration:

  • Minimize: reducing the amount of information/data present at any one time either in local storage, remote storage cache or network transfer
  • Substitute: replace one element of data with another of less privacy risk, eg: abstract GPS coordinates to city areas
  • Moderate: reduce the strength of a process to transform or analyse data, eg: reduce the amount of crossreferencing over a number of sets of data
  • Simplify: reduce the complexity of processes to control data, eg: single opt-out, one-time consent, simple questions (and not asking the user what resources an app should have access to...)
  • Error Tolerance: design the system/software and processes to deal with worst cases, eg: misbehaving applications sending too much data are blocked in some sense
  • Limit Effects: designing the system such that the effects of any leak of data are minimised, eg: properly anonymised or pseudo-anonymised data sets, secure transport layers, encryption etc
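The first two principles in the list above can be sketched in a few lines. This is only an illustration under loud assumptions: rounding coordinates to one decimal place is a crude stand-in for "abstracting to city areas", and the record layout, field names and function names are invented.

```python
def minimize(record, needed_fields):
    """Minimize: keep only the fields the product actually uses."""
    return {key: value for key, value in record.items() if key in needed_fields}


def substitute_location(lat, lon, decimals=1):
    """Substitute: replace precise coordinates with a coarser grid cell.

    One decimal place of latitude is roughly 11 km, an assumed
    approximation of a city-sized area rather than a real city lookup.
    """
    return round(lat, decimals), round(lon, decimals)


# Hypothetical record as a client application might send it.
raw = {
    "user": "u-123",
    "lat": 60.1699,
    "lon": 24.9384,
    "contacts": ["a", "b"],
    "battery": 0.8,
}

kept = minimize(raw, {"lat", "lon"})
coarse = substitute_location(kept["lat"], kept["lon"])
print(kept, coarse)
```

Doing both steps as close to the point of collection as possible means the precise values and the unused fields never reach storage at all.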

While admittedly not fully worked out, I feel that this is more communicable and understandable to software engineers than, say, Privacy by Design, which, while it lays out a good set of principles, is too high-level and abstract to map to the engineers and their day-to-day work. Actually I feel the problem with the Principles of Privacy by Design is that they can be (and are!) taken like the principles of the Agile Manifesto leading to some bizarre and extreme (and wrong!) ideas of what Agile Processes are - just take a look at some of the later writings of Ward Cunningham or Scott Ambler on the subject.

One aspect of the inherent safety idea that particularly appeals is that it is more grounded in engineering and practical development rather than being a set of principles. Indeed much of the grounding for this work comes from a very practical need and development through sound engineering practice espoused by Trevor Kletz. His quote "what you don't have, can't leak" applies equally to information as it does to hazardous substances; Kletz's book [5] maybe should become required reading along with Solove and Nissenbaum.

As a further example, the HAZOP (Hazard and operability study) method(s) are purely formal methods in the true sense of the word in constructing a small, formally defined vocabulary and modelling standards - compare the process flow diagram in chemical and process engineering with the data-flow diagram in software engineering, for example.

I'd like to finish with a reference to HACCP (Hazard analysis and critical control points) which itself has a number of principles (seven seems to be a common number), but here I'd like to concentrate for the moment on just two:

Principle 2: Identify critical control points. – A critical control point (CCP) is a point, step, or procedure in a food manufacturing process at which control can be applied and, as a result, a food safety hazard can be prevented, eliminated, or reduced to an acceptable level.

Where do we control the flow of information? Is it near the user or far from? Is it honour based and so on? The further from the source of the information, the greater the chance of leakage (both in information and chemical systems).

Principle 4: Establish critical control point monitoring requirements. – Monitoring activities are necessary to ensure that the process is under control at each critical control point. In the United States, the FSIS is requiring that each monitoring procedure and its frequency be listed in the HACCP plan.

This is something that I guess we're very bad with - do we ever monitor what information we keep in databases? Even the best intentions and best designs might still leak something, and this is especially true when working with large sets of information that can be cross referenced and fingerprinted. Maybe we should consider some kinds of information to be analogous to bacteria or viruses in a food handling situation?
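As a very rough idea of what such monitoring could look like, the sketch below scans stored records for values that look like email addresses or IPv4 addresses. The two regular expressions are deliberately simple, assumed patterns that will miss cases and raise false alarms, and the table contents and function name are invented.

```python
import re

# Deliberately simple, assumed patterns; not a complete PII detector.
EMAIL_LIKE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_LIKE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def scan_records(records):
    """Return (row index, field name, kind) for every value that looks like PII."""
    findings = []
    for index, record in enumerate(records):
        for field, value in record.items():
            text = str(value)
            if EMAIL_LIKE.search(text):
                findings.append((index, field, "email-like"))
            if IPV4_LIKE.search(text):
                findings.append((index, field, "ipv4-like"))
    return findings


# Hypothetical rows from a free-text "notes" table.
rows = [
    {"id": 1, "note": "call back tomorrow"},
    {"id": 2, "note": "contact alice@example.com from 192.0.2.10"},
]

findings = scan_records(rows)
print(findings)
```

Run on a schedule against each store, something like this plays the role of the monitoring procedure at a critical control point: it does not prevent the leak into the free-text field, but it tells you that it has happened.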

So, this is just one example of where we should be getting the basis for the tools, techniques and principles of really engineering for privacy. I stand by my assertion that in order to engineer information systems correctly for privacy we must consider those systems to be safety-critical and treat them accordingly. I'll discuss counterarguments and how we get such techniques into our "agile" software engineering processes later.


[1] Stefan Kremp, European Commissioner concerned about "Do Not Track" standard, The H Open. 12 October 2012

[2] Claire Davenport, Tech standards body diluting Web privacy: EU official, Reuters, 10 October 2012

[3] Heikkilä, Anna-Mari. Inherent safety in process plant design. An index-based approach. Espoo 1999, Technical Research Centre of Finland, VTT Publications 384. ISBN 951-38-5371-3

[4] Khan, F. I. & Amyotte, P. R. (2003) How to make inherent safety practice a reality. Canadian Journal of Chemical Engineering vol 81 pp 2-16

[5] Kletz, T. A., (1991) Plant Design for Safety – A User-Friendly Approach, Hemisphere, New York