It is worth mentioning that the definition of PII is currently being addressed by the EU Article 29 Working Party which will eventually bring additional clarification at a legal level beyond that existing the the current privacy directives. Ultimately however, the implementation of privacy is in the hands of software and system architects, designers and programmers.
I'll start with a (personal) definition: data is personally identifiable if, given any record of data, are there fields that contain data that can be linked with a unique person, or, a sufficiently small group with a common, unique attribute or characteristic.
We are required then to look at a number of issues:
- what identifiers or identifying data is present
- how linkable the data is to a unique person
- how traceable a unique person is through the data set
The notion of a unique person does not necessarily mean that an actual human being can be identified but rather can be traced, unambiguously through the data. For example, an account identifier (eg: username) identifies a unique person but not a unique human being - this is an important, if subtle distinction.
We can construct a simple model of identifiers and how these identifiers relate to each other and to "real world" concepts - some discussion on this was made in earlier articles. These models of identifiers and their relationships and semantics are critical to correct understanding of privacy and analysis of data. Remarkably this kind of modelling is very rarely made and very rarely understood or appreciated - at least until time comes to cross-reference data and these models are missing (cf: semantic isolation).
|Example Semantic Model of Identifier Relationships|
Within these models as seen earlier, we can map a "semantic continuum" from identifiers that are highly linkable to unique and even identified persons to those which do not.
Further along this continuum we have identifiers such as various forms of IMEI, telephone number, IP addresses and other device identifiers. Care must be made in that while these identifiers are not highly linkable to a unique person, devices and especially mobile devices are typically used by a single unique person leading to a high degree of inferred linkability.
Device addresses such as IP addresses have come under considerable scrutiny regarding whether they do identify a person. In cases where an IP address of a router has been used to identify is someone has been downloading copyrighted or other illegal material is problematical for the reasons described earlier regarding linkability. In this specific case of IP addresses, network address translation, proxies and obfuscation/hiding mechanisms such as Tor complicate and minimise linkability.
As we progress further along we reach the application and session identifiers. Certainly application identifiers if they link to applications and not individually deployed instances of applications are not PII unless the application has a very limited user base: an example of a sufficiently small, group with common characteristics. For example, an identifier such as "SlighlyMiffedBirdsV1.0" used across a large number of deployments is very different from "SlighlyMiffedBirdsV1.0_xyz" where xyz is issued uniquely to each download of that application. Another very good example of this kind of personalisation of low linkability identifiers is the user agent string used to identify web browsers.
Session identifiers ostensibly do not link to a person but can reveal a unique person's behaviour. On their own they do not constitute PII. However session identifiers are invariably used in combination with other identifiers which does increase the linkability significantly. Session identifiers are highly traceable in that sessions are often very short with respect to other identifiers - capture enough sessions and one can fingerprint or infer common behaviour of individual persons.
When evaluating PII, Identifiers are often taken in isolation and analysis made there. This is one of the main problems in evaluating PII: Identifiers rarely exist in isolation and a combination of identifiers together reveals a unique identity.
Just dealing with identifiers alone, and even deciding what identifiers are in use provides the core of deciding whether a given data set PII. However identifiers provide the linkability, they do not provide tracability which is given through the temporal components which is often, if not, invariably present in data sets. We will deal with other dimensions of data later where we look deeper into temporal, location and content data. Furthermore we will also look at how data can be transformed into other types of data to improve traceability and linkability.