Wednesday, 28 May 2014

How much data?!?!

I took part as one of the speakers in a presentation about analytics today; explaining how data is collected through instrumentation of applications, web pages etc, to an audience who are not familiar with the intricacies of data collection and analytics.

We had a brief discussion about identifiers and what identifiers actually are, which was enlightening and will hopefully have prevented a few errors later on. This bears explaining briefly: an identifier is rarely a single field, but should be considered to be any one of the subsets of the whole record. There are caveats of course: some fields can't be used as part of a compound identifier, but the point here was to emphasise that you need to examine the whole record, not just individual fields in isolation.

The bulk of the talk, however, introduced where data comes from. For example, if we instrument an application such that a particular action is collected, then we're not just collecting an instance of that action but also whatever contextual data is provided by the instrumentation, plus the data from the traffic or transport layer. It came as a surprise that so much information is available via the transport/traffic layers.

Said meta-data includes location, device/application/session identifiers, browser and environment details and so on, and so on...

Furthermore, data can be cross-referenced with other data after collection. A canonical example is geolocation over IP addresses to provide information about location. Consider the case where a user switches off the location services on his or her mobile device; location can still be inferred later in the analytics process to a surprisingly high level of accuracy.

If data is collected over time, then even though we are not collecting specific latitude-longitude coordinates, we are collecting data about the movements of a single, unique human being, even though no 'explicit' location collection appears to be taking place. If you find that somewhat disturbing, consider what happens every time you pay with a credit card or use a store card.

Then of course there's the whole anonymisation process where once again we have to take into consideration not just what an identifier is, but the semantics of the data, the granularity etc. Only then can we obtain an anonymous data set. Such a data set can be shared publicly...or maybe not as we saw in a previous posting.  

Even when one starts tokenising and suppressing fields, the k-anonymity remains remarkably low, typically with more than 70% of the records remaining unique within that dataset. Arguments about the usefulness of k-anonymity notwithstanding, it is one of the few privacy metrics we have.
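
To make the metric concrete, here is a minimal sketch of how k and the fraction of unique records might be computed. The field names and records are hypothetical and were not part of the talk; the function simply groups records by their combination of quasi-identifier values.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return (k, fraction_unique) for the given quasi-identifier fields."""
    # Group records by their combination of quasi-identifier values.
    groups = Counter(tuple(r[f] for f in quasi_identifiers) for r in records)
    k = min(groups.values())  # size of the smallest group
    unique = sum(1 for size in groups.values() if size == 1)
    return k, unique / len(records)

# Hypothetical records: the user ID has already been suppressed.
records = [
    {"city": "Helsinki", "device": "phone-a", "hour": 9},
    {"city": "Helsinki", "device": "phone-a", "hour": 9},
    {"city": "Espoo", "device": "phone-b", "hour": 9},
    {"city": "Helsinki", "device": "tablet-c", "hour": 17},
]

k, fraction_unique = k_anonymity(records, ["city", "device", "hour"])
print(k, fraction_unique)  # 1 0.5
```

Even in this toy data set, half of the records are unique on just three innocuous-looking fields, so k is 1.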

So, the lesson here is rather simple: you're collecting a massive amount more than you really think.

The next surprise was how tricky or "interesting" this becomes when developing a privacy policy that contains all the necessary details about data collection, meta-data collection, traffic data collection; and then the uses to which that data is put, whether it is primary or secondary collection and so on.

Friday, 23 May 2014

Surgical privacy: Information Handling in an Infectious Environment

What have privacy engineering, data flow modelling and analysis got to do with how infectious materials and the sterile field are handled in medical situations? Are there things we can learn by drawing an analogy between these seemingly different fields?

We've discussed this subject earlier and a few links can be found here. Indeed privacy engineering has a lot to learn from analogous environments such as aviation, medicine, anaesthesia, chemical engineering and so on; the commonality here is that those environments understood they had to take a whole systems approach rather than relying upon a top-down driven approach or relying upon embedding the semantics of the area in one selected discipline.

Tuesday, 20 May 2014

Foundations of Privacy - Yet Another Idea

Talking with a colleague about yesterday's post on "a" foundation for privacy, or privacy engineering, he complained that the model wasn't complete. Of course the structuring is just one possible manifestation and others can be put together to take into consideration other views, or to provide a semantics of privacy in differing domains. For example, complete with semantic gaps, we might have a model which presents privacy law and policies in terms of economic theory which in turn is grounded in mathematics:

Then place the two models side-by-side and "carve" along the various tools, structures, theories etc that each uses and note the commonalities and differences, and then try to reconcile those.

The real challenge here is to decompose each of those areas into the theories, tools etc that are required to properly express each level. Then for each of those areas such as listed earlier, eg: type theory, programming, data flow, entropy etc, map each of these together. For example, a privacy policy might talk about anonymity and in turn anonymity of a data set can be given a semantics in terms of entropy.
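
As a rough illustration of that last mapping, one possible reading is to treat anonymity as an observer's uncertainty, measured in bits, about which individual a record belongs to. The sketch below computes the Shannon entropy of an empirical distribution; the candidate lists are made up and this is only one of several ways such a semantics could be given.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical: which individual each of four matching records points to.
uniform = ["a", "b", "c", "d"]  # four equally likely candidates
skewed = ["a", "a", "a", "b"]   # one candidate dominates

print(shannon_entropy(uniform))  # 2.0
print(shannon_entropy(skewed))   # lower: less uncertainty, less anonymity
```

The more the distribution is skewed towards one candidate, the lower the entropy and the weaker the anonymity that the policy statement actually delivers.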

Actually this is where the real details are embedded, and the levels as we have depicted them are vague, fuzzy classifications for convenience of grouping these together.

Monday, 19 May 2014

Foundations of Privacy - Another Idea

This got triggered by a post on LinkedIn about what a degree in privacy might contain. I've certainly thought about this before, at least in terms of software engineering, and even have a whole course that could be taken over a semester ready to go.

Aside: CMU has the "World's First Privacy Engineering Course": a Master of Science in Information Technology—Privacy Engineering (MSIT-PE) degree. So, close, but a major university here in Finland turned down the chance to create something similar a few years back...

That aside, I've been wondering about how to present the various levels of things we need to consider to properly define privacy and put it on strong foundations. In the guise of information theory we already have this, though admittedly Shannon's seminal work from the 1940s is maybe a little too deep. On the other hand, concepts such as channels and entropy are fundamental building blocks, so maybe they should be there along with privacy law - now that would make some course!

How about even just sketching out the areas to present and what each might contain, even if a linear map from morality to mathematics is too constraining?

There are missing bits - we still have a semantic gap between the "legal world" and the "engineering world"; parts that I'm hoping that things such as the many conferences, academic works and books such as the excellent Privacy Engineer's Manifesto and Privacy Engineering will play a role in defining. Maybe the semantic gap goes away once we start asking whether there even is a semantic gap?

However, imagine for a moment starting anywhere in this stack and working up and down and keeping everything linked together in the context of privacy and information security. Imagine seeing the link between EU privacy laws and type theory, or between the construction of policies and entropy, the algebra of HIPAA, a side course in homotopy type theory and privacy...maybe with that last one I'm getting carried away, but, this is exactly what we need to have in place.

Each layer provides the semantics to the layer above - what do our morals and ethics mean in terms of formalised laws, what do laws mean in terms of policies, what do policies mean in terms of software engineering structures, and down to the core mathematics and algebras of information.

Privacy and privacy engineering in particular almost has everything: law, algebra, morals, ethics, semantics, policy, software, entropy, information, data, BigData, Semantic Web etc etc etc. Furthermore, we have links to areas such as security, cryptography, economic theory etc!

Aren't these the very things any practitioner of privacy (engineering) should know, or at least have knowledge of? Imagine if lawyers understood information theory and semantics, and, software engineers understood law? 

OK, so there might be various ways of putting this stack together, competing theories of privacy etc, but that would be the real beauty here - a complete theory of privacy from the core mathematics through physics, computation, type theory, software engineering, policies, law and even ethics and morals.

But again, no more naivety, no more terminological or ontological confusions, policies and laws being traceable right down to the computation structures and code. Quite a tall order, but such a course bringing all these together really would be wonderful...

And wouldn't that be something!

An Access Control Paradox

The canonical case for data flow and privacy is some data collection from a set of identifiable individuals and the generation of insights (formerly called reports) about these. In order to protect privacy we will apply the necessary security and access controls and anonymisation of log files as necessary.

Let's consider the case where we generate a number of reports, and we'll order them according to some metric of their information content and specifically how easy or possible it is to re-identify the original sources.

Consider the system below, we collect from a user their user ID, device ID and location - this is some kind of tracking application, or for that matter, any kind of application we typically have on our mobile devices, eg: something for social media, photo sharing etc...

We've taken necessary precautions for privacy - we'll assume there's notice and consent given - in that the user's data is passed using a secure channel into our system. Processing of this data takes place and we generate two reports:
  1. The first containing specific data about the user
  2. The second using some anonymous ID associated with certain event data for logging purposes only. This report is very obviously anonymous!
For additional security purposes we'll even restrict access to the former because it contains PII - but the second which is anonymous doesn't need such protection.

In many cases this is considered sufficient - we've the notice and consent and all necessary access controls and channel security. Protecting the report or file with the sensitive data in it is a given. But now the less sensitive data is often forgotten in all of this:
  • How is the identifier generated?
  • How granular is the time stamp?
  • What does the "event" actually contain?
  • Who has access?
  • How is this all secured?
Is the identifier some compound of data, hashed and salted, for example:
salt = "thesystem";
id = sha256(deviceId + userid + salt);
This would at least allow analysis over unique user+device combinations and the salt, if specific to this logfile or system, then restricts matching to this log file only. Assuming of course the salt isn't known outside of here.
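
A runnable version of that fragment might look like the sketch below, using Python's standard hashlib. The function name and example IDs are hypothetical, and the "|" separator is my addition rather than part of the original fragment; it avoids two different device/user pairs concatenating to the same string.

```python
import hashlib

SALT = "thesystem"  # illustrative value only; a real salt should be kept secret

def anonymous_id(device_id: str, user_id: str, salt: str = SALT) -> str:
    """Derive a stable pseudonymous ID from the device ID, user ID and salt."""
    # The separator prevents "ab" + "c" and "a" + "bc" hashing to the same ID.
    material = "|".join((device_id, user_id, salt))
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

same_system = anonymous_id("device-1", "alice")
other_system = anonymous_id("device-1", "alice", salt="othersystem")
print(same_system != other_system)  # True
```

The same user and device always map to the same ID within one system, while a different salt gives an unrelated ID, which is what restricts matching to a single log file.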

The timestamp is of less importance, but a very coarse-grained timestamp would prevent the sequencing of events.

The contents of the event are always interesting - what data is stored there? What needs to be and how? If this is some debug log then there's probably just as much here as there is in the report containing the PII. Often it might just be stack traces (with or without parameters), or memory dumps - both of which contain interesting data, even if it is just a pointer to where a weakness in the system might exist.

Now come the questions of who has access and how is this secured? Given that such a report has interesting content shouldn't this be as secure as the report containing specific and identifiable user data? If there's some shared common knowledge could rainbow tables of hashes etc be constructed?
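
To see why shared knowledge matters, consider this sketch of a precomputed lookup table (simpler than a true rainbow table, but the effect is the same). It assumes the salt is known and that a list of candidate device and user IDs is available; all the names and values here are hypothetical.

```python
import hashlib

def anonymous_id(device_id, user_id, salt):
    # Same construction as the fragment in the text: plain concatenation.
    material = device_id + user_id + salt
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

def build_lookup(candidates, salt):
    """Precompute hash -> (device_id, user_id) for every candidate pair."""
    return {anonymous_id(d, u, salt): (d, u) for d, u in candidates}

salt = "thesystem"  # assumed to be common knowledge inside the company
anonymous_log = [anonymous_id("device-7", "bob", salt)]

# Hypothetical candidate list, eg: taken from the restricted PII report.
candidates = [("device-1", "alice"), ("device-7", "bob")]
lookup = build_lookup(candidates, salt)
reidentified = [lookup.get(entry) for entry in anonymous_log]
print(reidentified)  # [('device-7', 'bob')]
```

Anyone holding both the salt and a list of plausible IDs can reverse the "anonymous" identifier by simple enumeration; no weakness in SHA-256 is needed.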

Consider this situation:

Where two separate systems exist, but there exists a common path between these systems which can be exploited because access control wasn't considered necessary for such "low grade", non-personal data.

Any common path is the precursor to de-anonymisation of data.

This might seem to be a rather trivial situation, except that such shared access and common knowledge of things such as salts, keys etc exist in most companies, large and small. In the latter it is often hard to avoid. Mechanisms such as employee contracts and awareness training actually do very little to solve this problem as they aren't designed to address or even understand this problem.

And here lies the paradox of access control: while we guard reports, files, datasets containing PII, we fail to address the same when working with anonymous data - whatever anonymous means.

Monday, 12 May 2014

Privacy and Big Data in Medicine

A short article by myself on the subject of privacy in medicine was just published in the web magazine Britain's Nurses. Quite an experience writing for a very different audience than software engineers, but extremely interesting to note the similarities between the domains.

When it comes to privacy, one of the seemingly infinite problems we face is how to develop the techniques, tools and technologies in our respective domains. Here again we have the choice of reinventing the wheel or looking to different domains and using their knowledge and experiences. This latter route is much preferred but rarely taken.

So for the moment, I'll take the chance to look back on previous articles that draw lessons from other domains:
Domains such as medicine, civil engineering and especially aviation have been through this process. As information rises in value, that is, as the economic effects of a data breach or loss of consumer confidence reach levels where companies will figuratively crash, so does the need to take in these learnings and treat information handling as any other element in a safety-critical system.

Finally the article I mentioned: Privacy in Digital Health, 12 May 2014, Britain's Nurses

Thursday, 8 May 2014

Checklists and Design by Contract

One of the problems I am having with checklists is that they are often, or nearly always, confused with processes: the "this is the list of steps we have to do and then you tick them off and all is well" mentality. This is probably why in some cases checklists have been renamed "aide memoirs" [1] and why their use and implementation is so misunderstood.

In the case of aviation or surgical checklists, these do not signify whether it is "safe" to take off or start whatever procedure, but act as a reminder to the practitioner and supporting team that they have reached a place where they need to check on their status and progress. The decision to go or no-go is not the remit of the checklist. For example, once a checklist is complete a pilot is free to choose whether to take off or not irrespective of the answers given to the items on the checklist (cf: [2]).

This got me thinking that there are some similarities to design-by-contract, and that this could possibly be used to explain checklists better. For example, consider the function to take off (written in pseudo Eiffel fragments [3]):

         -- get the throttle position, brake status etc and spool-up engines

can be called whenever; there is no restriction, and this is how it was until an aircraft crash in the 1930s triggered the development of checklists in aviation. So now we have:

         -- get the throttle position, brake status etc and spool-up engines
         require
             checklist_complete = True

and in more modern aircraft this is supplemented by features to specifically check on the aircraft status

         -- get the throttle position, brake status etc and spool-up engines
         require
             checklist_complete = True
         do
             if flaps < 10 then ... end

or even:

         -- get the throttle position, brake status etc and spool-up engines
         require
             checklist_complete = True
             flaps > 10
             mode = GroundMode
         ensure
             mode = FlightMode
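
For readers more at home in a mainstream language, the same contract can be sketched in Python. This is only an illustration of the idea: the Aircraft class, its field names and the PreconditionError exception are all hypothetical, and the flap and mode values are simply those from the fragments above.

```python
from dataclasses import dataclass

class PreconditionError(Exception):
    """Raised when a contract precondition does not hold."""

@dataclass
class Aircraft:
    checklist_complete: bool = False
    flaps: int = 0
    mode: str = "GroundMode"

    def take_off(self) -> None:
        # require: the preconditions taken from the checklist
        if not self.checklist_complete:
            raise PreconditionError("checklist not complete")
        if not self.flaps > 10:
            raise PreconditionError("flaps not set for take-off")
        if self.mode != "GroundMode":
            raise PreconditionError("not in ground mode")
        # do: get the throttle position, brake status etc and spool-up engines
        self.mode = "FlightMode"
        # ensure: the postcondition
        assert self.mode == "FlightMode"

aircraft = Aircraft(checklist_complete=True, flaps=15)
aircraft.take_off()
print(aircraft.mode)  # FlightMode
```

Calling take_off() on a freshly constructed Aircraft() raises PreconditionError, which is the code-level equivalent of the aircraft refusing to proceed.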

What you actually see are specific checks from the checklists being incorporated into the basic protection mechanisms of the aircraft functionality. This is analogous to what we might see in a process, for example below we can see the implementation of functionality to encode a project approval checklist into some approval function:

         require
             securityReview.status = Completed
             privacyReview.status = Completed
             continuityReview.status = Completed
             performanceReview.status = Completed
             architecturalReview.status = Completed

Now we have said nothing about how the particular reviews were actually made or whether the quality of their results was sufficient. This brings us to the next question of the qualitative part of a checklist and deciding what to expose. Here we have three options:

  1. completion
  2. warnings
  3. show stopping preconditions

The first is as explained above; the second and third offer us a choice about how we expose and act upon the information gained through the checklist. Consider a privacy or information content review of a system, where we would hope that some aspects are strictly required while others are just warnings:

         require
             privacyReview.status = Completed
             privacyReview.pciData = False
             privacyReview.healthData = False
         do
             if privacyReview.dataFlowModelComplete = False then warn("Incomplete DFDs!") end

And we can get even more complex and expose more of the checklist contents as necessary.
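
The split between show-stopping preconditions and warnings might be sketched as follows. The PrivacyReview class and approve function are hypothetical names, and the rule that PCI or health data blocks approval is just the one implied by the fragment above.

```python
import warnings
from dataclasses import dataclass

@dataclass
class PrivacyReview:
    status: str = "Pending"
    pci_data: bool = False
    health_data: bool = False
    data_flow_model_complete: bool = False

def approve(review: PrivacyReview) -> bool:
    """Return True only if the show-stopping preconditions hold."""
    # Show-stopping preconditions: any failure blocks approval.
    if review.status != "Completed" or review.pci_data or review.health_data:
        return False
    # Warning only: reported, but does not block approval.
    if not review.data_flow_model_complete:
        warnings.warn("Incomplete DFDs!")
    return True

complete = PrivacyReview(status="Completed", data_flow_model_complete=True)
print(approve(complete))  # True
```

A review with status "Completed" but an incomplete data flow model is still approved, just with a warning, whereas one that reports health data is rejected outright.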

The main point here is that if we draw an analogy with programming, some aspects of checklists can be more easily explained. Firstly the basic checklist maxim is:

All the items on a checklist MUST be checked.

then we should be in a place to make a decision based on the following "procedure"

  1. Are all individual items in their respective parameter boundaries?
  2. Are all the parameters taken as a whole indicating that we are in a state that is considered to be within our definition of "safe" to proceed to the next state?
  3. Final question: Go or No-Go based on what we know from the two questions above?
Of course, we have glossed over some of the practical implementations and cultural aspects such as team work, decision making and cross-referencing, but what we have described is some of the philosophy and implementation of checklists in a programming context that may be more familiar to some.
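
The first two questions of the procedure above lend themselves to a small sketch, with the third deliberately left to a human. Everything here is hypothetical: the Item class, the limits and the "average margin" rule are invented purely to show that items can each be within bounds while the state as a whole is not considered safe.

```python
from dataclasses import dataclass

@dataclass
class Item:
    name: str
    value: float
    low: float
    high: float

    def in_bounds(self) -> bool:
        return self.low <= self.value <= self.high

def assess(items, overall_safe):
    """Answer questions 1 and 2; the go/no-go decision is left to the caller."""
    all_in_bounds = all(item.in_bounds() for item in items)  # question 1
    state_safe = overall_safe(items)                         # question 2
    return all_in_bounds, state_safe

def margins_ok(items):
    # Hypothetical holistic rule: the average distance above the lower limit,
    # as a fraction of each item's range, must be at least 20%.
    margins = [(i.value - i.low) / (i.high - i.low) for i in items]
    return sum(margins) / len(margins) >= 0.2

items = [Item("flaps", 15, 10, 40), Item("fuel_kg", 6000, 5000, 20000)]
print(assess(items, margins_ok))  # (True, False)
```

Both items are inside their own limits, yet taken together the margins are too thin, so the second answer is False; what to do with that information remains the decision maker's call.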


[1] Great Ormond Street Hospital did this according to one BBC (I think) documentary.
[2] Spanair Flight 5022