Wednesday, 11 December 2013

Earth and Moon from Juno

Somewhat stunning (understatment!) video made from Juno's low resolution camera during its Earth fly-by back in October:

Thanks to Phil Plait of Bad Astronomy once again...and just to think that every human that has ever lived existed in and no human has even travelled beyond the frames in that video...

Monday, 9 December 2013

An Academic Survey

I received this morning an email asking to me participate in a survey about "perceptions scientists have of information providers in scientific, technical and medical research fields.."

I'm always a little suspicious of such things but this does seem genuine-ish, if a little badly written; actually I was wondering if they were going to ask me to be a proxy to transfer 50 million USD from an account in the Central African Republic to somewhere on behalf of the Emporer of Botswana?

Do they really have to be so "secretive" in the wording too? I wonder what paper they are referring to, I published quite a few in 2010? What was the "careful selection criteria?"

Here's the mail:

Dear Dr. Oliver ,

We are contacting you because you are the corresponding author on a paper that was published in 2010. We are conducting a survey about the perceptions scientists have of information providers in scientific, technical and medical research fields. As only a carefully selected sample of scientists and practitioners have been chosen for the study, your feedback is very valuable to us.

The survey is being conducted by a scientific, technical and medical publisher who will be revealed at the end of the survey. Under the terms of the Market Research Society Code of Conduct, it will not lead to any sales follow up and no individual (or organization) will be identified. Your results will be kept confidential and used only for research purposes.

We would be grateful if you could spare approximately 15 minutes to complete this survey. The closing date for the survey is 23rd December 2013.

To start the survey, please click ONCE on the link below (or paste it into your browser):

[link redacted]

Please do not reply to this e-mail as the inbox is not monitored. If you are having trouble with this survey you can let us know [redacted]) and we will address any technical problems as quickly as we can.

Thank you very much for your time, we really value your input.

It mentions the "Market Research Society Code of Conduct" and ensures me that my data will only be used for the primary purposes of the survey; no marketing etc...The company sending this is Confirmit, a Norwegian based company that conducts market research - just wonder why they're quoting the MRSC code of conduct presumably given that the MRSC is a UK organisation as far as I can tell.

I'd actually like to know what their privacy policy is, and how they interact with their customers...for example, do they anonymise by hashing unique identifiers? Wwhat is the full set of data that does flow to the customer? What to Confirmit keep and for how long?

Anyway, I contacted MRSC to find out whether this is genuine or not and who is actually using this data.

A few other searches suggests that this might be a request by Elsevier to find out why authors are so upset with them. I guess it isn't a secret that I frequent the category theory lists and there might be one or two there who might not be friends with Elsevier....

So, went to Elsevier's web pages to find some contact details to find out whether they were asking and again whether this is something genuine or some evil spam...clicking on "Help & Contact" gives, well....

I'll let you know how I get on if I get any replies....

Sunday, 24 November 2013

Privacy, Evidence Trails and a Change in Terminology?

One of the main aspects of personal [information] privacy is that much of the topic is that other parties would not collect nor perform any analysis of your data. The trouble is that this argument is often made in isolation, in that it somewhat assumes that the acts we perform by computer exist in a place where we can hide. For example, what someone does behind closed doors usually remains private. But, if that act is made in a public place, say, in the middle of the street by default whatever is done is not private - even if we hoped no-one saw.

Anything and everything we do on the internet is in public by default. When we perform things in public, then other people may or will see, find out and perform their own analysis to form a profile of you.

Many privacy enhancing technologies are akin to standing in the middle of a busy street and shouting "don't look!". Even if everyone looks away, more often than not there is a whole raft of other evidence to show what you've been doing.

Admittedly most of the time nobody really cares nor are actually looking in the first place. Though as it has been found out recently (and this really isn't a surprise) that some such as the NSA and GCHQ are continually watching. Even the advertisers don't really care that much; their main interest is trying to categorise you to ship a generic advertisement - and advertisers are often really easy to game...

If we really do want privacy on the internet then rather than concentrating on how to be private (or pretending that we are), we need to concentrate on how to reduce the evidence trail that we leave. Such evidence is in the form of web logs, search queries, location traces from your navigator, tweets, Facebook postings etc.

Once we have understood what crumbs of evidence is being left, we can start exploring all the side avenues where data flows (leaks) and the points where data can be extracted surreptitiously. We can also examine what data we do want released, or have no choice about.

At this moment, I don't really see a good debate about this, at least not at a technical level though there are some great tools such as Ghostery that assist in this. Certainly there is little discussion at a fundamental level which would really help us define what privacy really is.

I personally tend to take the view at the moment that privacy might even be the wrong term, or at best, somewhat a misleading term.

On the internet every detail of what we do is potentially public and can be used for good as well as evil (whatever those terms actually mean), our job as privacy professionals is to make that journey as safe as possible, hence the use of the term "information safety" to better describe what we do.

Friday, 22 November 2013

Semiotic Analysis of a Privacy Policy

Pretty much every service has a privacy policy attached to it; these policies state the data collection, usage, purpose and expectations that the customer has to agree with before using that said service. But at another level they also attempt to signify (that's going to be a key word) that the consumer can trust the company providing the service at some level. Ok, so there's been huge amounts of press about certain social media and search service provides "abusing" this trust and so on, but we still use the services provided by those companies.

So this gets me thinking, when a privacy policy is written, could we analyse that text to understand better the motives and expectations from both the customer and the service provider perspective? Effectively can we make a semiotic analysis of a privacy policy.

What would we gain by this? It is imperative that any texts of this nature portray the right image to the consumer, thus this can be used in the drafting of such a text to ensure that this this right image is correctly portrayed. For example, the oft seen statement:

"Your privacy is important to us"

is a sign in the semiotic sense, and in this case probably an 'icon' in its near universal usage. Signs are a relationship between the 'object' and 'interpretant', respectively the subject matter at hand and the clarified meaning respectively.

Pierce's Semiotic Trangle
The object may be that we (the writer of the statement) are trying to state a matter of fact about how trust worthy we are, or at least we want to emphasise that we can be trusted.

The interpretant of this, if we are the customer, can of course vary from total trust to utter cynicism. I guess of late the latter interpretation tends to be true. Understanding the variation in interpretants is a clear method for understanding what is being conveyed by the policy itself and whether the right impression is being given to the consumer.

At a very granular level the whole policy itself is a sign and the very existance of that policy and its structure, is it long and full of legalese or short and simple? Then there's the content (as described above) which may or may not depend upon the size of the in the World's Worst Privacy Policy.



Found this paper while researching for this: Philippe Codognet's THE SEMIOTICS OF THE WEB, it starts with a quote:
I am not sure that the web weaved by Persephone in this Orphic tale, cited in exergue of Michel Serres’ La communication , is what we are currently used to call the World Wide Web. Our computer web on the internet is nevertheless akin Persephone’s in its aims : representing and covering the entire universe. Our learned ignorance is conceiving an infinite virtual world whose center is everywhere and circumference nowhere ...
Must admit, I find that very, very nice. Best I've got is getting quotes about existential crises and cosmological structures in a paper about the Semantic Web with Ora Lassila.

Tuesday, 19 November 2013

Losing Situational Awareness in Software Engineering

The cause of many aircraft accidents is attributed to loss of situational awareness. The reasons for such a situation are generally due to high workload, confusing or incorrect data and misinterpretation of data by the pilot.

The common manifestation of this is mode confusion where the pilot's expectation for a given situation (eg: approach) does not match the actual situation. This is often seen (though not exclusively) in highly automated environments, leading to the oft repeated anecdote:

A novice pilot will exclaim, "Hey! What's it doing now?", whereas an experienced pilot will exclaim, "Hey! It's doing that again!"

Aside: while this seems to be applied to Airbus’ FBW models, the above can be traced back to the 
American Airlines Childrenof the Magenta tutorial and specifically refers to the decidedly non-FBW Boeing 757/767 and Airbus A300 models, and not the modern computerized machines… [1]

This obviously isn't restricted to aircraft but also to many other environments; for example, consider the situation in a car when cruise control is deactivated due to an external situation not perceived by the driver. Is the car slowing due to road conditions, braking etc? Consider this in conjunction when you are expecting the car to slow down where the deactivation has the same effect as expected; this type of loss of situational awareness was seen in the Turkish 737 accident at Schipol.

In an auditing environment where we obtain a view horizontally across a company we too suffer from loss of situational awareness. In agile, fast environments with many simultaneous projects, we often fail to see the interactions between those projects.

Yet, understanding these links in non-functional areas such as security and privacy is absolutely critical to a clear and consistent application of policy and decisions.

A complicating factor we have seen is that projects change names, components of projects and data-sets are reused both dynamically and statically leading to identity confusion. Systems reside locally, in the cloud and elsewhere in the Universe, terminology is stretched to illogical extremes: big data and agile being two examples of this.  Simplicity is considered a weakness and complexity a sign of the hero developer and manager. 

In systems with safetycritical properties heroes are a bad thing.

In today's so-called agile, uber-innovative, risk-taking, fail fast, fail often, continuous deployment and development environments we are missing the very basic and I guess old fashioned exercise of communicating rigorously and simply what we're doing and reusing material that already exists.

Fail often, fail fast, keep making the same mistakes and grow that technical debt.

We need to build platforms around shared data, not functionality overlapping, vertical components with the siloed data mentality. This required formal communication and an emphasis on quality, not on the quantity of rehashing buzzwords from the current zeitgeist.

In major construction projects there is always a coordinator whose job it is to ensure that not only do individual experts communicate (eg: the plumbers, the electricians etc) but that their work is complimentary and that one group does not repeat the work or base that another team has put in place.

If software engineers built a house, one team would construct foundations by tearing down the walls that inconveniently stood atop of already built foundations, while another would build windows and doors while digging up the foundation as it is being constructed. A further team charged with installing the plumbing and electrics would first endeavor to invent copper, water and electricity...and all together losing awareness of the overall situation.

OK, I'm being hard on my fellow software engineers, but it is critical that we concentrate more on communication, common goals, less "competition" [2] and ensuring that our fellow software engineers are aware of each other's tasks.

As a final example, we will (and have!) seen situations in big data where analytics is severely compromised because we failed to communicate and decide upon common data models, common catalogs of where and what data exists.

So, like the pilot faced with a situation where he recognises he's losing his situational awareness, drop back down the automation and resort to flying by the old-fashioned methods and raw data.

The next question is, how do you recognise that you're losing situational awareness?

Wednesday, 13 November 2013

What does "unclassified" mean?

Previously we presented a simple security classification system consisting of four levels: secret, confidential, public and unclassified; with the latter case being the most restrictive.

While sounding counter-intuitive, a classification of "unclassified" simply means that no decision has been made. And if no decision on the actual classification has been made then it is possible that in the future that classification might be decided to be "secret". Which implies that if "unclassified" needs to be at least as strong as the highest explicit classification.

Aside: Fans of Gödel might be wondering that how can be talk about classifying something as being unclassified, which in turn is higher in some sense than the highest classification. Simply, "unclassified" is a label which is is attached to everything before the formal process of attaching on of the other labels is made.

Let's start with an example: if Alice writes a document then this document by default is unclassified. Typically that document's handling falls under the responsibility of Alice alone.

If Alice gives that document to Bob, then Bob must handle that document according to Alice's specific instructions.

By implication Alice has chosen a particular security classification. There are two choices:

  1. Either an explicit classification is given, eg: secret, confidential, public
  2. Or, no classification is given an Alice remains the authority for instructions on how to handle that document
In the latter case Alice's instructions may be tighter than the highest, explicit classification, which implies that unclassified is more restricting than, say, secret.

If Bob passes the document to Eve (or to the whole company by a reply-all) then we have a data breach. The document never implicitly becomes public through this means; though over time the document might become public knowledge but still remain officially secret. For example, if an employee of a company leaks future product specifications to the media, even though they are now effectively public, the employee (and others) who handled the data would still fall under whatever repercussions leaking secret or confidential data implies.

Still this is awkward to reconcile, so we need more structure here to understand what unclassified and the other classifications mean.

We must therefore apply to a notion of authority: all documents must have an owner - this is basic document handling. That owner

  1. Either assigns an explicit security classification, and all handlers of that document refer to the standard handling procedures for that security classification: referring to the security classifications standard as the authority
  2. Or, keeps the document as being unclassified and makes themselves the authority for rules on how to handle that document
The latter also comes with the implication that the owner of the document here is also responsible for ensuring that whatever handling rules are implied, these are consistent with the contents of the document. For example, if the document contains sensitive data then in our example, Alice is responsible for ensuring that the rules that come from her authority are as at least as strict as the highest implied security classification.

In summary, if a document or data-set is unclassified then the owner of that document is the authority deciding on what the handling rules are and that by default the rules must be at least* as strict has the highest explicit security category.

*In our classification we have the relationship:

Public < Confidential < Secret

with the statement above saying:

Public < Confidential < Secret <= Unclassified

As a final point, if Alice decides that her rules are weaker than say, confidential, but stronger than public, then it makes sense to take the next highest level as the explicit classification, ie: confidential. This way we establish the policy that all documents must eventually be explicitly classified.

Wednesday, 6 November 2013

Classifying Data Flow Processes

In previous posts (here, here, here and especially here) we presented various aspects for classifying data nominally being sent over data flows or channels between processes. Now we turn to the processes themselves and tie the classification of channels together.

Consider the data flow shown here (from the earlier article on measurement, from where we take the colour scheme of the flows) where we show the movement of, say, location information:

Notice how data flows via a process called "Anonymisation" (whatever that word actually means!) to the advertiser. During the anonymisation the location data is cleaned to a degree that business or our privacy policy allows - such boundaries are very context dependent.

This gives the first type of process, one that reduces the information content.

The other type of processes we see are those that filter data and those that combine or cross-reference data.

Filtering is where data is extracted from an opaque or more complex set of information into something simpler. For example if we feed some content, eg: a picture, then we might have processes that filter out the location data from said pictures.

Cross-referencing means that two or more channels are combined to produce a third channel containing information from the first two. A good example of this is geo-locating IP addresses which takes in as one source an IP address and another a geographical location look-up table. Consider the example below:

which combines data from secondary and primary sources to produce reports for billing, business etc.

In the above case we probably wish to investigate the whole set of processes that are actually taking place and make a considerable amount of decomposition of the process.

When combined with the classifications on the channels, particularly information and security classes we can make some substantial reasoning. For example, if there is a mismatch in information content and/or security classifications then we have problems; similarly if some of these are transported over insecure media.

To summarise, in earlier articles we explained how data itself may be classified, and here how processes may be classified according to a simple scheme:

  • Anonymising
  • Filtering
  • Cross-Referencing

In a later article I'll go more into the decomposition and refinement of channels and processes.

Friday, 1 November 2013

Measurement and Metrics for Information Privacy

We have already discussed an analogy for understanding and implicitly measuring the information content over a data channel. The idea that information is an "infectious agent" is quite a powerful analogy in the sense that it allows us better to understand the consequences of information processing and the distribution of that data better, viz:
  • Any body of data which is containing certain amounts and kinds of sensitive data we can consider to be non-sterile
  • Information which is truly anonymous is sterile
  • Mixing two sets of information produces a single set of new information which is as at least as unclean as the dirtiest set of data mixed, and usually more so!
  • The higher the security classification the dirtier the information
Now we can using our information classification earlier introduced we can further refine our understanding and get some kind of metric over the information content.

Ley us classify information content into seven basic categories: financial, health, location, personal, time, identifiers and content. Just knowing what kinds of data are present as we have already discussed gives us a mechanism to pinpoint where more investigation is required.

We can then go further an pick out particular data flows for further investigation and then map this to some metric of contamination:

For example, transporting location data has a particular set of concerns, enough to make any privacy professional nervous at the least! However if we examine a particular data flow or store we can evaluate what is really happening, for example, transmitting country level data is a lot less invasive than highly accurate latitude and longitude.

Now one might ask why not deal with the accurate data initially? The reasons are that we might not have access to that accurate, field-level data, we might not want to deal with the specifics at a given point in time, specific design decisions might not have been made etc.

Furthermore, for each of the seven categories we can give some "average" weighting and abstract away from specific details which might just complicate any discussion.

Because we have a measure, we can calculate and compare over that measurement. For example, if we have a data channel carrying a number of identifiers (eg: IP, DeviceID, UserID) we can take the maximum of these as being indicative of the sensitivity of the whole channel for that aspect.

We can compare two channels, or two design decisions, for example, a channel carrying an applicationID is less sensitive (or contaminated) than one carrying device identifiers.

We also can construct a vector over the whole channel composed out of the seven dimensionsb above to give a further way of comparing and reasoning about the level of contamination or sensitivity:
| (FIN=0,HLT=0,LOC=10,PER=8,TIM=3,ID=7,CONT=0) | 
| (FIN=3,HLT=2,LOC=4,PER=4,TIM=2,ID=9,CONT=2) |
for some numerical values gien to each category. Arriving at these values will be specific to a given wider context and then the weighting given to each, but there is one measure which can be used to ground all this, and that is of information entropy, or, how identifiying the contents are to a given, unique human being. A great example of this is given at the EFF's Panopticlick pages.

We've only spoken about a single data flow at the moment, however the typical scenario is for reasoning over longer flows, for example, we might have our infrastructure set up as below*

In this example we might look at all the instances where AppID and Location are specified together and use a colour coding such that:
  • Black: unknown/irrelevant
  • Red: high degree of contamination, both AppID and Location unhashed and accurate respectively
  • Yellow: some degree of contamination, AppID may be hashed(+salt?) or Location at city level or better
  • Green: AppID randomised over time, hashed, salted and Location at country level or better
Immediately readable from the above flow are our points of concern which need to be investigated, particular the flows from the analytics processing via the reports storage and to direct marketing. It is also easy to see that there might be a concern with the flow to business usages, what kinds of queries ensure that the flow here is less contaminated than the actual reports storage itself?

There are a number of points we have not yet discussed, such as that some kinds of data can be transformed into a different type. For example some kinds of content such as pictures inherently contain device identifiers, locations etc. Indeed the weighting for some categories such as content might be very much higher than that of identifiers for example - unless the investigation is made. Indeed it does become almost a trivial exercise for some to explicitly hide sensitive information inside opaque data such as images and not declare then when a privacy audit is made.

To summarise, we've a simple mechanism for evaluating information content over a system in quantative terms, complete with a refinement mechanism that allows us to work at many levels of abstraction depending upon situation and context. Indeed, what we are doing is explcitly formalising and externalising our skills when performing such evaluations, and through analogies such as "infection control" providing means for easing the education of professionals outside of the privacy area.

*This is not indicative of any system real, living or dead, but just an example configuration.

Wednesday, 30 October 2013

Diagrams Research

For a number of years I and some colleagues have worked closely with the University of Brighton's Visual Modelling Group using their work on diagrammatic methods of modelling and reasoning. One of the areas where we've had quite a nice success is in modelling aspects of information privacy [1] with some particularly useful and beautiful and natural representations of complex ideas and concepts.

Another area has been in the development of ontologies and classification systems - something quite critical in the area of information management and privacy. Some of this dates back to work we made with the M3 project and the whole idea of SmartSpaces incorporating the best of the Semantic Web, Big Data etc.

We've gained quite a considerable amount of value out of this relatively, simple industrial-academic partnership. A small amount of funding, no major dictatorial project plans but just letting the project and work develop naturally, or even if you like, in an agile manner, produces some excellent, useful and mutually beneficial results.

Indeed not having a project plan but just a clearly defined set of things that we need addresses and solved (or just tackled - many minds with differing points of view really does help!) means that both partners: the industrial and the academic, can get on with the work rather than battling an artificial project plan which becomes increasingly irrelevant and industrial focus and academic ideas change over time. Work continues with more ontology engineering in the OntoED project.

  1. I. Oliver, J. Howse, G. Stapleton. Protecting Privacy: Towards a Visual Framework for Handling End-User Data. IEEE Symposium on Visual Languages and Human-Centric Computing, San Jose, USA, IEEE, September, to appear, 2013.
  2. I. Oliver, J. Howse, G. Stapleton, E. Nuutila, S. Torma. Visualising and Specifying Ontologies using Diagrammatic Logics. In proceedings of 5th Australasian Ontologies Workshop, Melboune, Australia, CRPIT vol. 112, December, pages 37-47, 2009. Awarded Best Paper
  3. J. Howse, S. Schuman, G. Stapleton, I. Oliver. Diagrammatic Formal Specification of a Configuration Control Platform. 2009 Refinement Workshop, pages 87-104, ENTCS, November, 2009.
  4. I. Oliver, J. Howse, G. Stapleton, E. Nuutila, S. Torma. A Proposed Diagrammatic Logic for Ontology Specification and Visualization. 8th International Semantic Web Conference (Posters and Demos), October, 2009.
  5. J. Howse, G. Stapleton, I. Oliver. Visual Reasoning about Ontologies.International Semantic Web Conference, China, November, CEUR volume 658, pages 5-8, 2010.
  6. P. Chapman, G. Stapleton, J. Howse, I. Oliver. Deriving Sound Inference Rules for Concept Diagrams. IEEE Symposium on Visual Languages and Human-Centric Computing, Pittsburgh, USA, IEEE, September, pages 87-94, 2011.
  7. G. Stapleton, J. Howse, P. Chapman, I. Oliver, A. Delaney. What can Concept Diagrams Say? Accepted for 7th International Conference on the Theory and Application of Diagrams 2012, Springer, pages 291-293, 2012.
  8. G. Stapleton, J. Howse, P. Chapman, A. Delaney, J. Burton, I. Oliver.Formalizing Concept Diagrams. 19th International Conference on Distributed Multimedia Systems, International Workshop on Visual Languages and Computing, Knowledge Systems Institute, to appear 2013.

Wednesday, 23 October 2013

Security Classifications

We've introduced information classifiations, provenance, usage and purpose but so far neglected "security" classifications. These are the classic secret, confidential, public classifications so beloved of government organisations, Kafkaesque bureaucracies, James Bond's bosses etc.

Part of the ISO27000 standard for security directly addresses the need for a classification system to mark documents and data sets. How you set up your security classification is left open, though we tend to generally see four categories:
  • Secret
  • Confidential
  • Public
  • Unclassified
At least for everything other than Unclassified, the classifications tend to follow a strict ordering, though some very complex systems have a multitude of sub-classes which can be combined together.

For each level we are required to define what handling and storage procedures apply. For example:

  • May be distributed freely, for examples by placing on public websites, social media etc. Documents and data-sets can be stored unencrypted.

  • Only for distribution within the company. May not be stored on non-company machines or places on any removable media (memory sticks, CD etc). Documents and data-sets can be stored unencrypted unless they contain PII.

  • Only for distribution by a specific denoted set of persons. May not be stored on non-company machines or places on any removable media (memory sticks, CD etc). Documents and data-sets must  be stored encrypted and disposed of according the DoD standards.
Data that is marked unclassified should be treated as property of the author or authors of that document and not distributed. This would make unclassified a level higher than secret in our above classification.

A good maxim here is: Unclassified is the TOP level of security.

Sounds strange? Until document or data-set is formally classified how should it be handled?

Note in the above that we refer to the information classification of any data within a data set to further refine the encryption requirements for classified information. No classification system as described earlier exists alone, though ultimately they all end up being grounded to something in the security classification system. For example we can construct rules such as:
  • Location & Time => Confidential
  • User Provenance => Confidential
  • Operating System Provenance => Public
and applied to our earlier example:

we find that this channel and the handling of the data by the receiving component should conform to our confidential class requirements. A question here is that does a picture, location and time constitute PII?  Rules to the rescue again:
  • User Provenance & ID => PII
  • http or https => ID, Time
So we can infer that we need to protect the channel at least, either by using secure transport (https) or by encrypting the contents.

The observant might notice that ensuring protection of data as we have defined above for some social media services is not possible. This then provides a further constraint and a point to make an informed business decision. In this case anything that ends up at the target of this channel is Public by default, this means that we have to ensure that the source of this information, the user, understands that even though their data is being treated as confidential throughout our part of the system, the end-point does not conform to this enhanced protection. Then it becomes a choice on the user's part of whether they trust the target.

In the previous article about information contamination, we have an excellent example here. We need to consider the social media service as a "sterile" area while our data channel contains "dirty" or "contaminated" information. Someone needs to make the decision that this transfer can take place and in what form - invariably this is the source of the information, that is, the user of the application.

Does this mean that we could then reduce the protection levels on our side? Probably not, at least from the point of view that we wish to retain the user's trust by ensuring that we do our best to protect their data and task the responsible position of informing the user of what happens to their data once it is outside of our control.

Tuesday, 22 October 2013

Modelling Data Flow and Information Channels

Before we delve further into policies and more analysis of our models, I want to first take a small detour and look at the data channels in our models. We earlier explained that we could refine the various classifications on our channels down to fine grained rules, this is one kind of refinement. We can also refine the channel structure itself to make the various "conversations" between components clear.

Firstly, what is a channel? There's the mathematical explanation taken from [1]:
An information channel consists of an indexed family C = { f_i : A_i <-> C} i\in I of infomorphisms with a common codomain C, called the core of the channel.
Phew! Or a more everyday description that an information channel is the conversation between two elements such as persons, system components, applications, servers etc.

We also note that conversations tend to be directed in the direction that the information flows. We generally don't model the ack/nack type protocol communications.

Starting with our model from earlier:

We should and can refine this to explicitly distinguish the particular conversations that occur between the application and the back end.

While the two components communicate all this information, maybe even over the same implementation, we might wish to explicitly distinguish between the primary and secondary information flows. The reason could be due to differing consent mechanisms, or differing processing on the receiving end etc.

In the above we are explicitly denoting two different conversations. These conversations are logically separate and from the information usage point of view.

As we decompose our channels into the constituent, logically separate conversations we are also making decisions about how the system should keep apart the data transported over those conversations. Whether this translates into physical separation or however the logical differentiation is made is an architectural issue modelled and decided elsewhere.

As we shall see later when we decompose the processing elements in our data flow model we can track the flows and where those flows meet, diverge, cross-reference and infer points of possible contamination.


[1] Barwise and Seligman. (1997) Information Flow. Cambridge University Press.

Monday, 21 October 2013

Information as an Infectious Agent

Operating theatres are split into two parts:

  • the sterile field
  • the non-sterile surroundings

Any non-sterile item entering the sterile field renders it non-sterile; and stringent efforts and protocols [1,2] are made to ensure that this does not happen.

The protocols above extend via simply analogy [3,4] to information handling and information privacy.

  • Any body of data which is containing certain amounts and kinds of sensitive data we can consider to be non-sterile - assume for a moment that certain bits and bytes are infectious (great analogy!).
  • Everyone working with information is required to remain sterile and uncontaminated.
  • Information which is truly anonymous is sterile
  • Mixing two sets of information produces a single set of new information which is as at least as unclean as the dirtiest set of data mixed, and usually more so!
  • The higher the security classification the dirtier the information

We can extend this latter point to specific information types, eg: location, personal data, or certain kinds of usages and purposes, eg: data for advertising or secondary data and so on.

Extending our analogy further we can protect the sterile field in two ways:

  • ensuring that everyone in contact with the sterile field is sterile
  • ensuring that the equipment entering the sterile field is sterile

  • If two sets of data are to be mixed then ensure that the mixing occurs not in-situ but by generating a third data set kept separate from the two input sets
  • Data can be made more sterile by removing information content. But, be warned that certain kinds of obfuscation are not effective, eg: hashing or encryption of fields might just hide the information content of that field but not the information content of the whole data set [3]
  • Keep sterile and non-sterile data-sets apart, physically if possible
  • Ensure that sterile and non-sterile data-sets have differing access permissions. Ideally different sets of people with access
  • Clean up after yourself: secure data deletion, overwriting of memory, cache purges etc.

From a personnel point, in surgery precautions are made through restricting the persons inside the sterile field and even outside of this, basic precautions are taken in terms of protective clothing etc. While surgical attire might be overkill for office environments, the analogy here is that personnel with access to data have received the correct training and are aware of what data they can and can not use for various purposes.

In a surgical environment, everything entering and leaving the sterile field is checked and recorded. In an information systems environment this means logging of access so that when a breach of the sterile field occurs the route of the pathogen and its nature can be effectively tracked and cleaned.


[1] Infection Control Today - August 1, 2003 : Guidelines for Maintaining the Sterile Field
[2] Infection Control Today - November 7, 2006 - Best Practices in Maintaining the Sterile Field

Sunday, 20 October 2013

Data Aspects and Rules

In the previous post we introduced how to annotate data flows in order to understand better what data was being transported and how. In this post I will introduce further classifications expressed as aspects.

We already have transport and information class as a start; the further classifications we will introduce are:
  • Purpose
  • Usage
  • Provenance
Purpose is relatively straightforward, and consists of two classes: Primary and Secondary. These are defined in this previous posting.

Usage is remarkably hard to define and the categories tend to be quite context specific, though patterns do emerge. The base set of categories I tend to use are:
  • system provisioning - the data is being used to facilitate the running and management of the system providing the service, eg: logging, system administration etc.
  • service provisioning - the data is being used to facilitate the service itself; this means the data is necessary for the basic functionality of that service, or primary data.
  • advertising - the data is being used for advertising (tageted or otherwise), by the service provider or third party
  • marketing - the data is being used for direct marketing back to the source of the data
  • profiling - the data is being used to construct a profile of the user/consumer/customer. It might be useful in some cases to denote a subtype of this - CRM - to explicitly differentiate between "marketing"  and "internal business" profiling.
Some of the above tend to occur often together, for example, data for service provisioning is often also used for advertising and/or marketing too.

Provenance denotes the source of the information and is typically readable from the data-flow model itself. There does exist a proposed standard for provenance as defined by the W3C Provenance Working Group. It is however useful to denote for completeness purposes whether data has been collected from the consumer, generated through analytics over a set of data, from a library source etc.

We could enhance our earlier model thus:

As you can see, this starts to be quite cumbersome and the granularity is quite large. Though from the above we can already start to see some privacy issues arise.

The above granularity however is perfectly fine for a first model but to continue we do need to refine the model somewhat to better explain what is really happening. We can construct rules of the form:
  • "Info Class" for "Purpose" purpose used for "Usage"
for example taken from the above model:
  • Picture for Primary purpose used for Service Provisioning
  • Location for Primary purpose used for Service Provisioning
  • Time for Primary purpose used for Service Provisioning
  • Device Address for Secondary purpose used for System Provisioning
  • Location for Primary purpose used for Advertising
  • Location for Primary purpose used for Profiling
  • ...
and so on until we have exhausted all the combination we have, wish or require in our system. Note that some data comes from knowledge of our transport mechanism, in this case a device address (probably IP) from the use of http/s.

These rules now give us a fine grained understanding of what data is being used for what. In the above case, the flow to a social media provider, we might wish to query whether there are issues arising from the supply of location, especially as we might surmise that it is being used for profiling and advertising for example.

For each rule identified we are required to ask whether the source of that data in that particular data flow agrees to and understands where the flow goes, what data is transported and for what purposes; and then finally whether this is "correct" in terms of what we are ultimately promising to the consumer and according to law.

In later articles we will explore this analysis more formally and start also investigating security requirements, country requirements and higher level policy requirements such as safe harbour, PCI, SOX etc.

Sunday, 13 October 2013

Classifying Information and Data Flows

In the previous articles on data flow patterns and basic analysis of a data flow model we introduced a number of classifications and annotations to our model. Here we will explain two of these briefly:
  1. Data Flow Annotations
  2. Information Classification
Let's examine this particular data flow from our earlier example:

The first thing to notice is the data-flow annotation in angled brackets (mimicking the UML's stereotype notation) denoting the protocol or implementation used. It is fairly easy to come up with a comprehensive list of these, for example as a useful minimum set might be:
  • internal - meaning some API call over a system bus of some kind
  • http - using the HTTP protocol, eg: a REST call or similar 
  • https - using the HTTPS protocol
  • email - using email
and if necessary these can be combined to denote multiple protocols or possible future design decisions. Here I've written http/s as a shorthand.

Knowing this sets the bounds on the security of the connection, what logging might be taking place at the receiving end and also what kinds of data might be provided by the infrastructure, eg: IP addresses.

* * *

The second classification system we use is to denote what kinds of information are being carried over each data-flow. Again a simple classification structure can be constructed, for example, a minimal set might be:
  • Personal - information such as  home addresses, names, email, demographic data
  • Identifier - user identifiers, device identifiers, app IDs, session identifiers, IP or MAC addresses
  • Time - time points
  • Location - location information of any granularity, typically lat, long as supplied by GPS
  • Content - 'opaque' data such as text, pictures etc
Other classes such as Financial and Health might also be relevant in some systems.

Each of the above should be subclassed as necessary to represent specific kinds of data, for example, we have used the class Picture. The Personal and Identifier categories are quite rich in this respect.

Using high-level categories such as these affords us simplicity and avoids arguments about certain kinds of edge cases as might be seen with some kinds of identifiers. For example, using a hashed or so-called 'anonymous' identifier is still something within the Identifier class, just as much as an IMEI or IP address is. 

Note that we do no explicitly define what PII (personally identifiable information) is, but leave this as something to be inferred from the combination of information being carried both over and by the data flow in question.

* * *

Now that we have the information content and transport mechanisms made we can reason against constraints, risks and threats on our system, such as whether an unencrypted transport such as HTTP is suitable for carrying, in this case, location, time and the picture content; or would a secure connection be better? Then there is also the question of whether encrypting the contents and using HTTP?

We might have the specific requirements:
  • Data-flows containing Location must be over secured connection
  • Secured connections use either encrypted content or a secure protocol such as HTTPS of SFTP.
and translate these into requirements on our above system such as
  • The flow to any social media system must be over HTTPS
Some requirements and constraints might be very general, for example
  • Information of the Identifier class must be sent over secured connection
While the actual identifier itself might be a short-lived, randomly generated number with very little 'identifiability' (to a unique person), the above constraint might be too strong. Each retrenchment such as this can then be specifically evaluated for the additional introduced risk.

* * *

We have shown here is that by simple annotation of the data flow model according to a number of categories we can reason about what information the system is sending, to whom and how. This is the bare minimum for a reasonable privacy evaluation of any system.

Indeed even with the two above categories we can already construct a reasonably sophisticated and rigorous mapping and reasoning against our requirements and general system constraints. We can even as we briefly touched upon start some deeper analysis of specific risks introduced through retrenchments to these rules.

* * * 

The order in which things are classified is not necessarily important - we leave that to the development processes already in place. Having a model provides us with unambiguous information about the decisions made over various parts of the system - applying the inferences from these is the critical lesson to be taken into consideration.

We have other classifications still to discuss, such as security classifications (secret, confidential, public etc), provenance, usage, purpose, authentication mechanisms - these will be presented in forthcoming articles in more detail.

Constructing these classification systems might appear to be hard word; certainly it takes some effort to implement and ensure that they are active employed, but security standards such as ISO27000 do require this.

Thursday, 10 October 2013

Analysing Data Flow Models

In the previous post we introduced a pattern for the data-flows in and out of an application such as those found on your mobile phone or tablet. In this posting I want to expand on this pattern and explore various annotations to help us reason about how our application treats the information flowing through it.

Let's first introduce an example application, a photo sharing application. This app allows you to select a photo on your device and upload it to a social media account. Here's what it looks like to the user on a mobile device with rounded corners (as opposed to square corners which none of them seem to have):

It looks innocent enough but as we know there are many threats to user privacy even in the most innocent looking of places. So let's model what is really happening behind the scenes. We understand so far a number of things: the user supplies content and credentials for the services, these are stored locally for convenience, the app adds meta-data to the picture before uploading and sends information about the app's behaviour to the inventor of the app. We might then construct the following model:

On each of the dataflows we have noted the kind of information transported over those channels and the mechanism of communication. We also note our presumed trust boundary.

What this shows clearly is where data is flowing and by what means. We have for now skipped over the specific or precise meanings of some things hoping that the terms we have used are self-explanatory.

But, now we have formally written we can focus the discussion on specific aspects of the application, for example:
  • What mechanisms are being used to store the user ID and password in the "operating system"? Is this storage secure and sandboxed? I.e. how do we increase the area of trust boundary?
  • Are the communication mechanisms from the app to the social media and inventor appropriate?
  • What infrastructure information is implicitly included over these channels, for example, IP addresses, user agent strings [1] etc?
  • Does the app have authorisation to the various channels?
  • What is the granularity of the Location data over the various channels?
  • What information and channels are considered primary and which secondary?
  • Is the information flowing to the inventor appropriate and what is the content?
  • What about the EXIF data embedded in the picture?
Generating a formal list of questions from the above is relatively easy and this is exactly how we proceed.

The next questions which follow are related to how we can reduce the amount of information without compromising the application functionality and business model? For example:

  • can we reduce the granularity of the Location sent to the social media systems to, say, city level or country level?
  • can we automatically remove EXIF data?
  • do we allow the app to work if the operating system's location service is switched off or the user decides not to use this?

And so on...Finally  we get down to the consent related questions

  • What does the user see upon first-time usage of the app? What do they have to agree to?
  • Do we tell the user what underlying services such as GPS we're using as part of the application
  • Secondary data collection opt-out
  • For what reason is the data being collected over both primary and secondary channels
And so on again.

What we have done is set the scene, or, circumscribed what we need to investigate and decide upon. Indeed at some level we even have the means to measure the information content of the application and even the extent the consents by implication; and if we can measure then we have a formal mechanism to decide whether one design is "better" than another in terms of privacy.

In the following articles I'll discuss more about the classification mechanisms (information, security, usage, purpose, provenance) and other annotations along with the detailed implications of these.


[1] User agent strings are very interesting...just ask Panopticlick.

Monday, 7 October 2013

Anatomy of an Application's Dataflows

To evaluate privacy in the context of an application we must understand how the information flows between the user, the application, the external services the application uses and any underlying infrastructure or operating system services.

We can construct a simple pattern* to describe this:

Obviously the User is the primary actor in all of this, so that becomes the starting point for the collection of data, which then flows via the application itself in and out of the operating system and towards whatever back-end services, either provided for the application specifically or via some 3rd party, the application requires.

Note that in the above we define a trust boundary (red, dashed line) around the application - this denotes the area inside of which the user has control over their data and confidence that the data remains "safe".

Each data-flow can be, or must be, controllable by the user through some consent mechanism: this might be presentation of a consent text with opt-in/out or a simple "accept this or don't continue installing the application"-type consent.

We then consider the six data-flows and their protection mechanisms:

Data Flow "U"  (User -> Application)
  • This ultimately is the user's decision over what information to provide the application, and even whether the user installs or even runs the application in the first place. If anything then ensuring that the information collected here is relevant and necessary to the application's experience. 
  • Understanding the totality of data collected including that from additional sources and internal cross-referencing is critical to understanding this data-flow in its fullest context.
Data Flow "P" (Application -> Back-end Services)
  • This is the primary flow of data - that is the data which the application requires to function. 
  • The data here will likely be an extension of the data supplied by the user; for example, if the user uploads a picture, then the application may extend this with location data, timestamps etc.
  • The control here is typically embedded in the consent that the user agrees to when using the application for the first time. These consents however are often extended over other data flows too which makes it harder for the user to properly control this data flow
  • For some applications this data flow has to exist for applications to function.
Data Flow "S" (Application -> Back-end Services)
  • This is the secondary flow of data, that is data about the application's operations.
  • The control over this flow is typically embedded in the first time usage consent as data flow "P", but the option to opt-in/out has to be given specifically for this data collection, along with the usage of this data.
  • The implementation of this control may be application specific or centralised/federated over the underlying platform.
  • The data collected over here is not just from the application itself but may also include some data collected for primary means as well as any extended data collected from the infrastructure.
Data Flow  "3" (Application -> 3rd Parties)
  • Primarily we mean additional support functions, eg: federated login, library services such as maps and so on.
  • This data flow need to be specifically analysed in the context in which it is being used but would generally fall under the same consents and constraints as data flow "P".
Data Flows "O_in" and "O_out" (Application <-> O/S, Infrastructure)
  • The underlying platform, frameworks and/or operating system provide may services such as obtaining a mobile device's current location or other probe status, services such as local storage etc.
  • Usage of these services needs to be informed to the user and controlled in both directions, especially when contextual data from the application is supplied over data flow "O_in", eg: storage of data that might become generally available to other applications on the device
  • Collection of data over "O_out" may not be possible to control, but minimisation is always required due to the possibilities that data collected over "O_out" is forwarded in some forward over the data flows "P", "S" and "3".
  • Usually the underlying libraries and functionality of the platform are provided in the application's description before installation, eg: this application uses location services; though rarely is it ever explained why.
Any data-flow which crosses the trust boundary (red, dashed line) must be controllable from the user's perspective so that the user has a choice of what data leaves their control. Depending upon the platform and type of application this boundary may be wholly or partially inside the actually application process itself - care must be taken to ensure that this boundary is as wide as possible to ensure that the user does have trust in how that application handles their data.

The implementation of the control points on each of the data flows as has been noted, may be application specific or centralised across all applications. How the control is presented is primarily a user-interface manner and what controls and the granularity of those controls a user-experience manner.

The general pattern here is for each data-flow that crossed the trust boundary, a control point must be provided in some form. At no point should the user ever have to actually run the application or be in a state where information has to be sent over those data-flows without the control point being explicitly set.

So this constitutes the pattern for application interaction and data-flow; specific cases may have more or less specific data-flows as necessary.

Additional Material:

* There's a very good collection of patterns here at Privacy Patterns, though I've rarely seen patterns targeted towards the software engineer and in the GOF style, which is something we really do need in privacy! Certainly the patterns described at Privacy Patterns can be applied internally to the data-flow pattern given here - then we start approaching what we really do need in privacy engineering!