Wednesday, 30 October 2013

Diagrams Research

For a number of years, some colleagues and I have worked closely with the University of Brighton's Visual Modelling Group, using their work on diagrammatic methods of modelling and reasoning. One of the areas where we've had quite a nice success is in modelling aspects of information privacy [1], with some particularly useful, beautiful and natural representations of complex ideas and concepts.

Another area has been in the development of ontologies and classification systems - something quite critical in the area of information management and privacy. Some of this dates back to work we did with the M3 project and the whole idea of SmartSpaces incorporating the best of the Semantic Web, Big Data etc.

We've gained a considerable amount of value out of this relatively simple industrial-academic partnership. A small amount of funding and no major dictatorial project plans, just letting the project and work develop naturally (or even, if you like, in an agile manner), produces some excellent, useful and mutually beneficial results.

Indeed, not having a project plan but just a clearly defined set of things that we need addressed and solved (or just tackled - many minds with differing points of view really do help!) means that both partners, the industrial and the academic, can get on with the work rather than battling an artificial project plan which becomes increasingly irrelevant as industrial focus and academic ideas change over time. Work continues with more ontology engineering in the OntoED project.

  1. I. Oliver, J. Howse, G. Stapleton. Protecting Privacy: Towards a Visual Framework for Handling End-User Data. IEEE Symposium on Visual Languages and Human-Centric Computing, San Jose, USA, IEEE, September, to appear, 2013.
  2. I. Oliver, J. Howse, G. Stapleton, E. Nuutila, S. Torma. Visualising and Specifying Ontologies using Diagrammatic Logics. In proceedings of 5th Australasian Ontologies Workshop, Melbourne, Australia, CRPIT vol. 112, December, pages 37-47, 2009. Awarded Best Paper.
  3. J. Howse, S. Schuman, G. Stapleton, I. Oliver. Diagrammatic Formal Specification of a Configuration Control Platform. 2009 Refinement Workshop, pages 87-104, ENTCS, November, 2009.
  4. I. Oliver, J. Howse, G. Stapleton, E. Nuutila, S. Torma. A Proposed Diagrammatic Logic for Ontology Specification and Visualization. 8th International Semantic Web Conference (Posters and Demos), October, 2009.
  5. J. Howse, G. Stapleton, I. Oliver. Visual Reasoning about Ontologies. International Semantic Web Conference, China, November, CEUR volume 658, pages 5-8, 2010.
  6. P. Chapman, G. Stapleton, J. Howse, I. Oliver. Deriving Sound Inference Rules for Concept Diagrams. IEEE Symposium on Visual Languages and Human-Centric Computing, Pittsburgh, USA, IEEE, September, pages 87-94, 2011.
  7. G. Stapleton, J. Howse, P. Chapman, I. Oliver, A. Delaney. What can Concept Diagrams Say? Accepted for 7th International Conference on the Theory and Application of Diagrams 2012, Springer, pages 291-293, 2012.
  8. G. Stapleton, J. Howse, P. Chapman, A. Delaney, J. Burton, I. Oliver. Formalizing Concept Diagrams. 19th International Conference on Distributed Multimedia Systems, International Workshop on Visual Languages and Computing, Knowledge Systems Institute, to appear 2013.

Wednesday, 23 October 2013

Security Classifications

We've introduced information classifications, provenance, usage and purpose but so far neglected "security" classifications. These are the classic secret, confidential, public classifications so beloved of government organisations, Kafkaesque bureaucracies, James Bond's bosses etc.

Part of the ISO27000 standard for security directly addresses the need for a classification system to mark documents and data sets. How you set up your security classification is left open, though we tend to generally see four categories:
  • Secret
  • Confidential
  • Public
  • Unclassified
At least for everything other than Unclassified, the classifications tend to follow a strict ordering, though some very complex systems have a multitude of sub-classes which can be combined together.

For each level we are required to define what handling and storage procedures apply. For example:

  • Public: May be distributed freely, for example by placing on public websites, social media etc. Documents and data-sets can be stored unencrypted.

  • Confidential: Only for distribution within the company. May not be stored on non-company machines or placed on any removable media (memory sticks, CD etc). Documents and data-sets can be stored unencrypted unless they contain PII.

  • Secret: Only for distribution by a specific, denoted set of persons. May not be stored on non-company machines or placed on any removable media (memory sticks, CD etc). Documents and data-sets must be stored encrypted and disposed of according to DoD standards.
Data that is marked unclassified should be treated as property of the author or authors of that document and not distributed. This would make unclassified a level higher than secret in our above classification.

A good maxim here is: Unclassified is the TOP level of security.

Sounds strange? Until a document or data-set is formally classified, how should it be handled?
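To make the ordering concrete, here is a minimal sketch. The names `SecurityClass`, `effective_class` and `may_be_distributed_freely` are my own illustrative assumptions, not part of ISO27000 or any other standard; the only point is that anything without a formal marking is handled at the most restrictive level.

```python
from enum import IntEnum
from typing import Optional


class SecurityClass(IntEnum):
    """Hypothetical ordering: a higher value means stricter handling."""
    PUBLIC = 1
    CONFIDENTIAL = 2
    SECRET = 3
    UNCLASSIFIED = 4  # not yet classified, so treated as the top level


def effective_class(marking: Optional[SecurityClass]) -> SecurityClass:
    """Return the class to apply; a missing marking counts as UNCLASSIFIED."""
    return SecurityClass.UNCLASSIFIED if marking is None else marking


def may_be_distributed_freely(marking: Optional[SecurityClass]) -> bool:
    """Only material formally marked PUBLIC may be distributed freely."""
    return effective_class(marking) == SecurityClass.PUBLIC
```

With this ordering, a document that nobody has yet looked at fails the distribution check, which is exactly the behaviour the maxim asks for.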

Note in the above that we refer to the information classification of any data within a data set to further refine the encryption requirements for classified information. No classification system as described earlier exists alone, though ultimately they all end up being grounded to something in the security classification system. For example we can construct rules such as:
  • Location & Time => Confidential
  • User Provenance => Confidential
  • Operating System Provenance => Public
and applied to our earlier example:

we find that this channel and the handling of the data by the receiving component should conform to our confidential class requirements. A question here is whether a picture, location and time constitute PII. Rules to the rescue again:
  • User Provenance & ID => PII
  • http or https => ID, Time
So we can infer that we need to protect the channel at least, either by using secure transport (https) or by encrypting the contents.
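The rules above can be applied mechanically. The following is a minimal sketch of that inference, assuming each rule is a set of premises and a set of conclusions, and that the flow in question carries a picture, location and time from the user over http; the `infer` function and the exact fact names are illustrative, not a fixed vocabulary.

```python
from typing import FrozenSet, List, Set, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str]]  # (premises, conclusions)

# The rules from the text, written as premises => conclusions.
CLASSIFICATION_RULES: List[Rule] = [
    (frozenset({"Location", "Time"}), frozenset({"Confidential"})),
    (frozenset({"User Provenance"}), frozenset({"Confidential"})),
    (frozenset({"Operating System Provenance"}), frozenset({"Public"})),
    (frozenset({"User Provenance", "ID"}), frozenset({"PII"})),
    (frozenset({"http"}), frozenset({"ID", "Time"})),
    (frozenset({"https"}), frozenset({"ID", "Time"})),
]


def infer(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Apply the rules repeatedly until no new labels are derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusions in rules:
            if premises <= derived and not conclusions <= derived:
                derived |= conclusions
                changed = True
    return derived


# Assumed annotations for the flow to the social media service.
flow = {"Picture", "Location", "Time", "User Provenance", "http"}
labels = infer(flow, CLASSIFICATION_RULES)
print(sorted(labels - flow))
```

The transport rule contributes an ID (for example an IP address), which combined with user provenance yields PII, so the derived labels include both Confidential and PII.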

The observant might notice that ensuring protection of data as we have defined above for some social media services is not possible. This then provides a further constraint and a point to make an informed business decision. In this case anything that ends up at the target of this channel is Public by default. This means that we have to ensure that the source of this information, the user, understands that even though their data is being treated as confidential throughout our part of the system, the end-point does not conform to this enhanced protection. Then it becomes a choice on the user's part of whether they trust the target.

We have here an excellent example of the information contamination discussed in the previous article. We need to consider the social media service as a "sterile" area while our data channel contains "dirty" or "contaminated" information. Someone needs to make the decision that this transfer can take place and in what form - invariably this is the source of the information, that is, the user of the application.

Does this mean that we could then reduce the protection levels on our side? Probably not, at least from the point of view that we wish to retain the user's trust by ensuring that we do our best to protect their data and take the responsible position of informing the user of what happens to their data once it is outside of our control.

Tuesday, 22 October 2013

Modelling Data Flow and Information Channels

Before we delve further into policies and more analysis of our models, I want to first take a small detour and look at the data channels in our models. We earlier explained that we could refine the various classifications on our channels down to fine grained rules; this is one kind of refinement. We can also refine the channel structure itself to make the various "conversations" between components clear.

Firstly, what is a channel? There's the mathematical explanation taken from [1]:
An information channel consists of an indexed family $\mathcal{C} = \{ f_i : A_i \rightleftarrows C \}_{i \in I}$ of infomorphisms with a common codomain $C$, called the core of the channel.
Phew! Or a more everyday description that an information channel is the conversation between two elements such as persons, system components, applications, servers etc.

We also note that conversations tend to be directed in the direction that the information flows. We generally don't model the ack/nack type protocol communications.

Starting with our model from earlier:

We should and can refine this to explicitly distinguish the particular conversations that occur between the application and the back end.

While the two components communicate all this information, maybe even over the same implementation, we might wish to explicitly distinguish between the primary and secondary information flows. The reason could be due to differing consent mechanisms, or differing processing on the receiving end etc.

In the above we are explicitly denoting two different conversations. These conversations are logically separate from the information usage point of view.

As we decompose our channels into the constituent, logically separate conversations we are also making decisions about how the system should keep apart the data transported over those conversations. Whether this translates into physical separation or however the logical differentiation is made is an architectural issue modelled and decided elsewhere.
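As a rough illustration of this refinement, a channel can be represented as a container of logically separate conversations. This is only a sketch: the `Channel` and `Conversation` classes, and the particular contents of the secondary conversation, are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Conversation:
    """One logically separate, directed conversation within a channel."""
    purpose: str                   # "primary" or "secondary"
    info_classes: Tuple[str, ...]  # what is carried, e.g. ("Picture",)


@dataclass
class Channel:
    """A directed channel between two components, refined into conversations."""
    source: str
    target: str
    conversations: List[Conversation] = field(default_factory=list)

    def by_purpose(self, purpose: str) -> List[Conversation]:
        """Return only the conversations with the given purpose."""
        return [c for c in self.conversations if c.purpose == purpose]


# Hypothetical refinement of the application -> back end channel.
app_to_backend = Channel(
    source="Application",
    target="Back end",
    conversations=[
        Conversation("primary", ("Picture", "Location", "Time")),
        Conversation("secondary", ("Device Address",)),
    ],
)
```

Nothing here says how the separation is implemented; the model only records that the two conversations must be distinguishable, leaving the physical design to the architecture.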

As we shall see later, when we decompose the processing elements in our data flow model, we can track the flows and where those flows meet, diverge and cross-reference, and infer points of possible contamination.


[1] Barwise and Seligman. (1997) Information Flow. Cambridge University Press.

Monday, 21 October 2013

Information as an Infectious Agent

Operating theatres are split into two parts:

  • the sterile field
  • the non-sterile surroundings

Any non-sterile item entering the sterile field renders it non-sterile; and stringent efforts and protocols [1,2] are made to ensure that this does not happen.

The protocols above extend via simple analogy [3,4] to information handling and information privacy.

  • Any body of data containing certain amounts and kinds of sensitive data we can consider to be non-sterile - assume for a moment that certain bits and bytes are infectious (great analogy!).
  • Everyone working with information is required to remain sterile and uncontaminated.
  • Information which is truly anonymous is sterile
  • Mixing two sets of information produces a single set of new information which is at least as unclean as the dirtiest set of data mixed, and usually more so!
  • The higher the security classification the dirtier the information

We can extend this latter point to specific information types, eg: location, personal data, or certain kinds of usages and purposes, eg: data for advertising or secondary data and so on.
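The mixing rule can be sketched in a few lines. I assume here that "dirtiness" is a simple integer level (0 meaning sterile, that is truly anonymous), which is a simplification; the `DataSet` class and `mix` function are illustrative names only.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class DataSet:
    """A data set with an assumed integer dirtiness level (0 = sterile)."""
    name: str
    fields: FrozenSet[str]
    dirtiness: int


def mix(a: DataSet, b: DataSet, name: str) -> DataSet:
    """Combine two data sets into a new, third data set.

    The result is at least as dirty as the dirtiest input. This is only a
    lower bound: the combination may well deserve a higher level.
    """
    return DataSet(name, a.fields | b.fields, max(a.dirtiness, b.dirtiness))


# Hypothetical inputs: anonymous statistics and a set of location traces.
stats = DataSet("stats", frozenset({"Time"}), dirtiness=0)
traces = DataSet("traces", frozenset({"Location", "Identifier"}), dirtiness=2)
joined = mix(stats, traces, "joined")
```

Because the inputs are immutable and `mix` returns a new object, the sterile set is never contaminated in place.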

Extending our analogy further we can protect the sterile field in two ways:

  • ensuring that everyone in contact with the sterile field is sterile
  • ensuring that the equipment entering the sterile field is sterile

  • If two sets of data are to be mixed then ensure that the mixing occurs not in-situ but by generating a third data set kept separate from the two input sets
  • Data can be made more sterile by removing information content. But, be warned that certain kinds of obfuscation are not effective, eg: hashing or encryption of fields might just hide the information content of that field but not the information content of the whole data set [3]
  • Keep sterile and non-sterile data-sets apart, physically if possible
  • Ensure that sterile and non-sterile data-sets have differing access permissions. Ideally different sets of people with access
  • Clean up after yourself: secure data deletion, overwriting of memory, cache purges etc.
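The warning about obfuscation is easy to demonstrate. In this sketch (the records and the `pseudonymise` helper are made up for illustration) the user identifier is hashed, yet the hashed value still links records together, so the information content of the whole data set is largely intact.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple


def pseudonymise(user_id: str) -> str:
    """Replace an identifier by a truncated SHA-256 digest of it."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()[:12]


# Hypothetical records of (user, place visited).
records: List[Tuple[str, str]] = [
    ("alice", "home"),
    ("bob", "office"),
    ("alice", "clinic"),
]

# "Obfuscate" the identifier field only.
hashed = [(pseudonymise(user), place) for user, place in records]

# The hash is stable, so the per-person trace can still be reassembled.
traces: Dict[str, List[str]] = defaultdict(list)
for pseudonym, place in hashed:
    traces[pseudonym].append(place)
```

The name is hidden, but anyone holding the data set can still see that one person visited both "home" and "clinic", which may be all that is needed to re-identify them.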

From a personnel point of view, in surgery precautions are taken by restricting the persons inside the sterile field, and even outside of this basic precautions are taken in terms of protective clothing etc. While surgical attire might be overkill for office environments, the analogy here is that personnel with access to data have received the correct training and are aware of what data they can and can not use for various purposes.

In a surgical environment, everything entering and leaving the sterile field is checked and recorded. In an information systems environment this means logging of access so that when a breach of the sterile field occurs the route of the pathogen and its nature can be effectively tracked and cleaned.


[1] Infection Control Today - August 1, 2003 : Guidelines for Maintaining the Sterile Field
[2] Infection Control Today - November 7, 2006 - Best Practices in Maintaining the Sterile Field

Sunday, 20 October 2013

Data Aspects and Rules

In the previous post we introduced how to annotate data flows in order to understand better what data was being transported and how. In this post I will introduce further classifications expressed as aspects.

We already have transport and information class as a start; the further classifications we will introduce are:
  • Purpose
  • Usage
  • Provenance
Purpose is relatively straightforward, and consists of two classes: Primary and Secondary. These are defined in this previous posting.

Usage is remarkably hard to define and the categories tend to be quite context specific, though patterns do emerge. The base set of categories I tend to use are:
  • system provisioning - the data is being used to facilitate the running and management of the system providing the service, eg: logging, system administration etc.
  • service provisioning - the data is being used to facilitate the service itself; this means the data is necessary for the basic functionality of that service, or primary data.
  • advertising - the data is being used for advertising (targeted or otherwise), by the service provider or third party
  • marketing - the data is being used for direct marketing back to the source of the data
  • profiling - the data is being used to construct a profile of the user/consumer/customer. It might be useful in some cases to denote a subtype of this - CRM - to explicitly differentiate between "marketing"  and "internal business" profiling.
Some of the above tend to occur often together, for example, data for service provisioning is often also used for advertising and/or marketing too.

Provenance denotes the source of the information and is typically readable from the data-flow model itself. There does exist a proposed standard for provenance as defined by the W3C Provenance Working Group. It is however useful to denote for completeness purposes whether data has been collected from the consumer, generated through analytics over a set of data, from a library source etc.

We could enhance our earlier model thus:

As you can see, this starts to be quite cumbersome and the granularity is quite large. Though from the above we can already start to see some privacy issues arise.

The above granularity however is perfectly fine for a first model but to continue we do need to refine the model somewhat to better explain what is really happening. We can construct rules of the form:
  • "Info Class" for "Purpose" purpose used for "Usage"
for example taken from the above model:
  • Picture for Primary purpose used for Service Provisioning
  • Location for Primary purpose used for Service Provisioning
  • Time for Primary purpose used for Service Provisioning
  • Device Address for Secondary purpose used for System Provisioning
  • Location for Primary purpose used for Advertising
  • Location for Primary purpose used for Profiling
  • ...
and so on until we have exhausted all the combinations we have, wish or require in our system. Note that some data comes from knowledge of our transport mechanism, in this case a device address (probably IP) from the use of http/s.

These rules now give us a fine grained understanding of what data is being used for what. In the above case, the flow to a social media provider, we might wish to query whether there are issues arising from the supply of location, especially as we might surmise that it is being used for profiling and advertising for example.
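Rules of this form lend themselves to simple querying. Below is a minimal sketch, assuming a rule is just a triple of information class, purpose and usage; `FlowRule` and `rules_to_query` are names invented for the example.

```python
from typing import List, NamedTuple, Set


class FlowRule(NamedTuple):
    """One rule: "Info Class" for "Purpose" purpose used for "Usage"."""
    info_class: str
    purpose: str
    usage: str

    def __str__(self) -> str:
        return f"{self.info_class} for {self.purpose} purpose used for {self.usage}"


# The example rules for the flow to the social media provider.
SOCIAL_MEDIA_FLOW: List[FlowRule] = [
    FlowRule("Picture", "Primary", "Service Provisioning"),
    FlowRule("Location", "Primary", "Service Provisioning"),
    FlowRule("Time", "Primary", "Service Provisioning"),
    FlowRule("Device Address", "Secondary", "System Provisioning"),
    FlowRule("Location", "Primary", "Advertising"),
    FlowRule("Location", "Primary", "Profiling"),
]


def rules_to_query(rules: List[FlowRule], info_class: str,
                   usages: Set[str]) -> List[FlowRule]:
    """Select the rules where an info class is put to one of the given usages."""
    return [r for r in rules if r.info_class == info_class and r.usage in usages]


flagged = rules_to_query(SOCIAL_MEDIA_FLOW, "Location", {"Advertising", "Profiling"})
for rule in flagged:
    print(rule)
```

The two flagged rules are exactly the ones we would want to take back to the consent text and ask whether the user really understands and agrees to them.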

For each rule identified we are required to ask whether the source of that data in that particular data flow agrees to and understands where the flow goes, what data is transported and for what purposes; and then finally whether this is "correct" in terms of what we are ultimately promising to the consumer and according to law.

In later articles we will explore this analysis more formally and start also investigating security requirements, country requirements and higher level policy requirements such as safe harbour, PCI, SOX etc.

Sunday, 13 October 2013

Classifying Information and Data Flows

In the previous articles on data flow patterns and basic analysis of a data flow model we introduced a number of classifications and annotations to our model. Here we will explain two of these briefly:
  1. Data Flow Annotations
  2. Information Classification
Let's examine this particular data flow from our earlier example:

The first thing to notice is the data-flow annotation in angled brackets (mimicking the UML's stereotype notation) denoting the protocol or implementation used. It is fairly easy to come up with a comprehensive list of these; for example, a useful minimum set might be:
  • internal - meaning some API call over a system bus of some kind
  • http - using the HTTP protocol, eg: a REST call or similar 
  • https - using the HTTPS protocol
  • email - using email
and if necessary these can be combined to denote multiple protocols or possible future design decisions. Here I've written http/s as a shorthand.

Knowing this sets the bounds on the security of the connection, what logging might be taking place at the receiving end and also what kinds of data might be provided by the infrastructure, eg: IP addresses.

* * *

The second classification system we use is to denote what kinds of information are being carried over each data-flow. Again a simple classification structure can be constructed, for example, a minimal set might be:
  • Personal - information such as  home addresses, names, email, demographic data
  • Identifier - user identifiers, device identifiers, app IDs, session identifiers, IP or MAC addresses
  • Time - time points
  • Location - location information of any granularity, typically lat, long as supplied by GPS
  • Content - 'opaque' data such as text, pictures etc
Other classes such as Financial and Health might also be relevant in some systems.

Each of the above should be subclassed as necessary to represent specific kinds of data, for example, we have used the class Picture. The Personal and Identifier categories are quite rich in this respect.

Using high-level categories such as these affords us simplicity and avoids arguments about certain kinds of edge cases as might be seen with some kinds of identifiers. For example, using a hashed or so-called 'anonymous' identifier is still something within the Identifier class, just as much as an IMEI or IP address is. 
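One way to picture the subclassing is as an ordinary class hierarchy. The sketch below is an assumption about how one might encode it; the subclass names (`HashedIdentifier`, `IMEI`, `IPAddress`) are examples and `top_level` is a hypothetical helper.

```python
class InfoClass:
    """Root of the (illustrative) information classification."""


class Personal(InfoClass): ...
class Identifier(InfoClass): ...
class Time(InfoClass): ...
class Location(InfoClass): ...
class Content(InfoClass): ...


# Subclasses for specific kinds of data.
class Picture(Content): ...
class IPAddress(Identifier): ...
class IMEI(Identifier): ...
class HashedIdentifier(Identifier): ...  # hashing does not change the category


def top_level(cls: type) -> type:
    """Return the high-level category (direct child of InfoClass) of a class."""
    for base in cls.__mro__:
        if InfoClass in base.__bases__:
            return base
    raise ValueError(f"{cls.__name__} is not a subclass of a top-level category")
```

Reasoning at the level of `top_level(...)` sidesteps the edge-case arguments: a hashed identifier and an IMEI both resolve to `Identifier` and are subject to the same constraints.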

Note that we do not explicitly define what PII (personally identifiable information) is, but leave this as something to be inferred from the combination of information being carried both over and by the data flow in question.

* * *

Now that we have made the information content and transport mechanisms explicit, we can reason against constraints, risks and threats on our system, such as whether an unencrypted transport such as HTTP is suitable for carrying, in this case, location, time and the picture content, or whether a secure connection would be better. Then there is also the question of whether to encrypt the contents and use HTTP.

We might have the specific requirements:
  • Data-flows containing Location must be over a secured connection
  • Secured connections use either encrypted content or a secure protocol such as HTTPS or SFTP.
and translate these into requirements on our above system such as
  • The flow to any social media system must be over HTTPS
Some requirements and constraints might be very general, for example
  • Information of the Identifier class must be sent over a secured connection
Where the actual identifier is a short-lived, randomly generated number with very little 'identifiability' (to a unique person), the above constraint might be too strong and could be relaxed for that case. Each retrenchment such as this can then be specifically evaluated for the additional introduced risk.
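Requirements like these can be checked directly against an annotated model. The following sketch assumes a flow is described by its protocol, the information classes it carries and whether its content is encrypted; the class `AnnotatedFlow`, the function names and the example flows are illustrative only.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

SECURE_PROTOCOLS = frozenset({"https", "sftp"})


@dataclass(frozen=True)
class AnnotatedFlow:
    """A data flow annotated with protocol and information classes."""
    name: str
    protocol: str                  # e.g. "internal", "http", "https", "email"
    info_classes: FrozenSet[str]
    content_encrypted: bool = False


def is_secured(flow: AnnotatedFlow) -> bool:
    """Secured means a secure protocol or encrypted content."""
    return flow.protocol in SECURE_PROTOCOLS or flow.content_encrypted


def violations(flows: List[AnnotatedFlow],
               protected: FrozenSet[str]) -> List[str]:
    """Names of flows carrying a protected class over an unsecured connection."""
    return [f.name for f in flows
            if f.info_classes & protected and not is_secured(f)]


# Hypothetical flows from the photo sharing example.
flows = [
    AnnotatedFlow("to social media", "http",
                  frozenset({"Picture", "Location", "Time"})),
    AnnotatedFlow("to back end", "https", frozenset({"Identifier"})),
    AnnotatedFlow("to archive", "http", frozenset({"Location"}),
                  content_encrypted=True),
]

print(violations(flows, frozenset({"Location", "Identifier"})))
```

A retrenchment then becomes visible as a change to the `protected` set (or an explicit exception for one flow), which is precisely the thing to evaluate for additional risk.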

* * *

What we have shown here is that by simple annotation of the data flow model according to a number of categories we can reason about what information the system is sending, to whom and how. This is the bare minimum for a reasonable privacy evaluation of any system.

Indeed even with the two above categories we can already construct a reasonably sophisticated and rigorous mapping and reasoning against our requirements and general system constraints. We can even as we briefly touched upon start some deeper analysis of specific risks introduced through retrenchments to these rules.

* * * 

The order in which things are classified is not necessarily important - we leave that to the development processes already in place. Having a model provides us with unambiguous information about the decisions made over various parts of the system - applying the inferences from these is the critical lesson to be taken into consideration.

We have other classifications still to discuss, such as security classifications (secret, confidential, public etc), provenance, usage, purpose, authentication mechanisms - these will be presented in forthcoming articles in more detail.

Constructing these classification systems might appear to be hard work; certainly it takes some effort to implement them and ensure that they are actively employed, but security standards such as ISO27000 do require this.

Thursday, 10 October 2013

Analysing Data Flow Models

In the previous post we introduced a pattern for the data-flows in and out of an application such as those found on your mobile phone or tablet. In this posting I want to expand on this pattern and explore various annotations to help us reason about how our application treats the information flowing through it.

Let's first introduce an example application, a photo sharing application. This app allows you to select a photo on your device and upload it to a social media account. Here's what it looks like to the user on a mobile device with rounded corners (as opposed to square corners which none of them seem to have):

It looks innocent enough but as we know there are many threats to user privacy even in the most innocent looking of places. So let's model what is really happening behind the scenes. We understand so far a number of things: the user supplies content and credentials for the services, these are stored locally for convenience, the app adds meta-data to the picture before uploading and sends information about the app's behaviour to the inventor of the app. We might then construct the following model:

On each of the dataflows we have noted the kind of information transported over those channels and the mechanism of communication. We also note our presumed trust boundary.

What this shows clearly is where data is flowing and by what means. We have for now skipped over the specific or precise meanings of some things hoping that the terms we have used are self-explanatory.

But now that we have this formally written down, we can focus the discussion on specific aspects of the application, for example:
  • What mechanisms are being used to store the user ID and password in the "operating system"? Is this storage secure and sandboxed? I.e. how do we increase the area of trust boundary?
  • Are the communication mechanisms from the app to the social media and inventor appropriate?
  • What infrastructure information is implicitly included over these channels, for example, IP addresses, user agent strings [1] etc?
  • Does the app have authorisation to the various channels?
  • What is the granularity of the Location data over the various channels?
  • What information and channels are considered primary and which secondary?
  • Is the information flowing to the inventor appropriate and what is the content?
  • What about the EXIF data embedded in the picture?
Generating a formal list of questions from the above is relatively easy and this is exactly how we proceed.

The next questions which follow relate to how we can reduce the amount of information without compromising the application functionality and business model. For example:

  • can we reduce the granularity of the Location sent to the social media systems to, say, city level or country level?
  • can we automatically remove EXIF data?
  • do we allow the app to work if the operating system's location service is switched off or the user decides not to use this?
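Reducing location granularity is straightforward to sketch. The function below simply rounds coordinates; the name `coarsen_location` and the choice of one decimal place are my assumptions. One decimal degree of latitude is roughly 11 km, which is in the region of "city level", but the right value is a design decision. It also passes through the case where no location is available at all.

```python
from typing import Optional, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude) in decimal degrees


def coarsen_location(location: Optional[Coordinates],
                     decimals: int = 1) -> Optional[Coordinates]:
    """Reduce the granularity of a location by rounding its coordinates.

    Returns None if the location service is off or the user declined,
    so that the caller can carry on without location data.
    """
    if location is None:
        return None
    latitude, longitude = location
    return (round(latitude, decimals), round(longitude, decimals))


# Hypothetical GPS fix.
precise = (60.16986, 24.93838)
print(coarsen_location(precise))
print(coarsen_location(None))
```

Rounding is a crude method (the error in kilometres for longitude varies with latitude), so treat this as a starting point rather than a guarantee of any particular level of anonymity.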

And so on... Finally we get down to the consent related questions:

  • What does the user see upon first-time usage of the app? What do they have to agree to?
  • Do we tell the user what underlying services such as GPS we're using as part of the application?
  • Is there an opt-out for secondary data collection?
  • For what reason is the data being collected over both primary and secondary channels?
And so on again.

What we have done is set the scene, or circumscribed what we need to investigate and decide upon. Indeed at some level we even have the means to measure the information content of the application and, by implication, the extent of the consents; and if we can measure then we have a formal mechanism to decide whether one design is "better" than another in terms of privacy.

In the following articles I'll discuss more about the classification mechanisms (information, security, usage, purpose, provenance) and other annotations along with the detailed implications of these.


[1] User agent strings are very interesting...just ask Panopticlick.

Monday, 7 October 2013

Anatomy of an Application's Dataflows

To evaluate privacy in the context of an application we must understand how the information flows between the user, the application, the external services the application uses and any underlying infrastructure or operating system services.

We can construct a simple pattern* to describe this:

Obviously the User is the primary actor in all of this, so that becomes the starting point for the collection of data, which then flows via the application itself in and out of the operating system and towards whatever back-end services, either provided for the application specifically or via some 3rd party, the application requires.

Note that in the above we define a trust boundary (red, dashed line) around the application - this denotes the area inside of which the user has control over their data and confidence that the data remains "safe".

Each data-flow can be, or must be, controllable by the user through some consent mechanism: this might be presentation of a consent text with opt-in/out or a simple "accept this or don't continue installing the application"-type consent.

We then consider the six data-flows and their protection mechanisms:

Data Flow "U"  (User -> Application)
  • This ultimately is the user's decision over what information to provide the application, and even whether the user installs or even runs the application in the first place. If anything, the concern here is ensuring that the information collected is relevant and necessary to the application's experience.
  • Understanding the totality of data collected including that from additional sources and internal cross-referencing is critical to understanding this data-flow in its fullest context.
Data Flow "P" (Application -> Back-end Services)
  • This is the primary flow of data - that is the data which the application requires to function. 
  • The data here will likely be an extension of the data supplied by the user; for example, if the user uploads a picture, then the application may extend this with location data, timestamps etc.
  • The control here is typically embedded in the consent that the user agrees to when using the application for the first time. These consents however are often extended over other data flows too which makes it harder for the user to properly control this data flow
  • For some applications this data flow has to exist for applications to function.
Data Flow "S" (Application -> Back-end Services)
  • This is the secondary flow of data, that is data about the application's operations.
  • The control over this flow is typically embedded in the same first-time usage consent as data flow "P", but the option to opt-in/out has to be given specifically for this data collection, along with the usage of this data.
  • The implementation of this control may be application specific or centralised/federated over the underlying platform.
  • The data collected over here is not just from the application itself but may also include some data collected for primary means as well as any extended data collected from the infrastructure.
Data Flow  "3" (Application -> 3rd Parties)
  • Primarily we mean additional support functions, eg: federated login, library services such as maps and so on.
  • This data flow needs to be specifically analysed in the context in which it is being used but would generally fall under the same consents and constraints as data flow "P".
Data Flows "O_in" and "O_out" (Application <-> O/S, Infrastructure)
  • The underlying platform, frameworks and/or operating system provide many services such as obtaining a mobile device's current location or other probe status, services such as local storage etc.
  • Usage of these services needs to be communicated to the user and controlled in both directions, especially when contextual data from the application is supplied over data flow "O_in", eg: storage of data that might become generally available to other applications on the device
  • Collection of data over "O_out" may not be possible to control, but minimisation is always required due to the possibility that data collected over "O_out" is forwarded in some form over the data flows "P", "S" and "3".
  • Usually the underlying libraries and functionality of the platform are provided in the application's description before installation, eg: this application uses location services; though rarely is it ever explained why.
Any data-flow which crosses the trust boundary (red, dashed line) must be controllable from the user's perspective so that the user has a choice of what data leaves their control. Depending upon the platform and type of application this boundary may be wholly or partially inside the actual application process itself - care must be taken to ensure that this boundary is as wide as possible to ensure that the user does have trust in how that application handles their data.

The implementation of the control points on each of the data flows, as has been noted, may be application specific or centralised across all applications. How the control is presented is primarily a user-interface matter, and what controls exist and the granularity of those controls is a user-experience matter.

The general pattern here is that for each data-flow that crosses the trust boundary, a control point must be provided in some form. At no point should the user ever have to actually run the application or be in a state where information has to be sent over those data-flows without the control point being explicitly set.
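The control-point rule can be expressed as a guard that refuses to send anything until the user has made an explicit choice. This is a minimal sketch under my own assumptions: the `ControlledFlow` class, the `ConsentNotSet` exception and the three-state consent value are illustrative, not a prescribed design.

```python
from dataclasses import dataclass
from typing import Optional


class ConsentNotSet(Exception):
    """Raised when a boundary-crossing flow is used before the user has chosen."""


@dataclass
class ControlledFlow:
    """A data flow with a control point on the trust boundary."""
    name: str
    crosses_trust_boundary: bool
    consent: Optional[bool] = None  # None = the user has not been asked yet

    def may_send(self) -> bool:
        """Decide whether data may be sent over this flow."""
        if not self.crosses_trust_boundary:
            return True  # stays inside the boundary, no control point needed
        if self.consent is None:
            raise ConsentNotSet(f'control point for flow "{self.name}" is not set')
        return self.consent


# Hypothetical flows named after the pattern above.
flow_u = ControlledFlow("U", crosses_trust_boundary=False)
flow_s = ControlledFlow("S", crosses_trust_boundary=True)
```

Treating "not yet asked" as an error rather than as an implicit yes or no is the important design choice: the application cannot quietly default its way across the trust boundary.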

So this constitutes the pattern for application interaction and data-flow; specific cases may have more or less specific data-flows as necessary.

Additional Material:

* There's a very good collection of patterns here at Privacy Patterns, though I've rarely seen patterns targeted towards the software engineer and in the GOF style, which is something we really do need in privacy! Certainly the patterns described at Privacy Patterns can be applied internally to the data-flow pattern given here - then we start approaching what we really do need in privacy engineering!

Sunday, 6 October 2013

What I look for in a software engineer

When called to assess or interview someone for a software engineering position there are a few things I look for: the core set of knowledge on which we can build to support the more specialist or esoteric skills such as security, privacy and the rest. I discussed ideas for what would be required for a grounding in concepts for privacy earlier.

Programming Skill
  • This is obviously the first question - "can you program?" and then in what languages. The usual candidates will always be there, ie: Java, Python, C, C++ etc, but I'd like to see things such as SQL, Prolog, Lisp and maybe things like Haskell, ML, Ada, Eiffel etc. Depending upon the position, also PHP, Perl, Unix shell script etc.
  • Ideally the candidate should understand and appreciate the differences and respective advantages of the various programming paradigms: structured, object-oriented, functional etc
  • "Programming" in markup languages is not acceptable: HTML IS NOT A PROGRAMMING LANGUAGE.

Design Skill
  • The ability to use design languages such as UML, SDL etc, must always be a prerequisite. Correct use of these comes with a comprehension of abstraction and refinement of designs.
  • Abstraction is a very underappreciated and rarely seen skill. The ability to concentrate on what is required without mixing abstraction levels unnecessarily is one of the marks of an engineer.

Discipline and Formality
  • Despite agile being the saviour of everything, a true appreciation of what agile actually is, is necessary. The understanding that designs follow a natural flow from concept through design to implementation is invariant - good software cannot be hacked.
  • Communication is obviously part of this and clear coding, design and concepts are fundamental.
  • Experience with formal methods is always a clear advantage. Now whether someone actually uses Alloy, B, Z, VDM etc is another matter altogether; the skills and discipline that come with these are what we're looking for. Some of the best "coders" and architects I have come across are formal methods people at heart; indeed I know of one agile guru who attributes his success to applying formal methods within an agile framework (see communication, abstraction and refinement above).
  • This also extends to areas such as testing of code, simulation of design and defensive programming through ideas such as design by contract - all of which contribute to a better quality of development and code.
  • Does the software engineer have a grounding in computer science and/or mathematics? Having this allows the engineer to concentrate on the underlying principles of things without being distracted by arguments over whether something should be represented in XML vs RDF vs JSON, or SQL vs NoSQL etc.
Obviously this just touches the surface but I hope it gives some idea of what is required. Not all skills are needed, so it is fine if someone doesn't have 3 years' experience of category theory and programming with monads (the minimum to write "hello world" in Haskell, I'm told), but an appreciation and understanding is required.

Ultimately engineering is about solving problems with discipline, rigour and use of the appropriate techniques.

Wednesday, 2 October 2013

Top Ten Privacy Threats and Risks

OWASP publishes a Top Ten Security Threat list every year and all things being equal there is a demand for a similar Top Ten Privacy Threat list; except that a nice, neat list like OWASP's doesn't exist.

The other problem with a Top 10 list is that it implicitly promotes one specific threat over another - at least to me the metrics that define the ordering aren't clear. So without lingering on the metrics, just a search for "the top 10 privacy threats (that should be taken together equally)" reveals the following:
Geo Tags, Wifi Sniffing/War Driving, Facial Recognition, Censorship, SmartPhones, Data Stealing, Hackers, Social Networks, You, Poor Network security, Improper Data Handling, Improper Data Destruction, Identity Theft, Passwords, Social Engineering, Cloud, Cookies, Tracking, Location, Media Sharing, Government
which is quite a list and in quite a few cases either blames security, a whole technology, eg: "Cloud", or verges on paranoia, eg: it's the Government's fault (ok, so that might be true but there's not a lot you can do without political or societal change).

I'd like to start with the following in no particular order*:

Location Gathering
  • Practically every mobile device can capture location either through GPS, CellID or Wifi positioning (for the latter even your static home PC/Mac/xyz can too!)
  • While some applications depend upon location, eg: mapping, navigation, location, others use it for superfluous or dubious extra features.
  • This is often found combined with secondary data collection and forced consent.
Media Sharing
  • When you send an email, make a call, share a picture or tweet a comment, not only is the content there but also the meta-data, including location, device used, IP addresses, user identifiers, machine identifiers, to whom the material was addressed, time stamps and so on.
  • The NSA and GCHQ (and others!) are just doing what Facebook, Google and everyone else is doing. Twitter, Facebook and others make your data available generally too - who needs wiretapping?!
  • The actual content of the message is almost secondary to the above; that requires further processing which may be superfluous to what is already there by default.
Improper Data Handling
  • I've covered the guidelines for data handling, but the number of people who have access to your data is quite substantial. Some have legitimate access directly such as system administrators and certain analysts, but once data leaves the control of a core set of people then all bets are off.
  • Here's a good set of search results: [Google] [Bing]
  • Do Not Track is the mantra, yet the W3C's attempt seems to live and die like Schrödinger's cat. 
  • Identifiers are inherent throughout the protocol stacks we are using. Indeed even the most innocuous identifiers such as random session IDs can be used to track someone
  • Even if we get rid of identifiers we still have semantics and stylistic analysis of the content and not forgetting a host of other fingerprinting techniques.
  • Cross-referencing two data sets leads to huge leaps in understanding.
  • Identifiers as database keys are the common method, but most fields can be matched even in imprecise and statistical ways leading to novel methods of tracking.
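
To make the cross-referencing point concrete, here is a small Python sketch that joins two data sets on ordinary fields rather than on a shared database key. The data, the field names and the `link_records` helper are all made up for illustration, and it only does exact matching; the imprecise and statistical matching mentioned above would need fuzzier comparison than this.

```python
def link_records(left, right, fields):
    """Pair up records from two data sets that agree on every field in `fields`."""
    index = {}
    for rec in right:
        key = tuple(rec[f] for f in fields)
        index.setdefault(key, []).append(rec)

    matches = []
    for rec in left:
        key = tuple(rec[f] for f in fields)
        for other in index.get(key, []):
            matches.append((rec, other))
    return matches


# Two hypothetical data sets, neither of which shares an identifier with the other.
purchases = [
    {"postcode": "00100", "birth_year": 1975, "item": "book"},
    {"postcode": "00200", "birth_year": 1982, "item": "phone"},
]
members = [
    {"postcode": "00100", "birth_year": 1975, "name": "Alice Example"},
    {"postcode": "00300", "birth_year": 1990, "name": "Bob Example"},
]

linked = link_records(purchases, members, ["postcode", "birth_year"])
for purchase, member in linked:
    print(member["name"], "probably bought a", purchase["item"])
```

Neither data set names the buyer on its own, yet the pair of postcode and birth year is enough to tie a name to a purchase in this toy case.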
Semantic Misunderstanding
Time Series/Temporal Databases
  • Capturing any data over time will reveal patterns. This is one of the cornerstones of BigData
  • No such thing as anonymity
  • The underlying protocols of the internet reveal huge amounts of meta-data even without relying on the content you're sending
  • Not only that but your "fingerprint", that is the pattern of usage and data you leave identifies you, for example, the pairs of locations you enter into your car navigator...
  • Obfuscating identifiers using hashing (even salted hashes) still leaves a valid, consistent (over time) identifier to tie things together.
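
The hashing point can be seen in a few lines of Python. This is only a sketch under stated assumptions: the salt, the device identifiers and the `pseudonymise` helper are invented for the example, and the salt is assumed to be fixed for the lifetime of the data set (which is the usual case when the hashed value has to act as a join key).

```python
import hashlib

SALT = b"example-salt"  # hypothetical salt, assumed fixed across the data set


def pseudonymise(identifier):
    """Replace an identifier with a salted SHA-256 digest."""
    return hashlib.sha256(SALT + identifier.encode("utf-8")).hexdigest()


# Hypothetical event log: (device identifier, day, place)
events = [
    ("device-1234", "2013-10-01", "home"),
    ("device-9999", "2013-10-01", "airport"),
    ("device-1234", "2013-10-02", "office"),
]

# Group events by the hashed identifier only; the raw identifier is never stored.
trail = {}
for device, day, place in events:
    trail.setdefault(pseudonymise(device), []).append((day, place))

for pseudonym, visits in trail.items():
    print(pseudonym[:12], visits)
```

The raw identifier is gone, but the same device still maps to the same digest every time, so its movements over time are still tied together.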
Forced or Implicit Consent 
  • The most annoying thing when trying a new application (or app!) is the forced consent to data collection, for example, many applications will not start unless you've consented to location or other data capture which might be inappropriate for that application.
  • Consider this example taken from a random app in the Windows Phone store...
  • Is it really necessary for a calendar app to require access to so much information? Note also that it is not explained here why the app needs this or whether that information is communicated.
Secondary Data Collection 
Privacy By Design
  • A bit controversial this one, but a simple list of principles with a huge semantic gap between those and what the engineers and programmers have to do doesn't help anyone, except those who write documents enshrining principles and engage in a "we're more private than you" battle.
  • The Agile Manifesto doesn't by itself create better code but relies upon legions of skilled engineers to properly understand and implement its principles; PbD is not aimed at the engineer. Lessig had something to say about this: Code is Law.

So that's my personal set, described from the consumers' perspective. I'll follow from here in a later article about how we as engineers and developers can deal with the above without compromising business needs.

*I know there's more than 10, but 11 is better...