Tuesday 18 December 2012

Code is Law, Inherent Privacy and a Few Uncomfortable Issues

Lawrence Lessig stated that "code is law" - a maxim that above all should be the most critical in software engineering, especially when put in the context of implementing privacy and security.

I want to talk about some issues that worry me slightly (ok, a lot!). The first is that, despite policies, laws etc., the final implementation of anything related to privacy is in the code the programmers write. The second is that we are building our compliance programmes upon grand schemes and policies while paying only piecemeal attention to the actual act of software engineering. The latter we attempt to wrap up in processes and "big ideas", for example, Privacy by Design.

Now before the PbD people get too upset, there's nothing wrong with stating and enumerating your principles - the Agile Manifesto is a great example of this - however there is no doubt that many implementations of agile are poor at best and grossly negligent and destructive at worst. The term used for the result is "technical debt".

Aside: the best people I've seen conduct software development in an agile manner are formal methods people...I guess due to the discipline and training in the fundamentals they've received. This also applies to experienced architects, engineers and programmers for whom much of this formality is second nature.

Addressing the first point: no matter how many policies you write, consumer advocates you hire or promises you make, at the end of the day privacy must be engineered into the architecture, design and code of your systems. It does not matter how many PowerPoint slides or policy documents or webpages you write: unless the programmers "get it", you can forget privacy, period!

Aside: Banning powerpoint may not be such a bad idea....

Herein lies a problem: the very nature of privacy in your systems means that it crosscuts every aspect of your design and ultimately your whole information strategy. Most of these concerns do not obviously manifest themselves in the design and code of your systems.

To solve this there must be a fundamental shift from the consumer advocacy and legal focus of privacy to a much deeper technical, engineering, even scientific approach. This does not just mean focusing on the design and code, though that is fundamental to the implementation, but addressing the whole stack of management and strategy from the highest directors to the programmers.

I've seen efforts in this direction, but they stop at product management - "Hey, here are the privacy requirements - implement them!" ... which does feel good in that you are interacting, or believe you are interacting, with the products you are producing, but still not sufficiently with the people who really build them. Just producing requirements doesn't help: you need that interaction and communication right across the company.

Of course all of the above is extremely difficult and leads us to our next point which is how we build our compliance programmes in the first place. The simple question here is "are you fully inclusive?", meaning do you include programmers, architects (technical people with everyday experience) or is the programme run by non-technical, or formerly technical staff? Invariably it is the latter.

Compliance programmes must be inclusive otherwise the necessary inherency required to successfully and sufficiently implement the ideas and strategies of that programme will be lost - usually in a sea of powerpoint and policy documents.

Firstly, in order to achieve inherent privacy (or security, or xyz), focus must lie on onboarding and educating the programmers, designers and architects, with less focus on management, prescription and consumer advocacy. Secondly, any compliance programme must be inclusive and understand the needs of said technical staff. Thirdly, the engineering and technical staff are the most critical components in your organisation.

Compliance programmes are often measured on the amount of documentation produced (number of slides even?); however this ends up as a self-feeding process where, for the compliance programme to survive, it needs to keep the fear of non-compliance at the fore. Read Jeff Jarvis' article Privacy Inc.: Scare and Sell and then Jim Adler's talk at PII2012 on The Emergent Privacy-Industrial Complex and you get an idea of what is going wrong. Avoid at all costs creating a privacy priesthood in your compliance programmes.

Aside: This might be good old fashioned economics - if a compliance programme actually worked then there'd be no need for the programme in the end.

There are two interrelated caveats that also need to be discussed, the first of which is that any work in privacy will expose the flaws, holes and crosscutting issues across your products, development programmes and management and engineering skill bases. For example, a request to change a well crafted design to cope with some misunderstood ambiguity in privacy policy is going to end in tears for all concerned. It will demand of management, engineering and your compliance programme a much deeper [scientific] knowledge of what information your products are using, carrying, collecting and processing - to a degree uncommonly found in current practices.

The fundamental knowledge required to really appreciate information management and privacy is extensive and complex. Awareness courses are a start, but I've seen precious few courses even attempting to cover the subject of privacy from a technical perspective.

Secondly, privacy will force you to examine your information strategy - or even create an information strategy - and ask very awkward and uncomfortable questions about your products and goals.




Saturday 24 November 2012

Maxims for Privacy

A collection of maxims relating to privacy might be interesting, as we seem to have some common principles that fit well into this developing field. Anyway, here are some that come to mind from my work:

Don't collect what you don't use

If you have no use for the information you're collecting from the user, or can attribute no value to it in terms of the product being presented, then don't collect it at all.

If it looks like PII, it probably is PII, so treat it as PII

PII doesn't just mean identifiers but can extend to other kinds of data, eg: device identifiers, session identifiers, tracking data etc...if you can map a set of data to a single unique person, or even to a sufficiently small group of persons, then it probably is PII.

Don't shock the user

You collected what??!?!!  You shared what without my permission?!?!  How did you learn that about me??!?! I didn't agree to that??!?!?  T&C's...I didn't understand them, let alone read them...and you still did 'x' !! Enough said...

Location data isn't just GPS co-ordinates

Maybe this could be rewritten as "Never underestimate the power of geolocation functions"....it does however come as a surprise that many types of data can be transformed into locations, even if only coarse ones...who needs a GPS receiver anyway?! Pictures often contain vast amounts of EXIF data including locations, and mobile phone cell IDs can be mapped to precise locations and triangulated, for example...
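As an illustration of just how little effort this takes, here is a minimal sketch (assuming a recent version of the Pillow library; "holiday.jpg" is a hypothetical file) that pulls GPS coordinates straight out of a photograph's EXIF metadata:

# Minimal sketch: recover the photographer's location from a JPEG's EXIF data.
# Assumes the Pillow library; "holiday.jpg" is an invented example file.
from PIL import Image
from PIL.ExifTags import GPSTAGS

def exif_gps(path):
    exif = Image.open(path)._getexif() or {}
    gps = {GPSTAGS.get(k, k): v for k, v in exif.get(34853, {}).items()}  # 34853 = GPSInfo tag
    if not gps:
        return None
    def to_degrees(dms, ref):
        deg = dms[0] + dms[1] / 60 + dms[2] / 3600
        return -deg if ref in ("S", "W") else deg
    return (to_degrees(gps["GPSLatitude"], gps["GPSLatitudeRef"]),
            to_degrees(gps["GPSLongitude"], gps["GPSLongitudeRef"]))

print(exif_gps("holiday.jpg"))  # e.g. (60.1719, 24.9414) - no GPS receiver needed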

Good security does not mean good privacy, but good privacy doesn't come without good security

If your system isn't protecting the communications and data then it doesn't matter what you do to obfuscate or anonymise the data. Actually getting the balance between these sorted requires some serious engineering competence - security and privacy engineers are few and far between.

All information can be transformed and cross-referenced into whatever you need.

I need to think about how that one is worded, but whatever data you have can be transformed (eg: extracting usernames and other parameter data from URLs, metadata extraction etc.) and cross-referenced...
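To make that concrete, a trivial sketch (the URL is invented) showing how identifiers and even location can be pulled out of a logged URL with nothing more than the Python standard library:

# Small sketch: recovering identifiers "hidden" in an innocuous-looking logged URL.
from urllib.parse import urlparse, parse_qs

url = "https://example.com/profile/joeBloggs/photos?session=abc123&lat=60.17&lon=24.94"
parsed = urlparse(url)
params = parse_qs(parsed.query)

username = parsed.path.split("/")[2]             # path segment -> account identifier
session = params["session"][0]                   # session identifier -> traceability
location = (params["lat"][0], params["lon"][0])  # query parameters -> location data

print(username, session, location)               # joeBloggs abc123 ('60.17', '24.94')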

This one is more of an apophthegm, and a reminder that we need to serve the users not ourselves:

Security through Obscurity, Privacy through PowerPoint and Policies...

See Jim Adler's talk at #pii2012 on the Emergent Privacy-Industrial Complex - a term coined by Jeff Jarvis in an article called Privacy, Inc: Scare and Sell. But seriously, we forget that ultimately privacy gets implemented through good, sound software engineering in C, Java, Python etc. Getting privacy as an inherent part of system engineering practice (not just processes) is probably the most critical thing for getting all those policies, laws and good intentions implemented. Just as Schneier uses the term "security theater", I think we can often use "privacy theater" to describe many of the things we are doing; see Ed Felten's posting on this on the Freedom to Tinker site.

Thursday 22 November 2012

Information Privacy: Art or Science?

I was handed a PowerPoint deck today containing notes for a training course on privacy. One thing that struck me was the statement on one of the slides; in fact it was the only statement on that slide:

PRIVACY IS AN ART

This troubles me greatly, and the interpretation of this probably goes a long way towards explaining some things about the way information privacy is perceived and implemented.

What do we mean by art, and does this mean that privacy is not a science?

Hypothesis 1: Privacy is an art

If you've ever read great code it is artistic in nature. You can appreciate the amount of understanding and knowledge that has gone into writing that code. Not just in the act of writing, or the layout and indentation, but in the design of the algorithms, the separation of concerns, the holistic bigger picture of the architecture. Great code requires less debugging, performs well, stays in scope, and if it ever does require modification, it is easy to do. Great programmers are scientists - they understand the value of the code, they avoid technical debt, they understand the theory (maybe only implicitly) and the science and discipline behind their work, and in that respect they are the true artists of their trade.

For example, Microsoft spent a lot of effort on improving the quality of its code, with efforts such as the still excellent book Code Complete by Steve McConnell. This book taught programmers great techniques to improve the quality of their code. McConnell obviously knew what worked and what didn't from a highly technical perspective, based on a sound, scientific understanding of how code works, how code is written, how designs are made and so on.

I don't think information privacy is an art in the above sense.

Hypothesis 2: Privacy is an "art".

In the sense that you know you're doing privacy well in much the same way as a visitor to an art gallery knows "great art". Everyone has their own interpretation and religious wars spring forth over whether something is art or not.

Indeed here is the problem, and in this respect I do agree that privacy is art. Art can be anything from the formal underpinnings of ballet to the drunken swagger of a Friday night reveler - who is to say that the latter is not art? Compare ballet with forms of modern and contemporary dance: ballet is almost universally considered "art" while some forms of contemporary dance are not - see our drunken reveler at the local disco...this is dance, but is it art?

Indeed sometimes the way we practice privacy is very much like the drunken reveler, while telling everyone at the same time that "this is art!"

What elevates ballet, or the great coder, to become art is that they both have formal, scientific underpinnings. Indeed I believe that great software engineering and ballet have many similarities and here we can also see the difference between a professional dancer and a drunken reveler on the dance floor: one has formal training in the principles and science of movement, one does not.

Indeed if we look at the sister to privacy: security, we can be very sure that we do not want to practice security of our information systems in an unstructured, informal, unscientific manner. We want purveyors of the art - artists - of security to look after our systems: those that know and intuitively feel what security is.

There are many efforts to better underpin information privacy, but rarely do these come through in the software engineering process in any meaningful manner unless explicitly required or audited for. Even then we are far from a formal, methodical process by which privacy becomes an inherent property of the systems we are building. When we achieve this as a matter of the daily course of our work, then, and only then, will privacy become an art practiced by artists.

Tuesday 20 November 2012

Understanding PII

The most important question in data management related to privacy is whether a given set of data contains personally identifiable information (PII). While PII is fairly well defined (maybe not with a strict formal definition), the question of what constitutes PII and how to handle it often remains. In this article I intend to describe some notions on the linkability and traceability of identifiers and their presence in a data set.

It is worth mentioning that the definition of PII is currently being addressed by the EU Article 29 Working Party, which will eventually bring additional clarification at a legal level beyond that existing in the current privacy directives. Ultimately however, the implementation of privacy is in the hands of software and system architects, designers and programmers.

I'll start with a (personal) definition: data is personally identifiable if, given any record of that data, there are fields that contain data that can be linked to a unique person, or to a sufficiently small group with a common, unique attribute or characteristic.

We are required then to look at a number of issues:
  • what identifiers or identifying data is present
  • how linkable the data is to a unique person
  • how traceable a unique person is through the data set
Much is based upon the presence of identifiers in a data set and the linkability of those identifiers. For example, user account identifiers and social security numbers are very obviously in a one-to-one correspondence with a unique person and thus are trivially linkable to a unique user. Identifiers such as IMEIs or IP addresses do not have this one-to-one correspondence. There are some caveats in situations such as when one user account or device is shared or used occasionally (with permission perhaps) by more than one person.

The notion of a unique person does not necessarily mean that an actual human being can be identified, but rather that one can be traced unambiguously through the data. For example, an account identifier (eg: username) identifies a unique person but not a unique human being - this is an important, if subtle, distinction.

We can construct a simple model of identifiers and how these identifiers relate to each other and to "real world" concepts - some discussion of this was made in earlier articles. These models of identifiers and their relationships and semantics are critical to a correct understanding of privacy and to the analysis of data. Remarkably this kind of modelling is very rarely done and very rarely understood or appreciated - at least until the time comes to cross-reference data and these models are missing (cf: semantic isolation).
[Figure: Example semantic model of identifier relationships]

Within these models, as seen earlier, we can map a "semantic continuum" from identifiers that are highly linkable to unique and even identified persons to those which are not.

Further along this continuum we have identifiers such as the various forms of IMEI, telephone numbers, IP addresses and other device identifiers. Care must be taken in that, while these identifiers are not highly linkable to a unique person, devices - and especially mobile devices - are typically used by a single unique person, leading to a high degree of inferred linkability.

Device addresses such as IP addresses have come under considerable scrutiny regarding whether they do identify a person. Cases where the IP address of a router has been used to identify whether someone has been downloading copyrighted or other illegal material are problematic for the reasons described earlier regarding linkability. In the specific case of IP addresses, network address translation, proxies and obfuscation/hiding mechanisms such as Tor complicate and minimise linkability.

As we progress further along we reach the application and session identifiers. Application identifiers, if they link to applications and not to individually deployed instances of applications, are certainly not PII unless the application has a very limited user base: an example of a sufficiently small group with common characteristics. For example, an identifier such as "SlighlyMiffedBirdsV1.0" used across a large number of deployments is very different from "SlighlyMiffedBirdsV1.0_xyz" where xyz is issued uniquely to each download of that application. Another very good example of this kind of personalisation of low-linkability identifiers is the user agent string used to identify web browsers.

Session identifiers ostensibly do not link to a person but can reveal a unique person's behaviour. On their own they do not constitute PII. However session identifiers are invariably used in combination with other identifiers which does increase the linkability significantly. Session identifiers are highly traceable in that sessions are often very short with respect to other identifiers - capture enough sessions and one can fingerprint or infer common behaviour of individual persons.

When evaluating PII, identifiers are often taken in isolation and analysed there. This is one of the main problems in evaluating PII: identifiers rarely exist in isolation, and a combination of identifiers taken together can reveal a unique identity.
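To make this concrete, here is a small sketch over invented records showing how the size of the group a record falls into shrinks as more fields are combined - any combination that yields a group of one is, by the working definition above, effectively PII:

# Sketch: how combinations of low-linkability fields become identifying (records are invented).
from collections import Counter

records = [
    {"postcode": "00100", "birth_year": 1975, "device": "Onkia Luna 700"},
    {"postcode": "00100", "birth_year": 1975, "device": "Onkia Luna 700"},
    {"postcode": "00120", "birth_year": 1982, "device": "Onkia Luna 700"},
]

def group_sizes(records, fields):
    # Size of the group each record falls into when only `fields` are known.
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    return [counts[tuple(r[f] for f in fields)] for r in records]

print(group_sizes(records, ["device"]))                            # [3, 3, 3]
print(group_sizes(records, ["postcode", "birth_year", "device"]))  # [2, 2, 1] - last record is unique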

Dealing with identifiers alone, and even deciding what identifiers are in use, provides the core of deciding whether a given data set is PII. However, identifiers provide the linkability; they do not provide traceability, which is given through the temporal component that is often, if not invariably, present in data sets. We will deal with other dimensions of data later, where we look deeper into temporal, location and content data. Furthermore we will also look at how data can be transformed into other types of data to improve traceability and linkability.

Monday 19 November 2012

Do Not Track and Beyond (W3C Workshop)

There's a workshop on Do Not Track organised by the W3C:

W3C Workshop: Do Not Track and Beyond  

26-27 November 2012

Background

Out of the April 2011 W3C workshop on Web Tracking and User Privacy, W3C chartered its Tracking Protection Working Group, which commenced work in September. The Working Group has produced drafts of Do Not Track specifications, concurrent with various implementations in browsers and Web sites and alongside heightened press and policymaker attention. Meanwhile, public concern over online privacy - be it tracking, online social networking or identity theft - remains.

A large number of very interesting papers are to be presented, including one of mine based on earlier articles I have written here on a dystopian post-DNT future and the update to that.

Ian Oliver (2012)  An Advertisers Paradise: An Adventure in a Dystopian Post-“Do Not Track World”? W3C Workshop: Do Not Track and Beyond, 26-27 November 2012.

 

Tuesday 13 November 2012

Measuring Privacy against Effort to Break Security

As part of my job I've needed to look at metrics and measurement of privacy. Typically I've focussed on information entropy versus, say, number of records (define "record") or other measurements such as the amount of data, which do not take into consideration the amount of information, that is, the content of the data being revealed.

So this led to an interesting discussion* with some of my colleagues where we looked at a graph like this.

The y-axis is a measure of information content (ostensibly information entropy with respect to some model) and the x-axis a measure of the amount of force required to obtain that information. For any given hacking technique we can delineate a region on the x-axis which corresponds to the amount of sophistication or effort placed into that attack. The terms effort and force here come from physics, and I think we even have some ideas on how the dimensions of these map to the security world, or actually what these dimensions might be.

So for a given attack 'x', for example an SQL injection attack against some system to reveal some information 'M', we require a certain amount of effort just for the attack to reveal something. If we make a very sophisticated attack then we potentially reveal more. This is expressed as the width of the red bar in the above graph.

One conclusion here is that security people try to push the attack further to the right and even widen it, while privacy people try to lower and flatten the curve, especially through the attack segment.

Now it can be argued that even with a simple attack, over time the amount of information increases, which brings us to a second graph which takes this into consideration:


Ignoring the bad PowerPoint+Visio 3D rendering, we've just added a time scale (z-axis, future towards the back); we can now capture, or at least visualise, the statement above that even an unsophisticated attack over time can reveal a lot of information. Then there's a trade-off between a quick, sophisticated attack and a long, unsophisticated attempt.

Of course a lot of this depends upon having good metrics and good measurement in the first place, and that is something we do have real difficulties with, though there is some pretty interesting literature [1,2] on the subject and, in the case of privacy, some very interesting calculations that can be performed over the data such as k-anonymity and l-diversity.
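Purely as an illustration of the entropy side of such a metric, the sketch below computes the Shannon entropy (in bits) of a released attribute over an invented column of values - the coarser the release, the lower the entropy and the less is revealed:

# Sketch: Shannon entropy of a released attribute as a crude privacy metric (values invented).
import math
from collections import Counter

def shannon_entropy(values):
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

city_level = ["Helsinki"] * 90 + ["Espoo"] * 10        # coarse-grained release
street_level = [f"street_{i}" for i in range(100)]     # fine-grained release

print(shannon_entropy(city_level))    # ~0.47 bits
print(shannon_entropy(street_level))  # ~6.64 bits - far more revealing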

I have a suspicion that we should start looking at privacy and security metrics from the dimensional analysis point of view and somewhat reverse engineer what the actual units, and thus measurements, are going to be. Something to consider here is that the amount of effort or force of an attack is not necessarily related to the amount of computing power; for example, a brute-force attack on a hash function is not as forceful as a well planned hoax email and a little social engineering.

If anyone has ideas on this please let me know.


References

[1] Michele Bezzi (2010) An information theoretic approach for privacy metrics. Transactions on Data Privacy 3, pp:199-215
[2] Reijo M. Savola (2010) Towards a Risk-Driven Methodology for Privacy Metrics Development. IEEE International Conference on Social Computing / IEEE International Conference on Privacy, Security, Risk and Trust.



*for "discussion" read 'animated and heated arguments, a fury of writing on whiteboards, excursions to dig out academic papers, mathematics, coffee etc' - all great stuff :-)

Tuesday 6 November 2012

Inherent Privacy

I've long had a suspicion that when building information systems, and specifically when reviewing and auditing such systems, the techniques that need to be used and developed are effectively the very same tools and techniques used in the construction of safety-critical and fault-tolerant systems.

As privacy is fast becoming the number one issue (if it isn't already) with regard to consumers' data, the amount of effort in consumer advocacy and legal aspects is outstripping the ability of the technical community to keep up with the required architectures, platforms, design, implementation and techniques for achieving this. Indeed there is some kind of arms race going on here and I'm not sure it really is to the benefit of the consumer.

For example, the Do Not Track technical community has come under criticism for not delivering a complete solution [1,2]. I don't think this really is 100% the fault of the W3C or the good people developing these standards, but rather the lack of
  1. an understanding of information systems (even in the theoretical sense), and 
  2. applicable and relevant tools and techniques for the software engineering community, who at the end of the day are the ones who will end up writing the code that implements the legal directives in some form or other. But we digress.
Performing a little research we come across the term "Inherent Safety" (see [3]), defined:
Inherent safety is a concept particularly used in the chemical and process industries. An inherently safe process has a low level of danger even if things go wrong. It is used in contrast to safe systems where a high degree of hazard is controlled by protective systems. It should not be confused with intrinsic safety which is a particular technology for electrical systems in potentially flammable atmospheres. As perfect safety cannot be achieved, common practice is to talk about inherently safer design. “An inherently safer design is one that avoids hazards instead of controlling them, particularly by reducing the amount of hazardous material and the number of hazardous operations in the plant” [3].

Taking this as a starting place I decided to have a go at rewriting the principles in the privacy context as below - taking the extensions as proposed in [4] into consideration:

  • Minimize: reducing the amount of information/data present at any one time either in local storage, remote storage cache or network transfer
  • Substitute: replace one element of data with another of less privacy risk, eg: abstract GPS coordinates to city areas (see the sketch after this list)
  • Moderate: reduce the strength of a process to transform or analyse data, eg: reduce the amount of cross-referencing over a number of sets of data
  • Simplify: reduce the complexity of processes to control data, eg: single opt-out, one-time consent, simple questions (and not asking the user what resources an app should have access to...)
  • Error Tolerance: design the system/software and processes to deal with worst cases, eg: misbehaving applications sending too much data are blocked in some sense
  • Limit Effects: designing the system such that the effects of any leak of data are minimised, eg: properly anonymised or pseudo-anonymised data sets, secure transport layers, encryption etc
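As a rough illustration of what Minimize and Substitute can mean at the code level (the grid size is chosen arbitrarily), a sketch that snaps precise GPS coordinates to a coarse grid cell before they are stored or transmitted:

# Sketch: substituting precise GPS coordinates with a coarser, less risky representation.
def substitute_location(lat, lon, cell_degrees=0.1):
    # Snap a coordinate to a roughly 10 km grid cell before storing or transmitting it.
    snap = lambda x: round(x / cell_degrees) * cell_degrees
    return (snap(lat), snap(lon))

# Helsinki Central Railway Station -> an anonymous-ish grid cell covering central Helsinki
print(substitute_location(60.1719, 24.9414))  # roughly (60.2, 24.9)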

While admittedly not fully worked out, I feel that this is more communicable and understandable to software engineers than, say, Privacy by Design, which, while it lays out a good set of principles, is too high-level and abstract to map to engineers and their day-to-day work. Actually I feel the problem with the principles of Privacy by Design is that they can be (and are!) taken like the principles of the Agile Manifesto, leading to some bizarre and extreme (and wrong!) ideas of what agile processes are - just take a look at some of the later writings of Ward Cunningham or Scott Ambler on the subject.

One aspect of the inherent safety idea that particularly appeals is that it is grounded in engineering and practical development rather than being merely a set of principles. Indeed much of the grounding for this work comes from a very practical need and was developed through the sound engineering practice espoused by Trevor Kletz. His quote "what you don't have, can't leak" applies equally to information as it does to hazardous substances; Kletz's book [5] should maybe become required reading along with Solove and Nissenbaum.

As a further example, the HAZOP (hazard and operability study) method(s) are formal methods in the true sense of the word, constructing a small, formally defined vocabulary and modelling standards - compare the process flow diagram in chemical and process engineering with the data-flow diagram in software engineering, for example.

I'd like to finish with a reference to HACCP (hazard analysis and critical control points), which itself has a number of principles (seven seems to be a common number), but here I'd like to concentrate for the moment on just two:

Principle 2: Identify critical control points. – A critical control point (CCP) is a point, step, or procedure in a food manufacturing process at which control can be applied and, as a result, a food safety hazard can be prevented, eliminated, or reduced to an acceptable level.

Where do we control the flow of information? Is it near the user or far from them? Is it honour-based, and so on? The further from the source of the information, the greater the chance of leakage (both in information and chemical systems).

Principle 4: Establish critical control point monitoring requirements. – Monitoring activities are necessary to ensure that the process is under control at each critical control point. In the United States, the FSIS is requiring that each monitoring procedure and its frequency be listed in the HACCP plan.

This is something that I guess we're very bad with - do we ever monitor what information we keep in databases? Even the best intentions and best designs might still leak something, and this is especially true when working with large sets of information that can be cross referenced and fingerprinted. Maybe we should consider some kinds of information to be analogous to bacteria or viruses in a food handling situation?

So, just an example of where we should be getting the basis for the tools, techniques and principles of really engineering for privacy. I stand by my assertion that in order to engineer information systems correctly for privacy we must consider those systems to be safety-critical and treat them accordingly. I'll discuss counterarguments and how we get such techniques into our "agile" software engineering processes later.

References

[1] Stefan Kremp, European Commissioner concerned about "Do Not Track" standard, The H Open. 12 October 2012

[2] Claire Davenport, Tech standards body diluting Web privacy: EU official, Reuters, 10 October 2012

[3] Heikkilä, Anna-Mari. Inherent safety in process plant design. An index-based approach. Espoo 1999, Technical Research Centre of Finland, VTT Publications 384. ISBN 951-38-5371-3

[4] Khan, F. I. & Amyotte, P. R. (2003) How to make inherent safety practice a reality. Canadian Journal of Chemical Engineering, vol 81, pp 2-16

[5] Kletz, T. A., (1991) Plant Design for Safety – A User-Friendly Approach, Hemisphere, New York

Thursday 18 October 2012

Two things: Mars and Mathematics

An interlude to my hiatus of posting...trying to write a paper based on the earlier DNT article...

There are a couple of very nice things I want to post here, mainly for future reference and because, well, just because :-)

The first is a panorama of Mars taken by Curiosity found via the Bad Astronomy blog (regular and compulsory reading): what would it look like if you could stand (where Curiosity is) on Mars...


As explained by Phil Plait, these pictures were stitched together by Denny Bauer from a series of pictures from Curiosity's MastCam - amazing work.

This I rate in the same category as the Titan surface picture taken by Huygens, and talking of Huygens I found (via BadAstronomy) a link to a posting about the surface of Titan being somewhat like wet sand, which then led to an article about Huygens' landing and then to an ESA page which details the landing with a video reconstruction. I know that Curiosity's landing was pretty spectacular and we have videos, but Huygens did it much, much further from home after a longer journey onto a moon that was a complete mystery - isn't science amazing!

Then there is a posting on the n-Category Cafe about set theory and order theory and the dependence of one on the other: The Curious Dependence of Set Theory on Order Theory (Tom Leinster). The posting poses the question:

Is it strange that results about sets should depend on results about order?

and offers two answers: yes and no ....

Rather than go into the mathematics, the discussion of the point is fascinating from two angles: firstly, isn't it amazing how results in one area of mathematics appear to be inextricably linked with results in seemingly unrelated areas - I guess elliptic curves and modular forms via the Taniyama-Shimura conjecture is a good example. Secondly, the discussion opens up quite a debate on the philosophy of mathematics and ends up discussing computer programming, data structures and the Axiom of Choice.

Then there's Einstein's letter on religion which is being auctioned on eBay in which he detailed his views and more importantly his understanding of religion; this is a quote from a 1930's essay by Einstein:
"To sense that behind anything that can be experienced there is something that our minds cannot grasp, whose beauty and sublimity reaches us only indirectly: this is religiousness. In this sense, and in this sense only, I am a devoutly religious man."
Quite profound and I'll finish with an xkcd cartoon:


 :-)


Friday 12 October 2012

DNT Story Update

Just a quick post relating to the previous entry about a possible dystopian DNT-compliant future...the inspiration for the short story was Stephen Baxter's Glass Earth Inc., itself a short story which appeared in the Nokia sponsored collection Future Histories edited by Stephen McClelland.

I have no idea if this book is still available but there do appear to be copies on Amazon and I'm sure Nokia employees from the time probably have a few copies around somewhere. It makes very interesting reading especially given the developments over the past 15 years since its publication.

Stephen McClelland (1997) Future Histories. Horizon House (in conjunction with Nokia). ISBN: 0-9530648-0-8

Sunday 7 October 2012

A Post-"Do Not Track" Dystopia?

I remember reading a short story published in a collection of futures that became possible or were enabled by the rise of mobile technology. The collection was entitled "Future Histories", edited by Stephen McClelland (link to Amazon), and was produced in part by Nokia.

UPDATE 12 October 2012: The short story is called Glass Earth Inc., by Stephen Baxter appearing in the above collection. See new posting.

NOTE: I have the book, but not to hand - probably in my archive (or in work) so please forgive the missing reference to the particular story in the book - I will correct this as soon as I lay my hands on it.

The story went something like this: in the future advertisers became so omni-powerful that everyone had a quota of advertisements they each had to read or view every day. Interestingly the story had a side plot where McDonalds built a giant golden M across the River Thames, presumably next to the then less famous Tower Bridge - something that everyone vehemently protested against until it was agreed that its presence would reduce everyone's daily advertisement quota by a certain amount. I guess that's what you'd call a value proposition...

"Do Not Track" [1] is a proposed W3C standard for adding a header to the ubiquitous HTTP protocol that would instruct servers (specifically first and third parties in the DNT Combined Proposal [2]). While there are many arguments for and against, DNT represents an interesting foray into providing the user with more control over how they are tracked when interacting with internet based services.

Let's for a moment imagine that DNT becomes a standard and browsers and other software implement the functionality, and let's also say that compliance becomes enforced by law: what unintended consequences could follow?

Since the World Privacy Laws of 2013 all browsing was anonymised, even to the point that detailed semantic analysis to reveal the user was no longer possible. Indeed this had triggered some of the deepest research and insights into semiotics, semantics and information theory and their application to everyday life - as revolutionary as the original World Wide Web. From the perspective of advertising, today's web was quite unlike the spam-filled, intrusive and unstructured advertising mess that so amply characterised the first fifteen or so years of the 2000s.

Initially there was a backlash amongst the advertisers and near war between them, the privacy evangelists and the technology providers. The outcry and the resultant, hastily passed laws - starting in the EU and Canada and (surprisingly) rapidly spreading to the USA - enforced anti-tracking compliance. By mid-2014 most advertisers had given up, and the once mighty Google and Facebook were struggling to find a new business model.

This new anonymity proved to be something of a new freedom for users but left much of the commercial side of the internet stagnating. Out of this emerged a compromise: a centralised advertising proxy run by a newly formed company with much experience in this area - GoogleBook - which would guarantee anonymity from the producers and advertisers at the expense of each individual being required not just to view a certain number of advertisements but to interact with them to ensure that each advertisement had actually been read. A person's quota would become the new currency of the internet, based upon your social network, your willingness to promote products and ultimately your purchases.

Because of the necessity for personal anonymity, the specific details of the mechanisms of how this worked were somewhat confidential. That didn't seem to overly concern users, nor the privacy advocates, nor the advertisers - everyone got their share - for privacy's sake...

* * *
 

John sat down at his office computer that morning. London was never easy in the mornings, but a 45-minute trip on the tube gained him 45 advertisement credits - a bargain.

That left 200 advertisement credits for the day, including the deduction for the London-McDonalds Bonus. The arch across the Thames was hideous, but 100 credits deducted from the normal daily tally was worth it. Some even said that next year's proposed Coca-Cola branding of Tower Bridge might bring another 100-credit deduction!

It varied, but 200 credits usually meant an hour or so of viewing and interacting with advertisements. Hell, he might just have to make that purchase of a computing device from AppleAmazon Corp: a great offer this month tempted him with its bundled 2000 advertisement credits that would buy him almost a whole day without having to go through this daily routine. Funny how one remembers the days when people complained about Amazon's tablets with targeted advertising on the screen saver...oh those halcyon days of 2012...

It was an inevitable part of the deal...an anonymous internet in exchange for forced advertisement consumption via some centralised proxy, or whatever they were - part search engine, part advertiser, part social network.

That always bothered him to a point - most didn't really care - but they always seemed to know what advertisements to show him...a little too good given that the rest of the internet was anonymous; then again they were the only provider of advertisements now. Maybe that's why he was here, despite his job seeming almost futile now.

The computer played the advertisements and almost subconsciously he clicked each strategically to demonstrate that he had sufficiently read the contents - it took him a while to acquire the skill to do that well enough to fool the system but once gained it freed him to perform some degree of multitasking.

A new breed of advertisements was coming - multiply cross-referenced adverts that demanded your understanding too.

He used a pen and paper, his little eccentricity...he toyed with writing the line: "in a hole in the ground lived...", instead he penned the title:


Globally Targeted Advertisement Tracking Preference Expression (DNT)
W3C Working Draft 07 October 2018

References

[1] Tracking Preference Expression (DNT). W3C Working Draft 2 October 2012, Eds: Roy Fielding, David Singer
[2] Do Not Track - Combined Proposal, Eds: Aleecia McDonald, 12 June 2012


Sunday 30 September 2012

Grothendieck Biography

Mathematics is full of "characters": Grigori Perelman, Pierre de Fermat, Évariste Galois, Paul Erdős, Andrew Wiles** to name just a few, each with their own unique, wondrous story about their dedication to their mathematical work and life.

Perhaps none exemplifies the mathematician more than Alexander Grothendieck, who since 1991 has lived as a recluse in Andorra. However, since his body of work and his contribution to mathematics, particularly category theory and topology, have been almost legendary, a great mystery surrounds him.

In order to understand Grothendieck, and possibly the mind of the mathematician, a series of biographies of Grothendieck is being written by Leila Schneps. The current drafts and extracts can be found on her pages about this work.

I'll quote a paragraph from Chapter 1 of Volume II, which gives a flavour of Grothendieck's work and approach to mathematics:

Taken altogether, Grothendieck’s body of work is perceived as an immense tour de force, an accomplishment of gigantic scope, and also extremely difficult both as research and for the reader, due to the effort necessary to come to a familiar understanding of the highly abstract objects or points of view that he systematically adopts as generalizations of the classical ones. All agree that the thousands of pages of his writings and those of his school, and the dozens and hundreds of new results and new proofs of old results stand as a testimony to the formidable nature of the task.

This is truly a work at a scale of magnitude more detailed than, say, Simon Singh's fascinating documentation of Wiles' work and proof of Fermat's Last Theorem. Suffice to say, I look forward to reading it. Maybe Simon Singh should make a documentary about Grothendieck?

** I credit Andrew Wiles with inspiring me to study for my PhD back in 1995

Saturday 29 September 2012

Solar Flare Video

Found this via Phil Plait's amazing Bad Astronomy blog*:

On August 31st the Sun produced an immense solar flare: click here for a picture from NASA with the Earth to scale or, better still, just go straight to the 1900x1200 version. I've made a crop of the picture below just to give a teaser:**



Now NASA and the Goddard Space Flight Center have released a video of the event:


Make it full-screen and switch to 1080p, sit back and be impressed....


* You really should read this blog every day
** Using NASA Imagery and Linking to NASA Web Sites

Thursday 27 September 2012

Teaching Privacy

It often surprises me that many of the people advocating privacy don't actually understand the things that they're trying to keep private, specifically information. Indeed the terms data and information are used interchangeably and there is often little understanding of the actual nature and semantics of said data and information.

I've run courses on data modelling, formal methods, systems design, semantics and now privacy - the latter however always seems to be "a taster of privacy" or a "brief introduction to privacy", and there is rarely the chance to get into specifics about what information is.

This of course has some serious implications, and one of the best examples I can find is when we talk about anonymisation. I've seen horrors such as the statements "if you hash this identifier, then it is anonymous" or "if we randomise this data then we can't track" or lately, "if we set this flag to '1' then no-one will track you anymore". In the first case I refer people back to the AOL Data Leak and the dangers of fingerprinting, semantic analysis and simple cross-referencing.
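That first horror is easy to demonstrate: when the identifier space is small and enumerable (phone numbers, IMEIs, usernames), an unsalted hash can simply be reversed by exhaustion. A sketch with an invented mobile number:

# Sketch: an "anonymised" hashed identifier recovered by enumerating a small identifier space.
# The phone number and number range are invented for illustration.
import hashlib

def h(identifier):
    return hashlib.sha256(identifier.encode()).hexdigest()

leaked_hash = h("+358401234567")   # what the "anonymised" data set actually contains

# An attacker just hashes every plausible number and compares.
for n in range(1200000, 1300000):
    candidate = f"+35840{n}"
    if h(candidate) == leaked_hash:
        print("re-identified:", candidate)
        break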

I made a study a while back based on the leak of 16,000 names from various Finnish education organisations (plus maybe other places). It was very interesting to see, even though the released list contained only dates of birth and last names, how many entries were already unique, and even in the cases of common Finnish names how easy it was to trace these back to a unique person. Actually going to the next step and verifying this with that person would, I guess, have been somewhat illegal or, if not, unethical to say the least. Social engineering would have been very easy in many of these cases, I'm sure.

So given cases like these and the current dearth of educational material I thought it would be nice to try to put together a more comprehensive and deeper set of material. Some universities are already doing this and there also exist industrial qualifications such as those by the IAPP, however at this stage all ideas are welcome.

Now I want to specifically address a technical audience: software engineers, computer scientists - the people who end up building these systems - because that's where I feel much breaks down, for many reasons, but I won't apportion blame here; that's not really constructive in the current context.

First of all I want to break things down into three logical segments (actually there are four, but I'll discuss the fourth later):
  • Legal
  • Consumer Advocacy
  • Technical
 and address each area individually.

Legal is relatively straightforward: an understanding of the principles of privacy; how various jurisdictions view data, information, anonymisation, cross-referencing, children and minors, cross-border data transfer, retention and data collection; and a discussion of practices in certain regions, eg: the EU, US, China, India etc. This discussion doesn't have to be heavy, but an understanding of what the law states and how it interprets things is critical. Also from here we should get an understanding of how the law affects the engineering side of things: common terminology being a good example.

Consumer advocacy is really the overview material in my opinion - what the principles of privacy are, Cavoukian's Privacy by Design for example (even if I'm not happy with its implementation), how consumers view privacy, what the reality is (say vs do), and also various case studies such as how consumers view Google, Apple, Nokia, Facebook, various governments, technologies such as NFC, mobile devices, 'Smart Televisions', direct marketing and advertising, store cards etc. Out of this comes an understanding of how privacy is viewed and even an appreciation of why we don't get privacy: anti-privacy if you like.

The technical aspect takes in many technologies; rather than describe them, I'll list them (non-exhaustively and in no particular order):
  • Basic Security - Web, Encryption, Hashing, Hacking (XSS etc), authentication (OpenID, OAuth etc), differences/commonalities between privacy and security, mapping privacy problems into security problems as a solution
  • Databases - technologies, design, schema development (eg: relational theory), "schema-less" databases, cross-referencing, semantic isolation
  • Semantics - ontologies, classifications, aspect, Semantic Web
  • Data-flow
  • Distributed Systems - networking and infrastructure
  • API design - browsers, apps, web-interfaces, REST
  • Data Collection - primary vs secondary vs infrastructure, logging
  • Policy - policy languages, logic, rules, data filtering
  • Anonymisation - data cleansing
  • Identifiers - tracking, "Do Not Track"
  • User-Interface
  • Metrics for privacy - entropy
  • Information Types and Classification - location, personally identifiable information, identifiers, PCI, health/medical data
As you can see the list is extensive, and an understanding of each of these areas is critical to building systems that honour and preserve privacy in its various forms (as described in the consumer advocacy and legal sections). The main point here is to provide software engineers and computer scientists with the tools to implement privacy in a meaningful manner.

Now that we have outlined the three areas we can look at the fourth which binds these together and which I tentatively call "Theory of Privacy".

Obviously something binds these areas together, and there does exist a huge body of work on the nature of information and its classifications. I particularly like the approach by Barwise and Seligman in the 1997 book Information Flow: The Logic of Distributed Systems*. I believe we can quite easily get into all sorts of interesting ontology, semantics and even semiotic discussions. Shannon's Information Theory and notions of entropy (eg: Volkenstein's book Entropy and Information) are fundamental to many things. I think this really is an area that needs to be opened up and addressed seriously, and anything that binds these together and provides a common language to unify consumer advocacy, the law and software engineering is critical.

Finally, no outline of a course would be complete without some preliminary requirements and a book list. For the former, an understanding of computer systems and basic computer security is a must (there is no privacy without security), as is a grounding in software engineering techniques and a dose of computer science. For the books, my first draft list would include:
  • Barwise, Seligman. Information Flow
  • O'Hara, Shadbolt. The Spy in the Coffee Machine: The End of Privacy as We Know It
  • Solove. Understanding Privacy
  • Nissenbaum. Privacy in Context: Technology, Policy, and the Integrity of Social Life
  • Solove: The Future of Reputation: Gossip, Rumour, and Privacy on the Internet

*somebody should make a movie of this.

Monday 10 September 2012

Explaining Primary and Secondary Data

One of the confusing aspects of privacy is the notion of whether something is primary or secondary data. These terms emerge from the context of data gathering and processing and are roughly defined thus:
  • Primary data is the data that is gathered for the purpose of providing a service, or, data about the user gathered directly
  • Secondary data is the data that is gathered from the provision of that service, ie: not by the user of that service, or, data about the application gathered directly
Pretty poor definitions admittedly and possibly overly broad given all the contexts in which these terms must be applied. In our case we wish to concentrate more on services (and/or applications) that we might find in some mobile device or internet service.

First we need to look more at the architectural context in which data is being gathered. At the highest level of abstraction applications run within some given infrastructure:


Aside: I'll use the term application exclusively here, though the term service or even application and service can be substituted.

Expanding this out more we can visualise the communication channels between the "client-side" and "server-side" of the application. We can further subdivide the infrastructure more, but let's leave it as an undivided whole.

In the above we see a single data flow between the client and server via the infrastructure (cf: OSI 7-layer model, and also Tanenbaum). It is this data-flow that we must dissect and examine to understand the primary and secondary classifications.

However the situation is complicated as we can additionally collect information via the infrastructure: this data is the behaviour of the infrastructure itself (in the context of the application). For example this data is collected via log files such as those found in /var/log on Unix/Linux systems, or the logs from some application hosting environment, eg: Tomcat etc. In this latter case we have indirect data gathering, and whether this falls under primary or secondary as defined above is unclear, though it can be thought of as secondary, if both our previous definitions of primary and secondary can be coerced into a broader "primary" category. (If you're confused, think of the lawyers...)

Let's run through an example: an application which collects your location and friends' names over time. So as you're walking along, when you meet a friend you can type their name into the app* and it records the time and location and stores this in the cloud (formerly known as a "centralised" database). Later you can view who you met, where and at what time in all sorts of interesting ways, such as on a map. You can even share this to Facebook, Twitter or one of the many other social networking sites (are there others?).


Aside: Made with the excellent Balsamiq UI mock-up software.

The application stores the following data:

{ userId, friendsName, time, gpscoordinates }

where userId is some identifier that you use to login to the application and later retrieve your data. At some point in time we might have the following data in our database:

joeBloggs, Jane, 2012-09-10, 12:15, 60°10′19″N 24°56′29″E

joeBloggs, Jack, 2012-09-10, 12:18,  60°10′24″N 24°56′32″E
jane123, Funny Joe, 2012-09-10, 12:18, 60°10′20″N 24°56′21″E

This set of data is primary - it is required for the functioning of the application and is all directly provided by the user.

By sending this data we cannot avoid using whatever infrastructure is in place. Let's say there's some nice RESTful interface somewhere (hey, you could code this in Opa!) and by accessing that interface the service gathers information about the transaction, which might be stored in a log file and look something like this:

192.178.212.143, joeBloggs, post, 2012-09-10, 12:14:34.2342
64.172.211.10, arbunkleJr, post, 2012-09-10, 12:16:35.1234
192.178.212.143, joeBloggs, get, 2012-09-10, 12:16:37.0012
126.14.15.16, janeDoe, post, 2012-09-10, 12:17:22.0506

This data is indirectly gathered and contains information that is relevant to the running of infrastructure.

The two sets of data above are generally covered by the terms and conditions of using that application or service. These T&C's might include a privacy policy explicitly, or additionally have a separate privacy policy to cover disclosure and use of the information. Typical uses would cover authority requests, monitoring for misuse, monitoring of infrastructure etc. The consents might also include use of data for marketing and other purposes, for which you will (or should) have an opt-out. The scope of any marketing request can vary but might include possibilities of identification and maybe some forms of anonymisation.

Note, if the service provides a method for you to share via Facebook or Twitter then this is an act you make and the provider of the service is not really responsible for you disclosing your own information publicly.

So that should explain a little about what is directly gathered, primary information and indirectly gathered information. Let's now continue to the meaning of secondary data.

When the application is started, closed or used we can gather information about this. This kind of data is called secondary because it is not directly related to the primary purpose of the application nor to the functioning of the infrastructure. Consent to collect such information needs to be asked for, and good privacy practice suggests that this collection should be disabled by default. Some applications or services might anonymise the data in the opt-out situation (!). Secondary data collection is often presented as an offer to help with improving the quality of the application or service. The amount of information gathered varies dramatically, but generally application start, stop and abnormal exit (crashes) are gathered, as well as major changes in the functionality, eg: moving between pages or different features. In the extreme we might even obtain a click-by-click data stream including x,y-coördinates, device characteristics and even locations from a GPS.

Let's say our app gathers the following:

192.178.212.143, joeBloggs, appStart, 2012-09-10, 12:14:22.0001, WP7, Onkia Luna 700, 2Gb RAM, 75%free, 14 processes running, started from main screen, Elisa network 3G, OS version 1.2.3.4, App version 1.1
192.178.212.143, joeBloggs, dataEntryScreen, 2012-09-10, 12:14:25.0001
192.178.212.143, joeBloggs, gotGPSlocation, 2012-09-10, 12:14:26.0001, 50m accuracy, 3G positioning on, 60°10′19″N 24°56′29″E
192.178.212.143, joeBloggs, dataPosted, 2012-09-10, 12:14:33.2342, 3G data transfer, 2498 bytes
192.178.212.143, joeBloggs, mapViewScreen, 2012-09-10, 12:15:33.2342
192.178.212.143, joeBloggs, dataRequestAsList, 2012-09-10, 12:16:23.1001

What we can learn from this is how the application is behaving on the device and how the user is actually using that application. From the above we can find out what the status of the device was, the operating system version, type of device, whether the app started correctly in that configuration, from where the user started the app, which screen the app started up in, the accuracy and method of GPS positioning and so on.

So far there is nothing sinister about this: some data is required for the operation of the application and stored "in the cloud" for convenience, some data is collected by the infrastructure as part of its necessary operations and some data we voluntarily give up to help the poor application writers improve their products. And we (the user) consented to all of this.

From a privacy perspective these are all valid uses of data.

Now the problems start in three cases:
  • exporting to 3rd parties
  • cross-referencing
  • "anonymisation"

The above data is fantastic for marketing - a trace of your location over time plus some ideas about your social networking (even if we can't directly identify who "Jane" and "Jack" are .... yet!) provides great information for targeted advertising. If you're wondering, the above coördinates are for Helsinki Central Railway Station...plenty of shops and services around there that would like your attention and custom.

How the data is exported to the 3rd party and at what level of granularity is critical for trust in the service. Abstracting the GPS coordinates by mapping them to a city area or broader, plus removing personally identifiable information, helps (in this case we remove the userID...hashing may not be enough!). The amount of data minimisation here is critical, especially if we want to reduce the amount of tracking that 3rd parties can do. In the above example probably just sending the location and retrieving an advertisement back is enough, especially if it is handled server-side so even the client device address is hidden.
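
As a rough illustration of what that minimisation might look like - a minimal sketch only, where the field names, the rounding level and the example values are all my own assumptions rather than anything from a real service - consider stripping an event down before it ever leaves our side:

# Minimal sketch: minimise a secondary-data event before exporting it to a
# third party. Field names, values and the rounding level are illustrative
# assumptions, not taken from any real service.

def minimise_for_export(event):
    return {
        # Two decimal places is roughly a one-kilometre grid cell.
        "lat": round(event["lat"], 2),
        "lon": round(event["lon"], 2),
        # Date only - drop the time of day.
        "date": event["timestamp"][:10],
        # Deliberately omitted: userID, IP address, device details, exact GPS fix.
    }

event = {
    "userID": "joeBloggs",
    "ip": "192.178.212.143",
    "lat": 60.1719,
    "lon": 24.9414,        # Helsinki Central Railway Station
    "timestamp": "2012-09-10T12:14:26.0001",
}

print(minimise_for_export(event))   # {'lat': 60.17, 'lon': 24.94, 'date': '2012-09-10'}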

Cross-referencing is the really interesting case here. Given the above data-sets, can we deduce "Joe's" friends? Taking the infrastructure log file entries:

joeBloggs, Jane, 2012-09-10, 12:15, 60°10′19″N 24°56′29″E
jane123, Funny Joe, 2012-09-10, 12:18, 60°10′20″N 24°56′21″E

and cross-referencing these with the secondary data:

192.178.212.143, joeBloggs, dataEntryScreen, 2012-09-10, 12:14:25.0001
192.178.212.143, joeBloggs, gotGPSlocation, 2012-09-10, 12:14:26.0001, 50m accuracy, 3G positioning on, 60°10′19″N 24°56′29″E
192.178.212.143, joeBloggs, dataPosted, 2012-09-10, 12:14:33.2342, 3G data transfer, 2498 bytes

we can deduce that user joeBloggs was in the vicinity of user jane123 at 12h15-12h18. Furthermore, looking at the primary data:

joeBloggs, Jane, 2012-09-10, 12:15, 60°10′19″N 24°56′29″E
jane123, Funny Joe, 2012-09-10, 12:18, 60°10′20″N 24°56′21″E

we can see that joeBloggs mentioned a "Jane" and jane123 mentioned a "Funny Joe" at those times. Now we might be very wrong in the next assumption, but even when we only have the string of characters "Jane" as an identifier, I think it is reasonably safe to make a reasoned guess that Jane is jane123. Actually, the four (ASCII) characters that just happen to spell "Jane" aren't even required, though they do help the semantic matching.
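
To make that deduction concrete, here is a minimal sketch in Python of the join itself, with the log entries above hard-coded and with entirely arbitrary "same time, same place" thresholds of my own choosing:

from datetime import datetime
from math import hypot

# Primary data (hard-coded from the listing above): userID, name mentioned, time, lat, lon
primary = [
    ("joeBloggs", "Jane",      datetime(2012, 9, 10, 12, 15), 60.1719, 24.9414),
    ("jane123",   "Funny Joe", datetime(2012, 9, 10, 12, 18), 60.1722, 24.9392),
]

# Secondary data: userID, event, time, lat, lon (the gotGPSlocation entry)
secondary = [
    ("joeBloggs", "gotGPSlocation", datetime(2012, 9, 10, 12, 14, 26), 60.1719, 24.9414),
]

# Entirely arbitrary thresholds for "same time, same place"
MAX_SECONDS = 5 * 60        # within five minutes
MAX_DEGREES = 0.005         # a few hundred metres at this latitude

for p_user, mentioned, p_time, p_lat, p_lon in primary:
    for s_user, event, s_time, s_lat, s_lon in secondary:
        close_in_time  = abs((p_time - s_time).total_seconds()) <= MAX_SECONDS
        close_in_space = hypot(p_lat - s_lat, p_lon - s_lon) <= MAX_DEGREES
        if close_in_time and close_in_space and p_user != s_user:
            print(f"{s_user} was near {p_user}, who mentioned '{mentioned}'")
            # The semantic step - guessing that "Funny Joe" refers to joeBloggs
            # and that jane123 is Joe's "Jane" - is then a small leap.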

This kind of matching and cross-referencing is exactly what happened in the AOL Search Data Leak incident, which neatly takes me to anonymisation: just because some identifier is obscured doesn't mean that the information doesn't exist.

This we often see with hashing of identifiers. For example, our app designer has been reading about privacy by design and has obscured the identifiers in the secondary data using a suitably random, salted hash of sufficient length to be unbreakable for the next few universes - and we've hashed and salted the IP address too!

00974ca1582cc3fc23164f93a78c647059d4c3bb170592d1385a1f777f18491f, d4c53838904f4405893b9ea134c747a2b2e7a2e9341084387285ba5999ad894f, appStart, 2012-09-10, 12:14:22.0001, WP7, Onkia Luna 700, 2Gb RAM, 75%free, 14 processes running, started from main screen, Elisa network 3G, OS version 1.2.3.4, App version 1.1
00974ca1582cc3fc23164f93a78c647059d4c3bb170592d1385a1f777f18491f, d4c53838904f4405893b9ea134c747a2b2e7a2e9341084387285ba5999ad894f, dataEntryScreen, 2012-09-10, 12:14:25.0001
00974ca1582cc3fc23164f93a78c647059d4c3bb170592d1385a1f777f18491f, d4c53838904f4405893b9ea134c747a2b2e7a2e9341084387285ba5999ad894f, gotGPSlocation, 2012-09-10, 12:14:26.0001, 50m accuracy, 3G positioning on, 60°10′19″N 24°56′29″E
00974ca1582cc3fc23164f93a78c647059d4c3bb170592d1385a1f777f18491f, d4c53838904f4405893b9ea134c747a2b2e7a2e9341084387285ba5999ad894f, dataPosted, 2012-09-10, 12:14:33.2342, 3G data transfer, 2498 bytes
00974ca1582cc3fc23164f93a78c647059d4c3bb170592d1385a1f777f18491f, d4c53838904f4405893b9ea134c747a2b2e7a2e9341084387285ba5999ad894f, mapViewScreen, 2012-09-10, 12:15:33.2342
00974ca1582cc3fc23164f93a78c647059d4c3bb170592d1385a1f777f18491f, d4c53838904f4405893b9ea134c747a2b2e7a2e9341084387285ba5999ad894f, dataRequestAsList, 2012-09-10, 12:16:23.1001

Aside: Here's a handy on-line hash calculator from tools4noobs.
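
For what it's worth, here is a minimal sketch in Python of how such pseudonymisation might be done (the salt is made up and the resulting hex strings will of course not match the ones in the listing above):

import hashlib

# A fixed, made-up salt purely for illustration. In a real deployment the salt
# would be kept secret and the construction deliberately slow (many iterations).
SALT = b"an-entirely-made-up-salt"

def pseudonymise(value: str) -> str:
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()

print(pseudonymise("192.178.212.143"))  # consistent stand-in for the IP address
print(pseudonymise("joeBloggs"))        # consistent stand-in for the user identifier

Note that the space of IPv4 addresses (and of plausible user names) is small, so if the salt leaks or is guessable these hashes can simply be brute-forced - one reason why hashing alone may not be enough.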

We still have a consistent hash for the IP address and user identifier, so we can continue to track, albeit without being able to recover who made the request or where it came from. Note however the content of the first entry:

WP7, Onkia Luna 700, 2Gb RAM, 75%free, 14 processes running, started from main screen, Elisa network 3G, OS version 1.2.3.4, App version 1.1

How many Onkia Luna 700 2Gb owners running v1.2.3.4 of WP7 with version 1.1 of our application are there? Take a look at Panopticlick's browser testing to see how unique you are based on web-browser characteristics.
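
As a back-of-the-envelope sketch of why that one line is so identifying - the population fractions below are pure guesses, only there to show the arithmetic and the (naive) independence assumption:

from math import log2

# Entirely made-up fractions of the user base sharing each attribute,
# and a naive assumption that the attributes are independent.
attributes = {
    "WP7, OS version 1.2.3.4": 0.05,
    "Onkia Luna 700, 2Gb RAM": 0.02,
    "Elisa network 3G":        0.10,
    "App version 1.1":         0.30,
}

p = 1.0
for name, fraction in attributes.items():
    p *= fraction
    print(f"{name}: {log2(1 / fraction):.1f} bits of identifying information")

print(f"combined: {log2(1 / p):.1f} bits, i.e. roughly one user in {1 / p:,.0f}")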

And then there are timestamps...let's go back to cross-referencing against our primary data and infrastructure log files, and we can be pretty sure that we can reconstruct who that user is.

We could add in additional randomness by regenerating identifiers (or the hash salt, if you like, in some cases) for every session; this way we could only track over a particular period of usage.
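
A minimal sketch of that idea, again with made-up values: derive the pseudonym from the user identifier plus a fresh per-session salt, so records link within a session but not across sessions:

import hashlib, os

def session_pseudonym(user_id: str, session_salt: bytes) -> str:
    return hashlib.sha256(session_salt + user_id.encode("utf-8")).hexdigest()

salt_session_1 = os.urandom(16)   # generated when the session starts
salt_session_2 = os.urandom(16)   # next session, a fresh salt

print(session_pseudonym("joeBloggs", salt_session_1))
print(session_pseudonym("joeBloggs", salt_session_1))  # same session -> same pseudonym
print(session_pseudonym("joeBloggs", salt_session_2))  # new session  -> unlinkable pseudonym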

So in conclusion we have presented what is meant by primary and secondary data, stated the difference between directly gathered data and indirectly gathered data, and explained some of the issues relating to the usage of this data.

Now there are some additional cases such as certain kinds of reporting, for example, music player usage and DRM cases which don't always fall easily into the above categories unless we define some sub-categories to handle this. Maybe more of that later.


*apps are considered harmful






Tuesday 28 August 2012

Opa Tutorial - Intermission

Cedric Soulas of MLState - the creators of Opa - has very kindly put up some suggestions on writing code based on my tutorials. They're available via GitHub at https://github.com/cedricss/ian-oliver-tutorials . My code resides over on SourceForge.

While these are based on part 4 of the series they are applicable to later parts and indeed I will integrate these in due course.

One thing I do want to point out now is the use of a parser instead of an explicit pattern match dispatching URLs to specific functions. Very briefly, the code which reads:

function start(url)
{
  match (url)
  {
    case {path: [] ...}: hello();
    case {path: ["expressions"] ...}: expressionsRESTendpoint();
    case {path: ["expressions" | [key | _]] ...}: expressionWithKeyRESTendpoint(key);
    case {~path ...}: error();
  }
}

Server.start(
  Server.http,
  [ {resources: @static_include_directory("resources")}, {dispatch: start} ]
);


can be rewritten using a parser:

start = parser {
  case "/": hello();
  case "/expressions": expressionsRESTendpoint();
  case "/expressions/" key=((!"/".)*) (.*): expressionWithKeyRESTendpoint(Text.to_string(key));
  default: error();
}

Server.start(
  Server.http,
  [ {resources: @static_include_directory("resources")}, {custom: start} ]
);


This performs more or less the same function, with the bonus that we obtain much more flexibility in how we process the URL - rather than treating it statically in the pattern matching. Note how we're no longer calling an explicit function but now have a parser as a first-class object :-)


At this time there's relatively little documentation on the parser in Opa, but on the Opa Blog there's a tantalizing glimpse of what is possible.


So with that to whet your appetites, go and download the latest nightly build and start coding...while I finalise composing part 5...

Tuesday 21 August 2012

Semantic Isolation and Privacy

I got somewhat side-tracked in writing these (pt1, pt2, pt2.5) and in thinking about how best to explain some of the issues, especially when getting to the deeper semantic levels. However, a work discussion about Apple and Amazon's security flaws and the case of Mat Honan provided an interesting answer which I think describes the problem quite well.

In the above incident hackers used information from Amazon to, primarily, social engineer Apple's customer service into believing that the hackers were Mat Honan. From the Wired article (linked above and [1]), Honan provides the quote:

But what happened to me exposes vital security flaws in several customer service systems, most notably Apple’s and Amazon’s. Apple tech support gave the hackers access to my iCloud account. Amazon tech support gave them the ability to see a piece of information — a partial credit card number — that Apple used to release information. In short, the very four digits that Amazon considers unimportant enough to display in the clear on the web are precisely the same ones that Apple considers secure enough to perform identity verification. The disconnect exposes flaws in data management policies endemic to the entire technology industry, and points to a looming nightmare as we enter the era of cloud computing and connected devices.

Both accounts at Apple and Amazon have a user identifier and passwords; they also have a set of other criteria to establish whether a given human is who they say they are. In Amazon's case they ask for email, address and the titles of some books you've bought from them - at least last time I called. In Apple's case it looks like they wanted some more personal information, in this case the "least significant digits" of a credit card number. I say least significant because these particular digits are often printed in plain text on receipts. As far as Visa, MasterCard, Diners etc are concerned, these digits have no meaning - though I have an issue with that, as we shall see.

In Amazon's context the data about a user is semantically isolated from Apple's context. This goes a level deeper than saying that they both had user identifiers: it is about which Real World concept and instance those identifiers meant and represented. The trouble here started with the realisation that the instance the Amazon identifier related to could be the same as the thing that the Apple identifier related to, in this case the Real World Mat Honan. To make this complete, it also turns out that the meaningless four least significant credit card digits in Amazon's context were the proof of identity in Apple's context.

We can argue that the data and identity management procedures in both cases were at fault; however, in analysis this was actually hard to see: how could four "random" digits effectively uniquely identify a person, and without an understanding of each other's semantic view of the world, who would have realised?

The whole hack described in Mat Honan's article goes into a lot more detail on how this information was found out. Indeed much of the information required is already public and it is just a case of putting all this together and effectively using that to make a consistent profile.

As for credit card numbers, the practice of displaying the final four digits or the "least significant digits" in certain semantic contexts is called PAN truncation. However, as the whole number has a well-defined structure (ISO/IEC 7812), check-summing and only a limited number of options for the rest of the digits, it becomes feasible to reconstruct much of the number anyway, especially as some receipts also print the card type - at least enough to sound convincing in a social engineering situation if necessary. Furthermore, as described in the article, faking credit card numbers - because of their structure - actually becomes a method of generating data to prove identity in some cases. In summary, there are no "random digits" or "least significant digits" in a data structure with particular meanings associated with each part of that structure.
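
To illustrate just how constrained those "random" digits are, here is a small sketch using the Luhn check digit that ISO/IEC 7812 numbers carry; the issuer prefix and the last four digits below are made up:

def luhn_valid(pan: str) -> bool:
    total = 0
    # Walk the digits from the right; double every second one, subtracting 9
    # if the doubled value exceeds 9 (the standard Luhn check).
    for i, ch in enumerate(reversed(pan)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# A made-up 16-digit card number: known 6-digit issuer prefix (BIN),
# known last four digits, six unknown digits in the middle.
prefix, last4 = "453201", "1234"
candidates = (f"{prefix}{middle:06d}{last4}" for middle in range(10**6))
valid = sum(1 for pan in candidates if luhn_valid(pan))
print(f"{valid} of 1,000,000 candidates pass the check")
# Exactly one in ten survive - the structure alone prunes the search space
# before any knowledge of card type, issuer ranges or expiry dates is applied.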

The situation gets worse when more information can be provided for the social engineering exercise: for example, in Finland it used to be common, before chip-and-pin terminals, for a shop cashier to ask for identity, where the customer would show a "valid identity document" (this varied by cashier, shop and day-to-day in some cases) and certain details would be written down: usually the last four (least significant, apparently) digits of a Finnish social security number or a whole passport number, plus other varied details depending upon the shop and phase of the moon etc.


References

[1] Mat Honan. 2012. How Apple and Amazon Security Flaws Led to My Epic Hacking. Wired, 6 August 2012.

Sunday 19 August 2012

Opa Language Tutorial: Part 4

Following on from part 3, we now quickly finish everything by handling all the GET, POST, PUT and DELETE verbs for our two cases:

GET
  /expressions: return a list of expression identifiers, even if empty, with a 200 success code.
  /expressions/"k": return the full details of expression "k"; if "k" doesn't exist then return a 404 error.

POST
  /expressions: add the requested object if the supplied key in the object doesn't exist; return a 201 success code as well as the key "k", otherwise return a 400 error.
  /expressions/"k": not allowed; return a 400 error.

PUT
  /expressions: not allowed; return a 400 error.
  /expressions/"k": modify the database with the given object if the supplied key both exists in the database and matches the key in the supplied object, returning a 200 success code. In all other circumstances return a 400 error.

DELETE
  /expressions: not allowed; return a 400 error.
  /expressions/"k": delete the database entry with the given key if it exists in the database and return a 200 success code. In all other circumstances return a 400 error.

Let's go through each of the cases above in turn with minimal discussion of the code. I'm also going to tidy up the code a little, correcting the HTTP responses, making sure we're calling the correct messageSuccess and messageError functions accordingly.

For the case where we deal with URLs of the form http://xxx/expressions, ie: without any key, we make the match in the second case statement in the start function:

function start(url){
  match (url) {
    case {path: [] ...}: hello();
    case {path: ["expressions"] ...}: expressionsRESTendpoint();
    case {path: ["expressions" | [key | _]] ...}: expressionWithKeyRESTendpoint(key);
    case {~path ...}: error();
  }
}

which detects URLs without a key associated and passes these off to the expressionsRESTendpoint function:

function expressionsRESTendpoint(){
  match(HttpRequest.get_method()) {
    case{some: method}:
      match(method) {
        case{get}:
          expressionsGet();
        case{post}:
          expressionsPost();
        case{put}:
          messageError("PUT method not allowed without a key",{bad_request});
        case{delete}:
          messageError("DELETE method not allowed without a key",{bad_request});
        default:
          messageError("Given REST Method not allowed with expressions",{bad_request});
      }
    default:
      messageError("Error in the HTTP request",{bad_request});
  }
}


which matches the GET and POST verbs, and then anything else with an HTTP 400 Bad Request, plus a natural language error message.

Because network programming suffers from the leaky abstraction problem we also need to catch the cases where we fail to get a method; in this case the default: of the outer match block handles this.

The functions expressionsPost and expressionsGet are as described earlier in part 3.

The case where a key is supplied with the URL is handled by the 3rd case statement in the start function, and control is passed to the expressionWithKeyRESTendpoint function, which operates in the same way as the "without key" case:

function expressionWithKeyRESTendpoint(key){
  match(HttpRequest.get_method()) {
    case{some: method}:
      match(method) {
        case{get}:
          expressionGetWithKey(key);
        case{post}:
          messageError("POST method not allowed with a key",{bad_request});
        case{put}:
          expressionPutWithKey(key);
        case{delete}:
          expressionDeleteWithKey(key);
        default:
          messageError("Given REST Method not allowed with expressions with keys",{bad_request});
      }
    default:
      messageError("Error in the HTTP request",{bad_request});
  }
}


The procedure for GET is relatively straightforward in that we check that a record with the given key exists and match the result accordingly:

function expressionGetWithKey(key){
  match(?/regexDB/expressions[key]) {
    case {none}:
      messageError("No entry with id {key} exists",{bad_request});
    case {some: r}:
      Resource.raw_response(
        OpaSerialize.serialize(r),
        "application/json",
        {success}
      );
  }
}


Deletion is also relatively straightforward:

function expressionDeleteWithKey(key){
  match(?/regexDB/expressions[key]) {
    case {none}:
      messageError("No entry with id {key} exists",{bad_request});
    case {some: r}:
      Db.remove(@/regexDB/expressions[key]);
      messageSuccess("{key} removed",{success});
  }
}


The expression: ?/regexDB/expressions[key]

is used to check existence, returning an option type which we then handle in the match.

To remove a record from a database we use the function Db.remove, which takes the record as a parameter. Note the use of the @ operator which returns a reference path to the record in the database. Opa's database functions are fairly comprehensive and are better explained in Opa's own documentation - specifically in this case section 14.7.

Now we get to the PUT case; to understand this properly we need to break this down:
  1. First we check if the request contains a body.
  2. If successful, we check that the supplied key exists
  3. If successful, we deserialise the body, which should be JSON
  4. If successful, we convert this to an Opa record and match it to the type regexExpression.
  5. If the key supplied in this object (exprID field) matches the key used in the URL then we simply replace the record in the database, in the same manner as with our earlier POST function.
Otherwise, there is nothing really special about this particular function, though we do use an if...else structure for the first time and this should be familiar already.

function expressionPutWithKey(key){
  match(HttpRequest.get_body()) {
    case {some: body}:
      match(?/regexDB/expressions[key]) {
        case {none}:
          messageError("No entry with id {key} exists",{bad_request});
        case {some: k}:
          match(Json.deserialize(body)) {
            case{some: jsonobject}:
              match(OpaSerialize.Json.unserialize_unsorted(jsonobject)) {
                case{some: regexExpression e}:
                  if (e.exprID == key) {
                    /regexDB/expressions[e.exprID] <- e;
                    messageSuccess("Expression with key {e.exprID} modified",{success});
                  }
                  else {
                    messageError("Attempt to update failed",{bad_request});
                  }
                default:
                  messageError("Missing or malformed fields",{bad_request});
              }
            default:
              messageError("No valid JSON in body of PUT",{bad_request});
          }
      }
    default:
      messageError("No body in PUT",{bad_request});
  }
}


...and with that we conclude the initial development of the application or service (depending on your point of view), having shown basic database interaction, simple processing of URLs, handling of JSON serialisation/deserialisation and handling of the major HTTP methods - GET, POST, PUT and DELETE - plus some simple error handling.

The code up to this point is available on SourceForge and the file you're looking for is tutorial4.opa.

Now for a little discussion...is this the best we can do, and why Opa? To answer the second question first, Opa gives us a level of abstraction in that we've not had to worry about many aspects of programming internet-based services that many other frameworks force us to deal with. In a later part we'll talk more about how Opa decides to distribute code between the client and server, plus how to use Opa's features for distributing code for scalability and fault-tolerance. So Opa is making a number of architectural decisions for us; in part this is embedded in the nature of Opa, being a compiled language rather than an interpreted one. Furthermore Opa is strongly typed, which means all of our typing errors (well, most) are already caught at compile time. This simplifies debugging and forces a more disciplined approach to programming; there is however a caveat (or two) to this last statement.

The code written here is fully functional, written in an agile manner (which may or may not be a good thing) and also written in a rigorous manner (refinement of specification, testing etc). What is wrong with the code is that it is profoundly ugly, and that comes from the style of development, which in this case has been based around developing a set of individual use cases (eight of them: 2 families of access and 4 verbs).

While use case based development provides us with a good deal of the information we need about how our system should behave and in what cases - and indeed the individual use cases compose quite nicely; our program works, doesn't it? - it does not result in an elegant, maintainable piece of software that performs well or satisfies our needs. For example, if we need to change how we access the database, or even reuse functionality, we end up reintroducing similar code again, often trying to find every instance of code that performs that functionality and modifying it consistently. Look how many times we've needed to check whether deserialisation needs to be performed, or notice how the patterns for the two major use case families (with keys and without keys) are broadly similar, yet we have repeated code.

What we're left with here is building up to a massive amount of technical debt - once we've added code to manage sets of expressions this becomes painfully obvious; I'm not going to do that in this series, as I want (need!) to rewrite this better now. Read the interviews with Ward Cunningham and Martin Fowler about this and you'll see why the code here isn't that elegant. In the next parts of this series I'll refactor the code, take more advantage of Opa's functional nature, and show how architecting our design properly and worrying about separation of concerns produce more maintainable code with much higher levels of reuse.