I recently contributed to a Wired UK article on privacy and healthcare. Here is my full response.

Should we worry about ‘Big Data’/AI/machine learning in healthcare?

In my opinion, the two main concerns relate to privacy and value. Privacy, fundamentally, is about the risk to individuals when information about them is shared; value is about the intrinsic worth of that data, together with the worth of any algorithms or software produced by analysing it.

It is difficult to explore the privacy concerns without clarifying what we mean by data and how it is already used, particularly as privacy depends fundamentally on the purpose for which the data is processed and by whom. For example, in medical settings, you might expect me as your neurologist to have access to your relevant medical information from other healthcare professionals, such as reports, documents and the results of investigations such as scans and laboratory tests. Currently, most such data is held by NHS organisations or by commercial companies operating for and on behalf of those healthcare providers. There are, therefore, already many situations in which selected data is shared externally with healthcare, social care and commercial entities. Such data are used for the direct care of individual patients and for managing our clinical services. Likewise, there are already many situations in which routine NHS data is combined, de-identified in some way and made available for research, such as the CPRD (https://www.cprd.com/home/) and SAIL (https://saildatabank.com). Such systems usually try to remove obviously identifiable information such as name and date of birth and instead generate categorical variables (e.g. year of birth) to obfuscate the identity of individuals. Have a look at https://understandingpatientdata.org.uk/what-does-anonymised-mean for an explanation of de-identified, anonymised and pseudonymised data.
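
To make that concrete, here is a minimal, illustrative sketch (in Python) of this kind of de-identification. The record structure and field names are invented for illustration and are not how CPRD or SAIL actually work:

```python
# Minimal illustration (not a real CPRD/SAIL pipeline): strip direct
# identifiers and generalise date of birth to a categorical year of birth.
from dataclasses import dataclass
from datetime import date


@dataclass
class PatientRecord:          # hypothetical source record
    nhs_number: str
    name: str
    date_of_birth: date
    diagnosis_codes: list[str]


def de_identify(record: PatientRecord) -> dict:
    """Return a research copy with direct identifiers removed and
    date of birth coarsened into a year-of-birth field."""
    return {
        "year_of_birth": record.date_of_birth.year,
        "diagnosis_codes": record.diagnosis_codes,
    }


print(de_identify(PatientRecord("9434765919", "Jane Doe",
                                date(1958, 3, 14), ["G35"])))
# {'year_of_birth': 1958, 'diagnosis_codes': ['G35']}
```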

Unfortunately, there is not a clean discontinuity between identifiable and non-identifiable information but rather a spectrum running from low to high risk of re-identification. This risk increases as more and more data are made available, and this is where the concept of “big data” raises privacy concerns. You may not be able to identify me from only knowing that I am a neurologist, but if you also know that I work in Wales and have an interest in information technology then you can probably work out that it is me. As a result, “big data”, in which large amounts of different types of data are made available and aggregated, can undermine techniques used to de-identify that data.
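
A toy example makes the point: the more attributes you combine, the smaller the crowd you can hide in. The records below are entirely made up, but the principle is the same as in the neurologist example above:

```python
# Toy illustration of re-identification risk from combined quasi-identifiers.
# All data are invented for illustration.
records = [
    {"profession": "neurologist", "region": "Wales",   "interest": "informatics"},
    {"profession": "neurologist", "region": "Wales",   "interest": "stroke"},
    {"profession": "neurologist", "region": "England", "interest": "informatics"},
    {"profession": "GP",          "region": "Wales",   "interest": "informatics"},
]


def matching(**attrs) -> int:
    """How many records share this combination of attributes?
    A count of 1 means the combination is effectively identifying."""
    return sum(all(r[k] == v for k, v in attrs.items()) for r in records)


print(matching(profession="neurologist"))                  # 3 - still hidden in a group
print(matching(profession="neurologist", region="Wales",
               interest="informatics"))                    # 1 - unique, so re-identifiable
```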

The next question is: does sharing identifiable or re-identifiable data matter? In general, the answer depends on who is doing the processing and on the consequences, benefits and risks of that processing. People naturally have different expectations for the use of data depending on whether it is for direct care, for clinical research, for a perceived “greater good” or for commercial purposes. The care.data fiasco demonstrated the strength of public concern about the release of information to commercial organisations such as the insurance industry (e.g. see https://www.theguardian.com/commentisfree/2014/feb/28/care-data-is-in-chaos). Fundamentally, therefore, we need strong and appropriate legal and regulatory frameworks so that we can take advantage of the enormous opportunities inherent in combining these sources of data with modern technology to improve patient care and clinical research. The new GDPR legislation extends the rights of individuals to control the purposes for which their information is used, and we in healthcare must take these changes into account when considering “big data” and machine learning. See https://ico.org.uk/for-organisations/data-protection-reform/overview-of-the-gdpr/

Similarly, the issue of “value” depends on who gains that value. If an algorithm is developed that can, for example, screen a chest x-ray to determine whether it is normal or abnormal, with only the latter images needing formal radiologist review, then there would be huge interest across the world in commercialising that technology and freeing up valuable radiologist time. Who should own that technology? The services providing the data, or the technology companies who have built the systems that learn from that data? For me, this should be a partnership between health and technology, as each needs the other in order to make progress and potentially improve patient care. As a result, I’d always expect explicit clauses relating to intellectual property rights to be included in data sharing agreements in which software and algorithms may be of commercial value.

What other privacy-protecting/enhancing technologies are there?

I really like “Understanding patient data”: see https://understandingpatientdata.org.uk/how-data-kept-safe and its explanations.

Essentially, it is necessary to have a mixed set of strategies as there is no single “magic bullet” that preserves or enhances privacy. Instead, we need a combination of appropriate legal and regulatory frameworks (as I mentioned above), independent review and audit, formal software specifications and methodologies and, most importantly, a conversation with everyone about the benefits and risks of the use of patient data to improve care, now and in the future. As part of that national debate, we must determine the scenarios in which different levels of consent are required, such as an explicit opt-in for clinical research or an opt-out for other uses. At the same time, we must increase general public awareness of the benefits and risks of the wider use of health information. Ideally, data should be de-identified as far as possible for the specific use-case in question, so that we start to talk instead about the “risk of re-identification”; it is likely that different levels of risk will be acceptable for different use-cases.

There are additional technologies to add into the mix, such as “differential privacy” (see https://en.wikipedia.org/wiki/Differential_privacy for more information), which introduces randomness into aggregated data in order to reduce the risk of re-identification while preserving the conclusions drawn from that data. Another interesting research development has been homomorphic encryption, which allows information such as private medical data to be encrypted and subsequently processed without needing decryption. Have a look at https://en.wikipedia.org/wiki/Homomorphic_encryption, but such technology is, as far as I am aware, at a very early stage as it is extremely computationally intensive.
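
As a rough illustration of the differential privacy idea, the sketch below adds calibrated Laplace noise to a simple count query. The cohort size and the epsilon value are arbitrary choices for illustration, not recommendations:

```python
# Sketch of the Laplace mechanism: add calibrated noise to a count query so
# that any single patient's presence or absence changes the released answer
# only slightly, while the overall conclusion is preserved.
import random


def dp_count(true_count: int, epsilon: float = 0.5, sensitivity: float = 1.0) -> float:
    """Release a count with Laplace(sensitivity/epsilon) noise added.
    A count query has sensitivity 1: one patient changes it by at most 1."""
    scale = sensitivity / epsilon
    # A Laplace sample is the difference of two exponential samples
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise


# e.g. "how many patients in the cohort have epilepsy?"
print(round(dp_count(1243), 1))   # close to 1243, but each run differs slightly
```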

Are we considering privacy enough?

Probably not!

Health data becomes even more complex when one considers the ever-increasing use of new technologies in healthcare. I have written about these in my “disruptive technologies” blog post (see http://wardle.org/clinical-informatics/2017/07/07/disruptive-influences-health.html) and, in particular, about how we deal with data from multiple sources such as devices carried by the patient or installed in the home. As a result, there will be an increasing number of “attack surfaces” from which confidential medical information may be leaked. For instance, how will we deal with the data sent from a device designed to monitor an elderly patient and check for falls, whether they have left their home, or their level of activity? How will we ensure that only the right people receive that information? The same issues apply to medical and health applications aimed at consumers, which may leak confidential medical information. Unfortunately, there are many examples; see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4419898/ for just some of them.

A useful resource is the “privacy-by-design” page from the ICO (see https://ico.org.uk/for-organisations/guide-to-data-protection/privacy-by-design/), which describes how organisations can undertake a privacy impact assessment to reduce potential harm to patients. Such tools are only useful when they are used, so the concept of sandboxes is also important, particularly for mobile device “apps”. For example, on most mobile devices an application must request permission to access contacts, the camera or onboard health data; checking those permissions is not something the application can choose to implement (or not), but is instead a function of the sandbox in which the application runs, where all data access can be logged (see the sketch below). So I think device manufacturers have a role in checking and verifying applications in their app stores, and programmes like NHS Digital’s app store (see https://apps.beta.nhs.uk) add an additional layer of inspection and curation.
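
The following toy sketch illustrates the principle that permission checks and access logging belong to the sandbox rather than to the application itself. All the class, app and resource names are invented for illustration and do not correspond to any real mobile platform API:

```python
# Toy model of a sandbox mediating access to health data: the app never
# touches the data store directly; the platform checks permissions and
# records every request. All names are invented for illustration.
import datetime


class Sandbox:
    def __init__(self, granted_permissions):
        self._granted = set(granted_permissions)
        self.access_log = []

    def read(self, app_name: str, resource: str):
        allowed = resource in self._granted
        self.access_log.append((datetime.datetime.now(datetime.timezone.utc),
                                app_name, resource,
                                "granted" if allowed else "denied"))
        if not allowed:
            raise PermissionError(f"{app_name} may not read {resource}")
        return load_health_data(resource)      # platform-level data access


def load_health_data(resource: str):
    # Placeholder for the platform's own data store
    return {"heart_rate": [72, 75, 71]}.get(resource)


sandbox = Sandbox(granted_permissions={"heart_rate"})
print(sandbox.read("falls-monitor-app", "heart_rate"))   # allowed, and logged
# sandbox.read("falls-monitor-app", "contacts")          # would raise PermissionError
```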

Conclusions

I can’t add much more to what Dame Fiona Caldicott said: https://www.gov.uk/government/news/national-data-guardian-ndg-statement-on-government-response-to-the-ndg-review

My main conclusion is that there is no single answer but a combination of solutions:

  • give people choice,
  • engender public trust by
    • establishing appropriate regulatory and legislative frameworks,
    • adopting appropriate technical safeguards and de-identifying information at a level that minimises the risk of re-identification while still satisfying the specific analytic requirements on that occasion,
    • using verifiable logs to ensure that access to confidential information is recorded and capable of inspection (see the sketch below).
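
On that last point, one simple way to make an access log verifiable is to chain entries together with hashes, so that any later tampering with who-accessed-what becomes detectable. The sketch below is just one illustrative approach rather than a recommendation of any specific product:

```python
# Minimal hash-chained audit log: each entry commits to the previous one,
# so altering or removing an earlier entry breaks verification.
import hashlib
import json


class AuditLog:
    def __init__(self):
        self.entries = []

    def append(self, who: str, what: str, why: str):
        prev_hash = self.entries[-1]["hash"] if self.entries else "0" * 64
        record = {"who": who, "what": what, "why": why, "prev": prev_hash}
        record["hash"] = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()).hexdigest()
        self.entries.append(record)

    def verify(self) -> bool:
        """Recompute every hash and check the chain is unbroken."""
        prev = "0" * 64
        for e in self.entries:
            body = {k: e[k] for k in ("who", "what", "why", "prev")}
            expected = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if e["prev"] != prev or e["hash"] != expected:
                return False
            prev = e["hash"]
        return True


log = AuditLog()
log.append("dr-smith", "patient-123/imaging", "direct care")
print(log.verify())   # True; altering any earlier entry makes this False
```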

Medical data are tremendously valuable and powerful and offer great scope to do good, but we also have a great responsibility to protect those data and to ensure that access is safe, secure and transparent.

Best wishes,

Mark