An update on Merkle trees

In a previous post, I referenced Merkle trees in relation to the creation of a verifiable audit log. I broke a cardinal rule by discussing something for which I (then) had little knowledge.

In my defence, the technology is rather clever so I don’t regret writing about it. However, I’ve had some help correcting my misunderstandings, so I thought it would be prudent to write down what I’ve learned. Any remaining errors are mine.

An introductory article on this technology can be found on the Government digital service’s site.

Merkle trees are more interesting than I thought

I originally wrote:

It uses a nested tree design called Merkle trees but this is simply an optimisation to limit the computational requirements whenever data is added.

It turns out that the use of Merkle trees is more than simply an optimisation of a hash list, although that is a very important benefit. The tree structure allows rapid proof that an entry is in a log by simply checking a branch of nodes (log(n)), rather than all nodes, and so the whole data structure can be verified without transferring the whole contents. This isn’t just an optimisation but permits validation of databases of potentially unbounded size.

Similarly, this same can be said for replicating data in a distributed system. For example, if we are replicating data then we can start with the root of the tree (a list of one hash value) and compare the hashes. If there are no differences, the replication can stop, but otherwise, the process repeats with the child nodes in a recursive manner.

In addition, a Merkle tree can be used as a key-value store. A ‘map’ in programmer’s terms is a data structure in which a value can be stored associated with a key; such a structure is called an associative array. Usually, to avoid having to iterate through all entries in order to find the one that matches the key, a hash function is used to assign a key into (hopefully) a unique bucket. This ‘hashmap’ makes it much quicker to find the value for a particular key.

When creating an append-only log, a Merkle tree is “dense”; there are no gaps and the tree is populated conceptually from left-to-right at successive indices and each hash is a hash of the child nodes.

In “map” mode, instead of adding entries left-to-right, entries are added using the hash as the index. Thus, the tree is populated piecemeal. Now if you’ve been reading carefully, you might then assume that a Merkle tree in map mode would be impossible to implement, as there would be 2^256 buckets and an enormous tree of hashes to calculate and validate. However, almost all entries are empty. This is why this type of Merkle tree is termed a “sparse Merkle tree”, and we can derive identical hashes for most branches within the tree. Thus, a practical implementation becomes feasible. In the same way as a verifiable log, a map may be synchronised by recursively comparing hashes from the top down, stopping prematurely at a level if the hashes match.

It is possible to combine the benefits of a key-value store and log when one uses a verifiable log to store the transactions which occur on the key-value store, in a so-called “verifiable log-backed map”. A primer on these verifiable data structures is available as part of Google’s own implementation, Trillian.

Deepmind Health’s verifiable audit log

I also wrote:

Presently, I understand that this is designed to be a centralised record. Such centralisation means that potentially, whoever controls the ledger can change the ledger, simply by re-computing the hashes of the data to reflect the change. Computationally, this may be inefficient, but the verifiable log only becomes verifiable when the ledger is held by a trusted party. For the Deepmind collaboration, I would imagine the NHS organisation holds the ledger while Google writes to it.

This isn’t quite true either.

Importantly, the log itself can potentially be held by a third-party entity such as Deepmind but in order to be verifiable, the NHS organisation needs to monitor and periodically verify that no-one has tampered with the log. This means that there needs to be a trusted infrastructure which partner organisations may use; after all, a log helps our cause only if someone is actually checking it!

But how does one prevent a ‘nefarious’ organisation making use of data inappropriately simply by not logging the data access in the first place? Such an issue emphasises the importance of non-technical factors in obtaining and maintaining trust in relation to sensitive information. Is it sufficient to perform code review and audit in order to create trust? Perhaps.

I am interested in promoting technical mechanisms which can guarantee that data access is logged appropriately and I suspect that a combination of technical and non-technical measures will be required. I have discussed these measures in a previous blog post on information governance and innovation.