FTRM

I’m pleased to announce the release of an open specification for terminology: FTRM

  • SQLite is a fantastic application file format - SQLite database files are portable between 32-bit and 64-bit machines and between big-endian and little-endian architectures and between any of the various flavors of Windows and Unix-like operating systems.
  • SQLite is accessible, portable and reliable

If we as a community agree an open specification, we can make it easier to meaningful share terminology data such as code systems, value sets and maps across systems.

Introduction

If you build clinical software, whether operational or as part of data analytics, sooner or later you have to ship terminology — e.g. SNOMED CT, a value set for a research cohort, a concept map between two coding systems. The FHIR terminology resources (CodeSystem, ValueSet, ConceptMap, NamingSystem) are the right abstraction. But the distribution unit — the file you actually copy around, install on a server, drop into a CI pipeline, or hand to a colleague — has been a recurring source of friction.

You can ship raw FHIR JSON. That works for a few hundred kilobytes of conformance resources, and falls apart at the scale of even a modest national subset. You can stand up a Postgres instance and load it from a vendor’s bespoke pipeline. That works for a single deployment, and breaks the moment you need a second one or want to hand a snapshot to someone outside your organisation. Or you can serve everything from a remote terminology service over HTTP — fine when you have the network, untenable when you do not. Making billions of server round-trips to a national terminology ‘server’ isn’t going to work - most throttle requests.

I’ve long felt being able have the flexibility to run software like this on your own machine is important, and it is ideal if that code is open-source.

Why a file format?

What’s been missing is a single file you can email, scp, drop in S3, or commit to a release artefact, that contains the canonical terminology resources, is fast enough to serve directly, and is portable between implementations.

I’d like to introduce: FTRM — a FHIR Terminology Container. It is an open specification for a single SQLite database that holds FHIR CodeSystem, ValueSet, ConceptMap and NamingSystem resources, with enough structure to serve the canonical terminology operations directly off the file — $lookup, $validate-code, $expand, $translate, $subsumes.

A note on scope: FTRM v1 is FHIR-shaped. It does not define SNOMED CT’s expression constraint language (ECL) or reference-set semantics, and it is not where I currently put SNOMED. The reference implementation, hades, serves SNOMED CT through a separate dedicated store — hermes, built on LMDB and Lucene, with native ECL evaluation — and uses FTRM for LOINC and HL7 FHIR NPM packages. The composite catalogue dispatches by canonical URL, so from a client’s perspective it is one terminology server. The separation is intentional: SNOMED-specific operations benefit from SNOMED-specific data structures, and FTRM’s job is to be the boring, portable container for everything that fits the plain FHIR terminology shape.

Why SQLite

The choice of SQLite is deliberate. SQLite is the most-deployed database in the world; it is in every browser, phone and operating system. The file format is stable for the long term and has a published, public-domain reference implementation. There is no daemon, no orchestrator, no clustering layer to install. A FTRM file is just bytes — you can sha256sum it, version it, mirror it, and verify it as you would any other release artefact.

And critically: a FTRM file is queryable with stock tooling. If you want to know how many concepts a code system has, how its hierarchy is shaped, or which value sets reference which systems, you open the file with the sqlite3 CLI and write SQL. There are no opaque binary blobs, no implementation-defined indexes you have to reverse-engineer, no proprietary on-disk structures. Everything is documented, including the foreign keys and the FTS5 indexes.

What’s in the file

The schema separates two kinds of fields: the ones a server needs to reason about, and the ones it has to round-trip without understanding.

The first set — case_sensitive, hierarchy_meaning, status, experimental, publisher and so on — are real columns that you can index and filter. The second set — copyright, contact points, language extensions, useContext, the long tail of FHIR metadata that varies between resources but that a terminology server doesn’t dispatch on — round-trips through a single JSON column called metadata. That keeps the schema honest about what matters operationally, while preserving fidelity for everything else.

Concept properties follow the same idea. The four well-known concept properties (inactive, abstract, notSelectable, status) are projected onto columns of the concept table, so the common filters are cheap. Everything else lives in concept_property with FHIR’s value[x] polymorphism mirrored as typed columns (value_str, value_int, value_bool, …).

The concept_ancestor table is the transitive closure of the parent edges, with the shortest-path depth as a column. Subsumption queries (is-a, descendent-of) become a single indexed lookup rather than a recursive CTE. Writers rebuild the closure when they write the file; readers trust it.

Free-text search uses SQLite’s FTS5, with unicode61 remove_diacritics 2 baked into the contract so cafe matches café regardless of where the file was built. That detail matters — if FTS tokenisation drifts between writers, the same query returns different results on different files, and the format stops being portable in any meaningful sense.

Versioning

FTRM stores an arbitrary number of versions of any resource. The identity tuple is (resource_type, url, version). Selection between versions is the caller’s responsibility, driven by the FHIR system-version and force-system-version parameters at request time. There is no ambient “current version” baked into the file — which means a single FTRM file can hold this year’s LOINC release alongside a frozen snapshot from two years ago, or successive versions of hl7.terminology.r4, and a research pipeline can reproduce its inputs by referring to versions explicitly.

Writes are replace-by-tuple: a writer that re-imports a (CodeSystem, url, version) triple deletes the dependent rows keyed by that tuple and re-inserts. There is no patch-in-place journal. The file is a snapshot.

File identification and conformance

Every FTRM v1 file stamps two values into the SQLite header: application_id = 0x4654524D (ASCII FTRM) and user_version = 1. A reader that opens a file without the FTRM stamp must refuse it. That’s a sanity gate: it stops a generic SQLite file being mistakenly treated as a terminology container, and it gives file(1) something to recognise.

The specification defines what it means to be a conforming reader and a conforming writer. A reader that enumerates tx_resource, registers each row as an addressable resource, and honours the well-known property columns at lookup time — and the FTS indexes for text search, and concept_ancestor for filter pushdown — is conformant. A writer that produces files passing PRAGMA foreign_key_check and rebuilds the ancestor closure and FTS indexes is conformant. The list is short and operational, not aspirational.

Why this matters

The point of an open specification is that no single vendor owns the format. The schema, identifiers and prose of FTRM v1 are released under CC0 — the public-domain dedication. Anyone can write a reader or a writer, in any language, without asking permission and without licensing friction. The reference implementation is hades, my open-source FHIR terminology server, and it passes 81.8% of the HL7 FHIR Terminology Ecosystem IG conformance suite at the time of writing. But the format does not depend on hades, and a writer in Python, Java, Rust or Go that follows the spec produces files that any other conforming reader can serve.

For the NHS, where I sit, this matters in a very specific way. We have years of accumulated experience with terminology distribution that has been awkward at best. SNOMED CT, dm+d, the NHS Data Dictionary, the Welsh national subsets — each comes with its own pipeline, its own format, its own assumptions about what a server will do with it. Layering a FHIR-shaped, SQLite-backed, vendor-neutral container on top of that does not solve the underlying provenance and governance questions, but it does mean that once those questions are answered, the artefact you ship is boring, ubiquitous and inspectable.

The same applies to research. A reproducible cohort study should be able to declare its terminology inputs by file hash. Today, that typically means archiving a tarball of source releases and a loader script. With FTRM, it means archiving a single .db file and a sha256 — and any reader can serve it.

What’s next

FTRM is at v1. It is stable; breaking changes will bump the schema version and ship migration tooling. The spec lives in the hades repository at doc/ftrm.md, and the reference DDL is the SQL file the implementation actually runs against. There is no separate paper.

If you build, ship, or consume FHIR terminology, please read it, try it, and tell me where it is wrong. The schema is the contract; sharpening that contract is how the format becomes useful beyond my own work.

Mark