Codelists
If a patient has a type of motor neurone disease, then…
If a patient takes an ACE inhibitor, then …
If a patient has a type of autoimmune disease, then …
Whether we are writing software to build compelling user-facing applications, building rules or decision support, or choosing categories from which we will run analyses such as understanding patient outcomes, we need to be able to process health and care data and make inferences.
My open source software codelists generates versioned codelists for reproducible data pipelines and research.
In general, there are two ways to think about codelists and reproducibility.
The first is an explicit human curation of a list of codes. This is the approach adopted by Ben Goldacre and the opencodelists team. You create and share codelists.
The second is to define a codelist using a declarative specification which can be used to dynamically - but reproducibly - generate the codelist.
Let’s look at an example. opencodelists has a codelist to specify terms that represent a referral to colorectal services under a 2-week wait (urgent) basis. See https://www.opencodelists.org/codelist/phc/2ww-referral-colorectal/7eac259d/#full-list
At the time of writing, this manually curated list includes one active concept and two inactive concepts.
276401000000108 Fast track referral for suspected colorectal cancer
276411000000105 Urgent cancer referral - colorectal
276421000000104 Urgent cancer referral - colorectal
While this is a useful set of curated terms, I would argue that it is better to define this codelist with a constraint written in the SNOMED CT expression constraint language (ECL):
{
"ecl": "<<276401000000108"
}
Result:
=> #{276401000000108 276411000000105 276421000000104}
Based on a named versioned distribution of SNOMED, and defined versions of this tool, this specification can be used to generate a reproducible codelist. If SNOMED CT changes over time, this specification will continue to work, due to the semantic relationships within SNOMED CT. codelists can expand a set of codes to include now inactive concepts using historical associations.
Certainly for drugs, a declarative rules approach works better than the manually curated set of opencodelists. If new drugs of a type are added to the UK dictionary of medicines and devices (dm+d), then codelists will include those new drugs without any manual intervention, while manual curation requires continued monitoring and maintenance of code lists.
How to use codelists
You can define codelists using a variety of means, such as
- ICD-10 codes for diagnoses
- ATC codes for drugs
- SNOMED CT expressions in the expression constraint language (ECL).
You can combine these approaches for high sensitivity, or manually derive codelists using hand-crafted ECL for high specificity.
codelists is a simple wrapper around two other services - hermes and dmd. I think it is a nice example of composing discrete, but related services together to give more advanced functionality.
codelists operates:
- as a library, so it can be embedded within another software package running on the Java virtual machine (JVM), written in, for example, Java or Clojure
- as a microservice, so it can be used as an API by other software written in any language
The substrate for all codelists is SNOMED CT. That coding system is an ontology and terminology, and not simply a classification. That means we can use the relationships within SNOMED CT to derive more complete codelists.
If you only use the SNOMED CT ECL to define your codelists, then simply use hermes directly.
You only need the additional functionality provided by codelists if you are building codelists from a combination of SNOMED CT ECL, ATC codes and ICD-10.
ATC maps are not provided as part of SNOMED CT, but are provided by the UK dm+d. ICD-10 maps are provided as part of SNOMED CT.
Using codelists
You can realise a codelist, expanding it to all of its codes. You can also test membership of a given code against a codelist.
All codelists, by default, expand to include historic codes. This will become configurable, but is the default for greater sensitivity at the expense of specificity. Different trade-offs might apply to your specific project.
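To make the trade-off concrete, here is a minimal Python sketch of expanding a set of codes with their historical associations. It is not how codelists is implemented. The `HISTORICAL` table and the `include_historic` flag are hypothetical, and the pairing of identifiers is assumed from the colorectal referral example earlier, taking the first concept to be the active one.

```python
# Hypothetical table of historical associations: active concept -> inactive concepts.
# The pairing is assumed for illustration only; a real implementation would look
# these up in a SNOMED CT distribution.
HISTORICAL = {
    276401000000108: {276411000000105, 276421000000104},
}


def with_historic(codes, include_historic=True):
    """Return the codes, optionally supplemented with associated inactive concepts."""
    expanded = set(codes)
    if include_historic:
        for code in codes:
            expanded |= HISTORICAL.get(code, set())
    return expanded


# Higher sensitivity: inactive concepts are included.
print(sorted(with_historic({276401000000108})))
# Higher specificity: only the codes that were asked for.
print(sorted(with_historic({276401000000108}, include_historic=False)))
```

Including the inactive concepts matters when older records were coded with identifiers that have since been retired.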
Boolean logic is supported, with arbitrary nesting of your codes using a simple DSL.
A codelist is defined as names and values in a map, with the names representing the codesystem and the values the specification.
{
"ecl": "<<24700007"
}
This defines a codelist using the SNOMED expression constraint language (ECL). While ECL v2.0 supports the use of historic associations within constraints, I usually recommend ignoring that ‘feature’ and instead defining whether and how historic associations are included as part of the API.
SNOMED CT, in the UK, includes the UK drug extension with a 1:1 map between SNOMED identifiers and drugs in the official UK drug index - dm+d (dictionary of medicines and devices). That means you can use a SNOMED expression to choose drugs:
{
"ecl": "(<<24056811000001108|Dimethyl fumarate|) OR (<<12086301000001102|Tecfidera|) OR (<10363601000001109|UK Product| :10362801000001104|Has specific active ingredient| =<<724035008|Dimethyl fumarate|)"
}
Note how SNOMED ECL includes simple boolean logic.
But codelists supports other namespaced codesystems. For example:
{
"atc": "L04AX07"
}
This will expand to a list of SNOMED identifiers that are mapped to the exact match ATC code L04AX07 and its descendants within the SNOMED hierarchy.
A SNOMED CT expression in the expression constraint language must be a valid expression. ICD-10 and ATC codes can be specified as an exact match (e.g. “G35”) or as a prefix (e.g. “G3*”). The latter will match against all codes that begin with “G3”.
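The exact and prefix forms can be illustrated with a few lines of Python. This is only a sketch of the string matching rule described above, assuming that a trailing asterisk is the only wildcard. The function name `matches` is hypothetical, and the real service also follows the SNOMED hierarchy, which this does not attempt.

```python
def matches(pattern: str, code: str) -> bool:
    """Return True if `code` satisfies `pattern`.

    A trailing '*' turns the pattern into a prefix match;
    without it, the code must match exactly.
    """
    if pattern.endswith("*"):
        return code.startswith(pattern[:-1])
    return code == pattern


# A handful of ICD-10 codes to try the rule against.
codes = ["G35", "G36.0", "G37.3", "G40.1"]
print([c for c in codes if matches("G3*", c)])
print([c for c in codes if matches("G35", c)])
```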
Different codesystems can be combined using boolean operators and prefix notation:
{
"or": [
{
"atc": "L04AX07"
},
{
"atc": "L04AX08"
},
{
"ecl": "(<10363601000001109|UK Product| :10362801000001104|Has specific active ingredient| =<<724035008|Dimethyl fumarate|)"
}
]
}
This expands the ATC codes L04AX07 and L04AX08, and supplements them with any other product containing dimethyl fumarate (DMF) as its active ingredient.
If multiple expressions are used, the default is to perform a logical OR. That means this is equivalent to the above expression:
[
{
"atc": "L04AX07"
},
{
"atc": "L04AX08"
},
{
"ecl": "(<10363601000001109|UK Product| :10362801000001104|Has specific active ingredient| =<<724035008|Dimethyl fumarate|)"
}
]
Duplicate keys are not supported, but multiple expressions using different keys are.
{
"atc": "L04AX07",
"ecl": "(<10363601000001109|UK Product| :10362801000001104|Has specific active ingredient| =<<724035008|Dimethyl fumarate|)"
}
When no operator is explicitly provided, a logical ‘OR’ will be performed.
For concision, all keys can take an array (vector), which will be equivalent to using “or” using the same codesystem.
{
"atc": [
"L04AX07",
"L04AX08"
]
}
Boolean operators “and”, “or” and “not” can be nested arbitrarily for complex expressions.
codelists also supports ICD-10.
{
"icd10": "G35*"
}
This will expand to include all terms that map to an ICD-10 code with the prefix “G35”, and their descendants.
The operator “not” must be defined within another term, or set of nested terms. The result will be the realisation of the first term, or set of nested terms, MINUS the realisation of the second term, or set of nested terms.
{
"icd10": "G35",
"not": {
"ecl": "<24700007"
}
}
Or, perhaps a more complex expression:
{
"or": [
{
"icd10": "G35"
},
{
"icd10": "G36.*"
}
],
"not": {
"ecl": "<24700007"
}
}
Or, more concisely:
{
"icd10": [
"G35",
"G36.*"
],
"not": {
"ecl": "<24700007"
}
}
These will generate a set of codes that includes the concepts mapped to “G35” and to any ICD-10 code with the prefix “G36.”, but omits the concepts matched by the ECL “<24700007”, that is, the descendants of multiple sclerosis.
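The rules for combining terms can be summarised in a small evaluator. The Python below is a sketch under stated assumptions, not the implementation used by codelists. The `TOY_EXPANSIONS` table stands in for the real terminology services, and its integer identifiers are invented. It shows that a list, or several keys in one map, is treated as a logical OR, that an array value is an OR within one codesystem, and that "not" subtracts its realisation from the rest.

```python
# Hypothetical lookup table standing in for hermes and dmd.
# The identifiers are made up purely for illustration.
TOY_EXPANSIONS = {
    ("icd10", "G35"): {1, 2, 3},
    ("icd10", "G36.*"): {4, 5},
    ("ecl", "<24700007"): {2, 3},
}

SYSTEMS = ("ecl", "atc", "icd10")


def realise(spec):
    """Expand a specification (a map or a list of maps) to a set of identifiers."""
    if isinstance(spec, list):
        # A bare list is an implicit OR of its elements.
        return set().union(*(realise(s) for s in spec))
    included = set()
    for key, value in spec.items():
        if key in SYSTEMS:
            # A single value or an array of values; an array is an implicit OR.
            values = value if isinstance(value, list) else [value]
            for v in values:
                included |= TOY_EXPANSIONS.get((key, v), set())
        elif key == "or":
            included |= realise(value)
        elif key == "and":
            parts = [realise(s) for s in value]
            included |= set.intersection(*parts) if parts else set()
        elif key != "not":
            raise ValueError(f"unknown key: {key}")
    # "not" removes its realisation from everything else in the same map.
    excluded = realise(spec["not"]) if "not" in spec else set()
    return included - excluded


verbose = {
    "or": [{"icd10": "G35"}, {"icd10": "G36.*"}],
    "not": {"ecl": "<24700007"},
}
concise = {"icd10": ["G35", "G36.*"], "not": {"ecl": "<24700007"}}
print(realise(verbose) == realise(concise))
```

Because every term realises to a plain set, nesting falls out of ordinary recursion and the operators map directly onto union, intersection and difference.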
You can use wildcards. Here I directly use a running codelists HTTP server to expand a codelist defined as
{
"atc": "C08*"
}
This should give a codelist containing all calcium channel blockers.
http '127.0.0.1:8080/v1/codelists/expand?s={"atc":"C08*"}'
Result:
[
374049007,
13764411000001106,
376841009,
11160711000001108,
893111000001107,
29826211000001109,
376754006,
...
You can customise how data are returned.
By default, a list of codes is returned.
To return identifier and name, use ‘as=names’
http '127.0.0.1:8080/v1/codelists/expand?s={"atc":"C08*"}&as=names'
Result:
[
{
"id": 374049007,
"term": "Nisoldipine 20mg tablet"
},
{
"id": 13764411000001106,
"term": "Amlodipine 5mg tablets (Apotex UK Ltd)"
},
{
"id": 376841009,
"term": "Diltiazem malate 120 mg oral tablet"
},
{
"id": 11160711000001108,
"term": "Exforge 10mg/160mg tablets (Novartis Pharmaceuticals UK Ltd)"
},
{
"id": 893111000001107,
"term": "Tildiem LA 300 capsules (Sanofi)"
},
...
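The same requests can be made from any language. Here is a minimal Python sketch, using only the standard library, that builds the expand URL and decodes the response. The function names are illustrative and not part of codelists, and it assumes the server accepts the JSON specification percent-encoded in the `s` parameter, as an HTTP client would normally send it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Address of the locally running server used in the examples above.
BASE_URL = "http://127.0.0.1:8080/v1/codelists"


def expand_url(spec, as_names=False):
    """Build the URL for the expand endpoint from a codelist specification."""
    params = {"s": json.dumps(spec, separators=(",", ":"))}
    if as_names:
        params["as"] = "names"
    return f"{BASE_URL}/expand?{urlencode(params)}"


def expand(spec, as_names=False):
    """Call a running server and decode its JSON response."""
    with urlopen(expand_url(spec, as_names)) as response:
        return json.load(response)


# Building the URL does not need a server; calling expand() does.
print(expand_url({"atc": "C08*"}, as_names=True))
```

With a server running, `set(expand({"atc": "C08*"}))` gives a set of identifiers against which membership can be tested with Python's `in` operator.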
For reproducible research, codelists will include information about how the codelist was generated, including the releases of SNOMED CT, dm+d and the different software versions. It should then be possible to reproduce the content of any codelist. At the moment, only the data versions are returned:
http 127.0.0.1:8080/v1/codelists/status
The following metadata will be returned:
{
"dmd": {
"releaseDate": "2022-05-05"
},
"hermes": [
"© 2002-2021 International Health Terminology Standards Development Organisation (IHTSDO). All rights reserved. SNOMED CT®, was originally created by The College of American Pathologists. \"SNOMED\" and \"SNOMED CT\" are registered trademarks of the IHTSDO.",
"32.12.0_20220413000001 UK drug extension",
"32.12.0_20220413000001 UK clinical extension"
]
}
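One way to use this metadata is to store a short provenance record alongside each generated codelist. The Python sketch below assumes the response shape shown above. The `provenance` function and its output keys are hypothetical, and picking out release identifiers by their leading digit is a heuristic rather than anything the service promises.

```python
def provenance(status):
    """Summarise the data releases reported by the status endpoint."""
    hermes = status.get("hermes", [])
    return {
        "dmd_release": status.get("dmd", {}).get("releaseDate"),
        # Keep entries that start with a digit, which skips the copyright notice.
        # This heuristic is an assumption based on the example response.
        "snomed_releases": [s for s in hermes if s[:1].isdigit()],
    }


# Sample payload following the response above, with the copyright notice abbreviated.
status = {
    "dmd": {"releaseDate": "2022-05-05"},
    "hermes": [
        "© 2002-2021 International Health Terminology Standards Development Organisation (IHTSDO). ...",
        "32.12.0_20220413000001 UK drug extension",
        "32.12.0_20220413000001 UK clinical extension",
    ],
}
print(provenance(status))
```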