There’s a problem with using SNOMED CT for data analytics. If you use SNOMED CT for analytics, you will encounter it, and you will need to handle it. In this post, I’ll explain the issue and how to mitigate it.

Data analytics using SNOMED CT

Imagine I’m building a real-time analytics pipeline for patients with multiple sclerosis.

I need to understand patient outcomes - and that means I need to group patients into cohorts.

A cohort is a group of patients with shared characteristics.

That cohort might be defined by diagnosis, by treatment type, by age, by gender, by geography, by levels of socio-economic deprivation, or by something else.

SNOMED CT is a sophisticated and comprehensive clinical terminology that provides codes representing many of these characteristics. SNOMED is special because it isn’t simply a flat list of codes, but instead it is an ontology. As a result SNOMED defines concepts and the relationships between them.

For example, it defines multiple sclerosis as a type of demyelinating disorder. Used properly, this means I can search health and care data not only for patients with multiple sclerosis, but also for patients with any demyelinating disease; patients recorded as having multiple sclerosis will be included in that cohort simply as a result of the SNOMED ontological hierarchies. I don’t need end-users to record a diagnosis of demyelinating disorder: I can search for that concept and include every disorder that is a sub-type of it.
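To make the idea of subsumption concrete, here is a minimal sketch in Python of how a cohort query can follow "is a" links. It is not how Hermes is implemented. The tiny `IS_A` table is illustrative: treating 6118003 as the demyelinating parent, and 16092000 and 18353007 as direct children of multiple sclerosis, are assumptions, and the patient records are invented.

```python
from collections import defaultdict

# (child, parent) is-a pairs. Illustrative only: a real pipeline would read
# these from a SNOMED CT release or ask a terminology server.
IS_A = [
    (24700007, 6118003),   # multiple sclerosis -> assumed demyelinating parent
    (16092000, 24700007),  # assumed sub-type of multiple sclerosis
    (18353007, 24700007),  # assumed sub-type of multiple sclerosis
]


def descendants_or_self(root, is_a_pairs):
    """Return root plus every concept reachable by walking is-a links downwards."""
    children = defaultdict(set)
    for child, parent in is_a_pairs:
        children[parent].add(child)
    found, todo = set(), [root]
    while todo:
        concept = todo.pop()
        if concept not in found:
            found.add(concept)
            todo.extend(children[concept])
    return found


# Hypothetical patient records: (patient id, recorded diagnosis code).
records = [("p1", 24700007), ("p2", 16092000), ("p3", 73211009)]

demyelinating = descendants_or_self(6118003, IS_A)
cohort = [pid for pid, code in records if code in demyelinating]
```

Nobody recorded the parent concept itself, yet both patients coded with multiple sclerosis or one of its sub-types land in the cohort, while the patient coded with diabetes mellitus does not.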

I can do the same for drugs in SNOMED CT - so I might want to search for drugs that contain, say, Glatiramer acetate - an immunological drug used in multiple sclerosis. Because the UK drug extension for SNOMED CT includes relationships such as “Has specific active ingredient”, it is straightforward to use SNOMED CT to slice and dice our health and care data in order to make valuable inferences.

Hermes

Hermes is an open-source terminology server that I wrote.

I have a tiny ($3/mo) demonstration server running - for example, you can look at detailed information about multiple sclerosis (SNOMED code 24700007) by going to http://128.140.5.148:8080/v1/snomed/concepts/24700007/extended.

You can have your own server running in minutes by following the instructions, or use my demonstration links below. It can even download and install SNOMED automatically if you live in the UK.

Here’s an example of one of the endpoints:

http://128.140.5.148:8080/v1/snomed/concepts/24700007/extended:

{
  "concept": {
    "id": 24700007,
    "effectiveTime": "2002-01-31",
    "active": true,
    "moduleId": 900000000000207008,
    "definitionStatusId": 900000000000074008
  },
  "descriptions": [
    {
      "id": 41398015,
      "effectiveTime": "2017-07-31",
      "active": true,
      "moduleId": 900000000000207008,
      "conceptId": 24700007,
      "languageCode": "en",
      "typeId": 900000000000013009,
      "term": "Multiple sclerosis",
      "caseSignificanceId": 900000000000448009,
      "refsets": [
        900000000000509007,
        900000000000508004,
        999001261000000100
      ],
      "preferredIn": [
        900000000000509007,
        900000000000508004,
        999001261000000100
      ],
      "acceptableIn": [
        
      ]
    }
  ],
  "parentRelationships": {
    "116680003": [
      6118003,      138875005,      404684003,
      123946008,      118234003,
      128139000,      23853001,
      246556002,      363170005,
      64572001,      118940003,
      414029004,      362975008,
      363171009,      39367000,
      80690008,      362965005
    ]
  },
  "refsets": [
    991381000000107,
    999002271000000101,
    991411000000109,
    1127581000000103,
    1127601000000107,
    900000000000497000,
    447562003
  ],
  "preferredDescription": {
    "id": 41398015,
    "effectiveTime": "2017-07-31",
    "active": true,
    "moduleId": 900000000000207008,
    "conceptId": 24700007,
    "languageCode": "en",
    "typeId": 900000000000013009,
    "term": "Multiple sclerosis",
    "caseSignificanceId": 900000000000448009
  }
}

It can also provide a FHIR terminology server API via hades.

Hermes operates as a library or as a microservice. It is designed to be immutable once running, so we might run several services, each providing a different version of SNOMED CT and each load-balanced. Other terminology servers do not use this approach, but instead update in place and manage versions within the same server. I prefer multiple small services that run independently, switching between versions at the API gateway or reverse proxy.

The SNOMED CT expression constraint language

The specification for the SNOMED CT expression constraint language (ECL) is available here. It’s a way of defining a set of SNOMED CT concepts.

Here’s a simple example:

<<  73211009 |Diabetes mellitus|

This means, give me a set of codes that represent diabetes mellitus, including its sub-types. You can see the codes this expands to here.

When I am building a user interface component to allow a pop-up and autocompletion box, for say, country of birth, I might search based on the text the user has entered and limit the search to the set of concepts defined by:

<370159000|Country of birth|

This will mean a search for “Cro” will give me “Born in Croatia” but exclude “Crohn’s disease”.

http://128.140.5.148:8080/v1/snomed/search?s=cro&constraint=<370159000:

[
  {
    "id": 459924011,
    "conceptId": 315409004,
    "term": "Born in Croatia",
    "preferredTerm": "Born in Croatia"
  }
]

I wouldn’t want to record a diagnostic term in a field that should only record concepts that are a sub-type of country of birth. I can both configure and validate user input.
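As a rough sketch of that configure-and-validate idea, the Python below builds the search URL shown above and checks a chosen concept against the returned matches. The helper names (`search_url`, `allowed_concepts`, `is_valid_entry`) are made up for illustration, and the sample body is simply the response shown above; no network call is made here.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://128.140.5.148:8080/v1/snomed"


def search_url(text, constraint, base_url=BASE_URL):
    """Build a search URL limited to concepts matching an ECL constraint."""
    query = urlencode({"s": text, "constraint": constraint})
    return f"{base_url}/search?{query}"


def allowed_concepts(response_body):
    """Collect the concept identifiers from a search response body."""
    return {item["conceptId"] for item in json.loads(response_body)}


def is_valid_entry(concept_id, response_body):
    """Accept a user's choice only if it was among the constrained results."""
    return concept_id in allowed_concepts(response_body)


# The response shown above, pasted in as a string for the example.
sample = (
    '[{"id": 459924011, "conceptId": 315409004, '
    '"term": "Born in Croatia", "preferredTerm": "Born in Croatia"}]'
)

url = search_url("cro", "<370159000")
```

Checking against search results only validates a choice against what the user typed; a stricter validator would test membership of the full expansion of the constraint instead.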

You can think of ECL as providing a quick and easy way to define a set of codes that you’re interested in. In essence, it builds codelists - a subset of codes which can be used or searched.

As you might expect, you can use boolean logic in an expression to combine different terms. For example, if you’re searching for patients who have reduced splenic function you might use

<<234319005|Splenectomy| OR <<23761004|Hyposplenism|

You can see the results of this expression here
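If you are composing expressions like this in code, a small helper keeps things tidy. This is only a sketch: `any_of` is a hypothetical name, and it handles the simple case of OR-ing constraints together. Mixing OR with other operators needs brackets, which this does not attempt.

```python
from urllib.parse import quote


def any_of(*constraints):
    """Join ECL constraints with OR, matching a concept in any of them."""
    return " OR ".join(constraints)


ecl = any_of("<<234319005", "<<23761004")

# Percent-encode the expression so it can be used as a URL query value.
encoded = quote(ecl, safe="")
```

The encoded string can then be passed as the `ecl` query parameter when asking a server to expand the expression.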

I use the combination of a user-entered search string (e.g. “MND”) and a constraint to help users enter information in a context-appropriate way - e.g. by type, or by membership of a reference set etc.

Building code lists

In summary, we can use the expressions to realise a codeset. For example, we might want to build a list of diagnoses that are a type of neurological disease, suitable for use when interrogating data sources for an audit or for research.

In HL7 FHIR, the operation to turn an expression like this into a value set is called expansion.

So what’s the problem?

SNOMED CT is an evolving clinical terminology. That means it is updated, refined and changed over time. Fortunately, concepts are never deleted, and identifiers are never re-used, but concepts can be inactivated.

When this happens, all of its relationships are removed.

What this means in practice is that a now-outdated or redundant concept will not be found when we use the SNOMED CT relationships to define an interesting set of codes!

Let’s look at an example related to multiple sclerosis.

So we want all patients who have multiple sclerosis?

Won’t the codes we need be included in the result of:

<< 24700007

That means give me the concept 24700007 and all of its descendants (sub-types). Have a look at the results here.

No, they won’t!

Have a look at 24700007 in the SNOMED online browser - and start clicking on the children and the children of those children. The expression ‘<<24700007’ will, in essence, return all of them for you in an instant.

Legacy data!

But let’s look at our legacy data. In our electronic health and care record, we have some old data that includes the concept 155023009 - this is an outdated, inactive concept representing multiple sclerosis, and it won’t be found using <<24700007! It won’t be found because inactivated concepts don’t have any active relationships.

This is a problem!

That patient, just because they’ve been recorded as having multiple sclerosis using a term now inactive, potentially won’t show up in our dataset! This isn’t an uncommon scenario; and it is a problem that will increase as more health and care software uses SNOMED CT.
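Here is a minimal sketch of how that silently goes wrong in a pipeline. The records are invented, and the code list is cut down to a single code for brevity; a real expansion of <<24700007 contains many more.

```python
# A (truncated) code list built from the active hierarchy under 24700007.
active_ms_codes = {24700007}

# Hypothetical records: (patient id, recorded diagnosis code).
records = [
    ("p1", 24700007),   # current concept for multiple sclerosis
    ("p2", 155023009),  # legacy, now-inactive concept for multiple sclerosis
]


def split_cohort(records, codes):
    """Partition patient ids by whether their recorded code is in the code list."""
    matched = [pid for pid, code in records if code in codes]
    unmatched = [pid for pid, code in records if code not in codes]
    return matched, unmatched


matched, unmatched = split_cohort(records, active_ms_codes)
```

Nothing fails and nothing warns: the second patient simply ends up in `unmatched`, even though both have multiple sclerosis.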

What are the potential solutions?

There are four options:

Highlight now-inactive concepts in our dataset and manually update them to their modern equivalents, i.e. fix the problem in our source data by flagging those concepts to clinical end-users.
When processing our dataset, highlight inactive concepts and append the modern replacements or equivalents.
When generating searches, include outdated concepts in the code lists.
Provide easy access to multiple versions of SNOMED CT, selectable at runtime, so that inferences can be made using the version that was current when each item of data was entered.

There are a variety of trade-offs for each option but each can make use of the historical association reference sets that are provided as part of SNOMED CT.

A historical association reference set links a now-outdated concept to the concept or concepts that better represent it today.

Unfortunately, this isn’t as simple as it might sound. Some concepts are genuinely now regarded as wrong - or may be better represented using one of many more specific terms. Imagine a disease we used to think of as a single entity, but which we now realise is better represented as one of three more specific entities, none of which is necessarily an exact equivalent.

Here are some example reference sets that will help us:

Name                     Concept identifier

REPLACED-BY              900000000000526001
SAME-AS                  900000000000527005
POSSIBLY-EQUIVALENT-TO   900000000000523009

The simplest is REPLACED-BY: where an old concept has been conceptually replaced by a single new one, there will be a 1:1 mapping between the two. But some concepts are truly outdated, and there will be some ambiguity in how to interpret a now-outdated term.

We can even ask SNOMED to give us all of the historical association reference set types: http://128.140.5.148:8080/v1/snomed/expand?ecl=<900000000000522004

Result:

[
  {
    "id": 900000000001151017,
    "conceptId": 900000000000523009,
    "term": "POSSIBLY EQUIVALENT TO association reference set",
    "preferredTerm": "POSSIBLY EQUIVALENT TO association reference set"
  },
  {
    "id": 900000000001152012,
    "conceptId": 900000000000524003,
    "term": "MOVED TO association reference set",
    "preferredTerm": "MOVED TO association reference set"
  },
  {
    "id": 900000000001154013,
    "conceptId": 900000000000525002,
    "term": "MOVED FROM association reference set",
    "preferredTerm": "MOVED FROM association reference set"
  },
  {
    "id": 900000000001157018,
    "conceptId": 900000000000526001,
    "term": "REPLACED BY association reference set",
    "preferredTerm": "REPLACED BY association reference set"
  },
  ...

When operating interactively, we can ask our user to resolve ambiguities and map to a more modern term. But what about for analytics? We might be processing millions of health records in which it will not be practical to update legacy terms by hand.

We are left with two options:

Pre-process each health and care record mapping legacy terms to modern equivalents.
Pre-process our searches, valuesets and code lists so that they include legacy inactive concepts as well as the modern equivalents.

Pre-process the health and care record

For our inactive term, in our data pipeline, we could look for this concept’s historical association reference sets and include some or all of the modern replacements in-place.

We then perform analysis on a modified patient record that has been updated to use only active terms.

We can do this easily by identifying now inactivated concepts, and following the historical associations for that concept.

Let’s try a worked example:

You can see that 155023009 is inactive - and you can see the same via Hermes:

{
  "id": 155023009,
  "effectiveTime": "2002-01-31",
  "active": false,
  "moduleId": 900000000000207008,
  "definitionStatusId": 900000000000074008
}

We can use this concept’s reference set membership to see how SNOMED thinks we might be able to map into the current version of the terminology:

http://128.140.5.148:8080/v1/snomed/concepts/155023009/historical:

{
  "900000000000527005": [
    {
      "id": "cc542ff9-d695-52ff-a20b-8091e5b0145b",
      "effectiveTime": "2002-01-31",
      "active": true,
      "moduleId": 900000000000207008,
      "refsetId": 900000000000527005,
      "referencedComponentId": 155023009,
      "targetComponentId": 24700007
    }
  ]
}

This end-point picks out only the historical association reference set types and lists them conveniently keyed by the association reference type. This makes it straightforward to follow SAME-AS, REPLACED-BY or POSSIBLY-EQUIVALENT-TO links.

So here we see that 155023009 is linked to 24700007 by virtue of a SAME-AS association - reference set 900000000000527005.
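In a data pipeline, following those links might look something like the sketch below. It works on the response shape shown above, pasted in as a Python dictionary. The function names and the choice to trust only SAME-AS and REPLACED-BY by default are my assumptions for illustration; which association types you accept is a decision for your own analysis.

```python
SAME_AS = 900000000000527005
REPLACED_BY = 900000000000526001
POSSIBLY_EQUIVALENT_TO = 900000000000523009

# The historical associations for 155023009, as shown above.
historical_155023009 = {
    "900000000000527005": [
        {
            "id": "cc542ff9-d695-52ff-a20b-8091e5b0145b",
            "effectiveTime": "2002-01-31",
            "active": True,
            "moduleId": 900000000000207008,
            "refsetId": 900000000000527005,
            "referencedComponentId": 155023009,
            "targetComponentId": 24700007,
        }
    ]
}


def modern_equivalents(historical, accepted=(SAME_AS, REPLACED_BY)):
    """Return target concepts from active associations of the accepted types."""
    targets = set()
    for refset_id, items in historical.items():
        if int(refset_id) in accepted:
            targets.update(
                item["targetComponentId"] for item in items if item["active"]
            )
    return targets


def normalise(code, historical_by_code):
    """Map a recorded code to its modern equivalents, or keep it unchanged."""
    targets = modern_equivalents(historical_by_code.get(code, {}))
    return targets or {code}


lookup = {155023009: historical_155023009}
```

Adding `POSSIBLY_EQUIVALENT_TO` to `accepted` widens the net at the cost of precision, which is exactly the ambiguity described earlier.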

Pre-process our searches, valuesets and codelists

Alternatively, when we’re generating a list of codes for our code list, we could reverse this process and look at the modern concept(s) in which we are interested and get back the legacy inactive identifiers that we want to include.

We can ask hermes to expand any arbitrary SNOMED expression constraint language (ECL) expression:

e.g.

http://128.140.5.148:8080/v1/snomed/expand?ecl=%3C%3C24700007:

[
  {
    "id": 1223980016,
    "conceptId": 24700007,
    "term": "MS - Multiple sclerosis",
    "preferredTerm": "Multiple sclerosis"
  },
  ...

But we can also ask for the expansion to include historical associations:

http://128.140.5.148:8080/v1/snomed/expand?ecl=%3C%3C24700007&include-historic=true

You’ll see that our expanded codelist now also includes outdated, inactivated concepts from past versions of SNOMED. We can now use this expanded list for our data analytics.

[
  {
    "id": 27239011,
    "conceptId": 16092000,
    "term": "Cord multiple sclerosis",
    "preferredTerm": "Cord multiple sclerosis"
  },
  {
    "id": 30986014,
    "conceptId": 18353007,
    "term": "Brain stem multiple sclerosis",
    "preferredTerm": "Brain stem multiple sclerosis"
  },
  {
    "id": 1223980016,
    "conceptId": 24700007,
    "term": "MS - Multiple sclerosis",
    "preferredTerm": "Multiple sclerosis"
  },
  ...
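Putting that to work might look like the following sketch. `expand_url` and `codelist` are hypothetical helper names, and `sample_expansion` is an invented, cut-down stand-in for a real response: it assumes the historical expansion contains 155023009, which you should confirm against the live endpoint rather than take from here.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://128.140.5.148:8080/v1/snomed"


def expand_url(ecl, include_historic=False, base_url=BASE_URL):
    """Build an expansion URL, optionally asking for historical associations."""
    params = {"ecl": ecl}
    if include_historic:
        params["include-historic"] = "true"
    return f"{base_url}/expand?{urlencode(params, quote_via=quote)}"


def codelist(expansion):
    """Reduce an expansion (a list of result objects) to a set of concept ids."""
    return {item["conceptId"] for item in expansion}


# Invented stand-in for a parsed response; see the assumption noted above.
sample_expansion = [
    {"id": 1223980016, "conceptId": 24700007, "term": "MS - Multiple sclerosis"},
    {"conceptId": 155023009},
]

# Hypothetical records: (patient id, recorded diagnosis code).
records = [("p1", 24700007), ("p2", 155023009), ("p3", 73211009)]

codes = codelist(sample_expansion)
cohort = [pid for pid, code in records if code in codes]
url = expand_url("<<24700007", include_historic=True)
```

With the historical codes in the list, the patient recorded with the legacy concept is now found, and the source records themselves are left untouched.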

Conclusions

Managing real-life health and care data is complex; health informatics needs software that manages some of these complexities. Open-source tools are ideal, because they let us build a shared community.

Hermes is a library and microservice that provides some of that capability in relation to SNOMED CT and other terminologies.

You cannot ignore codes in your health and care data that are now inactive or outdated; instead, you need to think carefully about how to manage change over time.

Hermes provides a number of ways of managing those changes, including versioned distributions, methods to identify outdated concepts and map them to modern equivalents, and methods to create codelists from expressions that optionally include the historical equivalents of their members.

Mark