Appendix A — Linkage process

This section provides an overview of the linkage process (see figure under ECHILD database structure) used to create the ECHILD database.

A.1 Data sources

A.1.1 The Department for Education’s National Pupil Database

The Department for Education (DfE) collates and manages data on school children in England in a resource known as the National Pupil Database (NPD) (Jay, Mc Grath-Lone and Gilbert, 2019). For the purposes of the ECHILD linkage, salient points are described below.

Within NPD, each record is associated with natural identifiers (forename, surname, gender, date of birth, postcode) relating to the pupil’s details as known by the submitting organisation at the time the data were submitted. DfE additionally assigns each record in NPD an anonymised Pupil Matching Reference (aPMR). An aPMR is an identifier which is not in itself meaningful: it does not reveal the identity of the pupil. aPMRs are assigned to all records in NPD such that one aPMR should represent one pupil and each pupil should have only one aPMR. This allows records for each pupil to be identified across NPD, both between different datasets and over time, without revealing their identity. As a result, NPD contains a longitudinal record of pupils’ names, addresses, and (potentially) genders over time.

A.1.2 NHS England’s Personal Demographics Service and Master Person Service

NHS England operates the Personal Demographics Service (PDS), a national electronic database of demographic data for patients accessing care in England or services funded by the NHS in England. PDS contains natural identifiers (forename, surname, gender, date of birth, postcode) and NHS number. Each person is assigned a distinct NHS number at birth if born in England, or at the first time of accessing NHS services in England if not otherwise registered. PDS holds a longitudinal record of name and address changes made to NHS services in England over time. This means there may be many records for each person in PDS but all records for a person should be assigned the same NHS number (with some exceptions, see (NHS Primary Care Support England, 2023)).

The Master Person Service (MPS), managed by NHS England (NHS England, 2023l), takes a record of natural identifiers and attempts to link this to a PDS record allowing for some errors and missingness in the recording of the natural identifiers. If this fails and the natural identifiers are sufficiently complete, MPS attempts to link records against a secondary store (MPS bucket) of natural identifiers of persons who previously had contact with the NHS in England and do not have an NHS number, these persons are assigned an “MPS ID”.

The MPS returns a “Person ID” using either:

  1. NHS number, if a valid link is found in PDS; or,
  2. MPS ID if no link is found in PDS but a valid link is found in the MPS bucket; or,
  3. No value if no link is found in either PDS or the MPS bucket.

The Person ID is then encrypted to generate a Token Person ID (TPI), which is not meaningful and does not reveal the person’s identity.

Person IDs, enabling the assignment of Token Person IDs, are also recorded throughout NHS England’s standard data collections, including Hospital Episode Statistics datasets, Emergency Care Datasets, Mental Health Services Datasets, and Community Services Datasets. The “Person ID” is the only routine means of identifying records belonging to the same patient amongst data held by NHS England.

A.2 Linking DfE NPD aPMRs to NHS England TPIs

For the purposes of the ECHILD linkage, the following simplifying assumptions were made:

  1. Each aPMR represents at most one real person within NPD;
  2. Each TPI represents precisely one real person within NHS England data collections;
  3. Each real person represented within NHS England data collections has precisely one TPI.

Essentially, whilst we assume TPIs are perfectly allocated, we only require that aPMRs are not shared. That is, the same real person is permitted to have more than one aPMR.

This linkage task resulted in “N-to-one” links between aPMRs and TPIs: each aPMR linked to at most one TPI but a TPI may have linked to more than one aPMR. However, the vast majority of links made were “1-to-1” (99.4% of linked TPIs linked to only one aPMR).

DfE supplied a “linkage dataset” to NHS England, comprising the natural identifiers (forename, surname, gender, date of birth, postcode) and an aPMR for each record in its NPD.

A.2.2 Linkage Stage 2: MPS Trace

Only records belonging to aPMRs not linked after Linkage Stage 1 were submitted to MPS. Again, an aPMR was considered linked only if all of its associated linkage records were linked to at most one TPI and at least one record was linked to a TPI.

A.2.3 Linkage results

DfE provided 430M records, comprising sets of identifiers, covering 22.8M aPMRs. NHS England managed to link 22.6M (99.1%) aPMRs to a TPI. 20.1M (88.2%) aPMRs were linked at Stage 1, Part A. A further 1.3M (5.7%) aPMRs were linked at Stage 1, Part B. Stage 2 resulted in the linkage of a further 1.2M (5.3%) aPMRs. Less than 200,000 aPMRs remained unlinked at the end of Stage 2: ~160,000 were not resolvable to any TPI; ~30,000 resolved to more than one TPI and (due to the simplifying assumptions) these were considered invalid and so no link was recorded for any of these aPMRs.

A.2.4 Application of NHS National Data Opt-Outs

The ECHILD Research Database team wished to enable potential participants to opt-out from their data being held within the ECHILD Research Database. Our data suppliers indicated that the only means to (partially) operationalise this was through the non-provision of data held by NHS England relating to participants with a current (at the date of data preparation) NHS National Data Opt-Out (NDOO). This included removing any indication of an identified link between the aPMRs and TPIs for participants with a NDOO. It was, however, not possible to exclude the DfE supplied data relating to these participants. 1.3M (5.8%) of the 22.6M aPMRs for which NHS England had identified a link to a single TPI related to individuals with a registered NDOO as at the date of generation of the data extract.

A.2.5 Linkage Outputs: Pseudonymised bridging file

NHS England produced a pseudonymised bridging file consisting of all aPMRs in the DfE-supplied linkage dataset and their linked TPI (excluding those removed due to the presence of a NHS National Data Opt-Out). All aPMRs that were not linked after Linkage Stage 2, or which related to a participant with a NHS National Data Opt-Out, were omitted from this file.