Appendix B — Linkage process
This section provides an overview of the linkage process (Figure 2.2) used to create the ECHILD database.
B.1 Data sources
B.1.1 The Department for Education’s National Pupil Database
The Department for Education (DfE) collates and manages data on school children in England in a resource known as the National Pupil Database (NPD) (Jay, Mc Grath-Lone and Gilbert, 2019). For the purposes of the ECHILD linkage, salient points are described below.
Within NPD, each record is associated with natural identifiers (forename, surname, gender, date of birth, postcode) relating to the pupil’s details as known by the submitting organisation at the time the data were submitted. DfE additionally assigns each record in NPD an anonymised Pupil Matching Reference (aPMR). An aPMR is an identifier which is not in itself meaningful: it does not reveal the identity of the pupil. aPMRs are assigned to all records in NPD such that one aPMR should represent one pupil and each pupil should have only one aPMR. This allows records for each pupil to be identified across NPD, both between different datasets and over time, without revealing their identity. As a result, NPD contains a longitudinal record of pupils’ names, addresses, and (potentially) genders over time.
B.1.2 NHS England’s Personal Demographics Service and Master Person Service
NHS England operates the Personal Demographics Service (PDS), a national electronic database of demographic data for patients accessing care in England or services funded by the NHS in England. PDS contains natural identifiers (forename, surname, gender, date of birth, postcode) and NHS number. Each person is assigned a distinct NHS number at birth if born in England, or at the first time of accessing NHS services in England if not otherwise registered. PDS holds a longitudinal record of name and address changes made to NHS services in England over time. This means there may be many records for each person in PDS but all records for a person should be assigned the same NHS number (with some exceptions, see (NHS Primary Care Support England, 2023)).
The Master Person Service (MPS), managed by NHS England (NHS England, 2023l), takes a record of natural identifiers and attempts to match this to a PDS record allowing for some errors and missingness in the recording of the natural identifiers. If this fails and the natural identifiers are sufficiently complete, MPS attempts to match records against a secondary store (MPS bucket) of natural identifiers of persons who previously had contact with the NHS in England and do not have an NHS number, these persons are assigned an ‘MPS ID’.
The MPS returns a “Person ID” using either:
- NHS number, if a valid match is found in PDS; or,
- MPS ID if no match is found in PDS but a valid match is found in the MPS bucket; or,
- No value if no match is found in either PDS or the MPS bucket.
The Person ID is then encrypted to generate a Token Person ID (TPI), which is not meaningful and does not reveal the person’s identity.
Person IDs, enabling the assignment of Token Person IDs, are also recorded throughout NHS England’s standard data collections, including Hospital Episode Statistics datasets, Emergency Care Datasets, Mental Health Services Datasets, and Community Services Datasets. The ‘Person ID’ is the only routine means of identifying records belonging to the same patient amongst data held by NHS England.
B.2 Linking DfE NPD aPMRs to NHS England TPIs
For the purposes of the ECHILD linkage, the following simplifying assumptions were made:
- Each aPMR represents precisely one “real” person within NPD;
- Each TPI represents precisely one “real” person within NHS England data collections;
- Each “real” person represented within NHS England data collections has precisely one TPI.
Essentially, whilst we assume TPIs are perfectly allocated, we only require that aPMRs are not shared. That is, the same “real” person is permitted to have more than one aPMR.
This linkage task resulted in ‘N-to-one’ links between aPMRS and TPIs: linking each aPMR to a single TPI but, a TPI may be linked to more than one aPMR. However, the vast majority of links made were ‘1-to-1’.
DfE supplied a ‘linkage dataset’ to NHS England, comprising the natural identifiers (forename, surname, gender, date of birth, postcode) and an aPMR for each record in its NPD.
B.2.1 Linkage Stage 1: Exact link
An initial, simple, linkage stage was used to avoid over-burdening the more resource-intensive MPS trace. Each valid record (e.g., no blank entries) in the linkage dataset was compared to all records in a prepared extract from PDS. A record was deemed “linked” if compared records matched on the following criteria:
- First four characters of forename;
- Full surname;
- Full date of birth;
- Full postcode; and,
- Gender.
Further, an aPMR was considered ‘linked’ if all of its associated linkage records were linked to at most one TPI and at least one record was ‘linked’ to a TPI.
B.2.2 Linkage Stage 2: MPS Trace
Records with aPMRs that were not ‘linked’ in Linkage Stage 1 were submitted to MPS. Again, an aPMR was considered ‘linked’ only if all of its associated linkage records were linked to at most one TPI and at least one record was ‘linked’ to a TPI.
B.2.3 Application of NHS National Data Opt-Outs
The ECHILD Research Database team wished to enable potential participants to opt-out from their data being held within the ECHILD Research Database. Our data suppliers indicated that the only means to (partially) operationalise this was through the non-provision of data held by NHS England relating to participants with a current (at the date of data preparation) NHS National Data Opt-Out (NDOO). This included removing any indication of an identifed ‘link’ between the aPMRs and TPIs for participants with a NDOO. It was, however, not possible to exclude the DfE supplied data relating to these participants.
B.2.4 Linkage Outputs: Pseudonymised bridging file
NHS England produced a pseudonymised bridging file consisting of all aPMRs in the DfE-supplied linkage dataset and their linked TPI (excluding those removed due to the presence of a NHS National Data Opt-Out). All aPMRs that were not matched after Linkage Stage 2, or which related to a participant with a NHS National Data Opt-Out, were omitted from this file.