Ethnicity mismatches in administrative data linkages


Dr Alice Wickersham is a Research Associate at the Institute of Psychiatry, Psychology & Neuroscience (IoPPN), King’s College London, and Academic Lead for the CAMHS Digital Lab, focusing on clinical and population analytics. Her PhD was funded by the NIHR Maudsley BRC and investigated depression and school performance using linked data from the Department for Education and the Clinical Record Interactive Search (CRIS) system.

In this blog she reflects on the journey of exploring the two datasets, how she discovered a mismatch in their ethnicity variables, and what this means for research in this area. With other authors from the IoPPN and South London and Maudsley NHS Foundation Trust, she has published a paper on this subject in BMJ Open.

A promising start

It was 2019 and the beginning of my PhD. I’d just got my hands on the dataset that I’d be using for the next three years. It was an amazing dataset: 276,655 individuals and hundreds of variables, bringing together administrative data from both school records (supplied by the Department for Education) and child and adolescent mental health services (supplied by South London and Maudsley NHS Foundation Trust through the Clinical Record Interactive Search).

My research would be focusing on the relationship between child and adolescent depression and educational attainment, including the roles that sociodemographic characteristics might play. Compared to previous studies I’d worked on, limited to smaller samples and fewer variables, having such a large and rich dataset was bliss. And some variables I even had two of – like ethnicity. One version taken from school records, the other taken from mental health records. The height of luxury. All I had to do was look at them, and pick the ‘best’ version of the variable!

An unexpected diversion

…Hang on. A problem. The two ethnicity variables didn’t match. There were people with one ethnic group recorded in their school records, and a different ethnic group recorded in their health records. For example, of the pupils recorded as Mixed ethnicity in their school records, only half were recorded as Mixed ethnicity in their mental health records. Of the remaining half, most were recorded as either Black or White ethnicity in their mental health records. This complicated things – which version of the variable should I use in my research?
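For readers who want to run this kind of check on their own linked data, here is a minimal sketch in Python. The records, the function names and the four broad ethnic groups are all invented for illustration; they are not the study data or the method used in the paper. The idea is simply to cross-tabulate one source against the other and see how often the labels agree within each group.

```python
from collections import Counter, defaultdict

# Hypothetical toy records, invented for illustration (not the study data).
# Each pair: (ethnicity in school records, ethnicity in health records).
records = [
    ("Mixed", "Mixed"),
    ("Mixed", "Black"),
    ("Mixed", "White"),
    ("Mixed", "Mixed"),
    ("White", "White"),
    ("White", "White"),
    ("Black", "Black"),
    ("Asian", "Asian"),
]


def cross_tabulate(pairs):
    """Count health-record ethnicities within each school-record ethnicity."""
    table = defaultdict(Counter)
    for school, health in pairs:
        table[school][health] += 1
    return table


def agreement_by_group(table):
    """Share of each school-record group given the same label in health records."""
    return {
        school: counts[school] / sum(counts.values())
        for school, counts in table.items()
    }


table = cross_tabulate(records)
agreement = agreement_by_group(table)
for school, share in sorted(agreement.items()):
    print(f"{school}: {share:.0%} recorded with the same ethnicity in health records")
```

Reading across a row of `table` shows where the disagreeing records went, which is often more informative than a single overall agreement figure.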

I considered the focus of my PhD – educational attainment. It dawned on me that, for the purposes of my research at least, the discrepancies in these ethnicity variables only posed a problem if they materially impacted on my findings. Did the association between ethnicity and educational attainment vary depending on which ethnicity variable I used?

One quick analysis later and I had my answer: it did. Compared with White pupils, a significantly higher proportion of Asian pupils achieved 5 A*-C grades at GCSE, but only if ethnicity was coded from school records. If ethnicity was coded from mental health records, this association was not statistically significant. Similar discrepancies emerged when I looked at the association between ethnicity and neurodevelopmental disorder diagnoses (which include intellectual disability, autism, and ADHD).
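A sensitivity check of this kind can be sketched in a few lines of Python. Everything below is an assumption made for illustration: the toy data are invented, the names are hypothetical, and a simple two-proportion z statistic stands in for whatever model a real analysis would use (the published paper may well have used a different approach). The point is only to run the same comparison twice, once per ethnicity source, and put the results side by side.

```python
import math

# Hypothetical toy data, invented for illustration (not the study data).
# Each tuple: (ethnicity in school records, ethnicity in health records,
#              whether the pupil reached the attainment threshold).
pupils = (
    [("Asian", "Asian", True)] * 6
    + [("Asian", "Asian", False)] * 2
    + [("Asian", "White", True)] * 2
    + [("White", "White", True)] * 5
    + [("White", "White", False)] * 5
    + [("White", "Asian", False)] * 2
)

# Position of each ethnicity source within a pupil tuple.
SOURCE_INDEX = {"school": 0, "health": 1}


def pass_rate(rows, source, group):
    """Proportion reaching the threshold, and group size, under one coding."""
    idx = SOURCE_INDEX[source]
    outcomes = [row[2] for row in rows if row[idx] == group]
    return sum(outcomes) / len(outcomes), len(outcomes)


def compare_groups(rows, source, group="Asian", reference="White"):
    """Difference in proportions and a pooled two-proportion z statistic."""
    p1, n1 = pass_rate(rows, source, group)
    p2, n2 = pass_rate(rows, source, reference)
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return {"difference": p1 - p2, "z": (p1 - p2) / se}


results = {source: compare_groups(pupils, source) for source in SOURCE_INDEX}
for source, stats in results.items():
    print(f"{source}: difference = {stats['difference']:.2f}, z = {stats['z']:.2f}")
```

In this toy example only a handful of discordant records are enough to shrink the gap between groups when the coding source changes. With real data the groups would be far larger and a regression model adjusting for other characteristics would usually be more appropriate than a raw z statistic.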

I was starting to feel like I’d opened a can of worms. If these discrepancies in ethnicity could lead to different findings, I would have to find a way of choosing which ethnicity variable to rely on for my research. But how could I possibly work out which ethnicity variable was ‘better’? And what does better mean in this context?

Detective work begins

A possible answer might lie in the origins of these variables. How were they collected? I learned that, in school records, ethnicity is chosen by the pupil or their parent, although historically this was not always the case – some of our data were collected before the 2010s, when schools were allowed to choose ethnicity on behalf of the pupil. And the process could be similarly hazy in clinical settings, where recorded ethnicity is intended to be self-ascribed by the patient, but in practice, might be chosen on their behalf by a clinician or another staff member.

On top of this, recording systems in different organisations often offer slightly different ethnic groups for individuals to choose from. The ethnicity that individuals most closely identify with might therefore not always be available, forcing them to select slightly different ethnicities in different services.

Individuals may also have other reasons for identifying as different ethnicities in different settings. Research has shown that the ethnicities which individuals identify with or report can be fluid and context-dependent, and can even change over time. Which raises the further question – what is ethnicity? We sometimes associate it with skin colour, but being socially constructed, it can also be informed by other shared characteristics like geography, ancestry, culture and language.

In a paper for BMJ Open, we further explore the discrepancies in our ethnicity data, discuss possible explanations, and suggest strategies for researchers using these data. But the capturing of ethnicity in administrative data presents a rich and remarkable area of study, worthy of a PhD or larger research project in its own right.

A way forward

So, for researchers examining ethnic inequalities, how can we decide between multiple conflicting sources of ethnicity data? Talking to people from different ethnic backgrounds is crucial and can help with making such decisions; when I identified similar issues affecting another administrative data linkage, talking to the NIHR Maudsley’s READ Group informed how I should report my findings relating to discrepancies in ethnicity data.

Making analytical decisions about ethnicity might also get easier in the future. The Race Disparity Unit recently released guidance on collecting and reporting ethnicity data, which could hopefully lead to a more standardised approach across services and researchers going forward.

Meanwhile, if you’re conducting research and have multiple versions of the same variable to choose from – whether it’s ethnicity, gender, or any other sociodemographic data – it’s worth doing some detective work of your own, and checking how the different versions might impact on your results. It could make more of a difference than you expect. And, as individuals, it’s also interesting to reflect on how we want our identities and experiences to be represented in the data collected about us.





By NIHR Maudsley BRC at 6 Mar 2024, 09:13 AM
