Author: Simon Burgess
RCT + NPD = Progress
A lot of research for education policy is focussed on evaluating the effects of a policy that has already been implemented. After all, we can only really learn from policies that have actually been tried. In the realm of UK education policy evaluation, the hot topic at the moment is the use of randomised controlled trials, or RCTs.
In this post I want to emphasise that in schools in England we are in a very strong position to run RCTs because of the existing highly developed data infrastructure. Running RCTs on top of the census data on pupils in the National Pupil Database dramatically improves their effectiveness and their cost-effectiveness. This is both an encouragement to researchers (and funders) to consider this approach, and also another example of how useful the NPD is.
A major part of the impetus for using RCTs has come from the Education Endowment Foundation (EEF). This independent charity was set up with grant money from the Department for Education, and has since raised further charitable funding. Its goal is to discover and promote “what works” in raising the educational attainment of children from disadvantaged backgrounds. I doubt that there is a body anywhere else in the world with over £100m to spend on such a specific – and important – education objective. Another driver has been the Department for Education’s recent Analytical Review, led by Ben Goldacre, which recommended that the Department engage more thoroughly with the use of RCTs in generating evidence for education policy.
It is probably worth briefly reviewing why RCTs are thought to be so helpful in this regard: it’s about estimating a causal effect. There are of course many very interesting research questions other than those involving the evaluation of causal effects. But for policy, causality is key: “when this policy was implemented, what happened as a result?” The problem is that isolating a causal effect is very difficult using observational data, principally because the people exposed to the policy are often selected in some way and it is hard to disentangle their special characteristics from the effect of the policy. The classic example to show this is a training policy: a new training programme is offered, and people sign up; later they are shown to do better than those who did not sign up; is this because of the content of the training programme … or because those signing up evidently had more ambition, drive or determination? If the former, the policy is a good one and should be widened; if the latter, it may have no effect at all, and should be abandoned.
RCTs get around this problem by randomly allocating exposure to the policy, so there can be no such ambiguity. There are other advantages too, but the principal attraction is the identification of causal effects. Of course, as with all techniques, there are problems too.
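The training example can be made concrete with a small simulation. This is only an illustrative sketch: the programme is assumed to have no effect at all, "motivation" is a made-up variable that drives both sign-up and later outcomes, and the function name `simulate` and all the numbers are hypothetical.

```python
import random
from statistics import mean

def simulate(n=20000, seed=1):
    """Compare self-selected and randomised allocation for a programme with zero true effect."""
    rng = random.Random(seed)
    selected_t, selected_c = [], []      # people choose whether to sign up
    randomised_t, randomised_c = [], []  # a coin toss decides who is treated
    for _ in range(n):
        motivation = rng.gauss(0, 1)
        # Assumption: the outcome depends on motivation and luck only,
        # so the training itself adds nothing.
        outcome = motivation + rng.gauss(0, 1)
        # Self-selection: the more motivated half sign up.
        (selected_t if motivation > 0 else selected_c).append(outcome)
        # Randomisation: allocation is unrelated to motivation.
        (randomised_t if rng.random() < 0.5 else randomised_c).append(outcome)
    naive_gap = mean(selected_t) - mean(selected_c)
    rct_gap = mean(randomised_t) - mean(randomised_c)
    return naive_gap, rct_gap

naive_gap, rct_gap = simulate()
print(f"Gap with self-selection: {naive_gap:.2f}")
print(f"Gap with randomisation:  {rct_gap:.2f}")
```

Under these assumptions the self-selected comparison shows a large gap even though the programme does nothing, while the randomised comparison sits close to zero.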
The availability of the NPD makes RCTs much more viable and valuable. It provides a census of all pupils in all years in all state schools, including data on demographic characteristics, a complete test score history, and a complete history of schools attended and neighbourhoods lived in.
This helps in at least three important ways.
First, it improves the trade-off between cost and statistical power. Statistical power refers to the likelihood of being able to detect a causal effect if one is actually in operation. You want this to be high – undertaking a long-term and expensive trial and missing the key causal effect through bad luck is not a happy outcome. Researchers typically aim for 80% or 90% power. One of the initial decisions in an RCT is how many participants to recruit. The greater the sample size, the greater the statistical power to detect any causal effects. But of course the cost is greater too, and sometimes this can be considerable. These trade-offs can be quite stark. For example, to detect an effect size of at least 0.2 standard deviations at standard significance levels with 80% power we would need a sample of 786 pupils, half of them treated. If for various reasons we were running the intervention at school level, so that whole schools rather than individual pupils are randomised, we would need over 24,000 pupils, because pupils in the same school tend to resemble one another and each adds less independent information.
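The pupil-level figure can be reproduced with the standard normal-approximation formula for comparing two means, taking "standard significance levels" to mean a two-sided 5% test (an assumption on my part). The school-level line uses the usual design-effect inflation, 1 + (m - 1) × ICC. The cohort size of 150 and the intra-class correlation of 0.2 are illustrative guesses, not values given in this post, so that figure should only be read as being of the right order.

```python
import math
from statistics import NormalDist

def pupils_per_arm(effect_size, alpha=0.05, power=0.80):
    """Unrounded pupils per arm for a two-arm trial (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_power) ** 2 / effect_size ** 2

def design_effect(pupils_per_school, icc):
    """Inflation factor when whole schools, not pupils, are randomised."""
    return 1 + (pupils_per_school - 1) * icc

def total_sample(effect_size, alpha=0.05, power=0.80, deff=1.0):
    """Total pupils across both arms, rounding each arm up."""
    return 2 * math.ceil(pupils_per_arm(effect_size, alpha, power) * deff)

print(total_sample(0.2))  # 786, the pupil-level figure quoted above

# Hypothetical school-level design: 150 pupils per school cohort, ICC of 0.2.
print(total_sample(0.2, deff=design_effect(150, 0.2)))
```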
This is where the NPD comes in. In an ideal world, we would want to be able to clone every individual in our sample and try the policy out on one and compare progress to their clone. Absent that, we can improve our estimate of the causal effect by getting as close as we can to ‘alike’ subjects. We can use the wealth of background data in the NPD to reduce observable differences and improve the precision of the estimated intervention effect. Exploiting the demographic and attainment data allows us to create pairs of observationally equivalent pupils, one of whom is treated and the other a control. This greatly reduces sampling variation and improves the precision of our estimation. This in turn means that the trade-off between cost and power improves. Returning to the previous numerical example, if we have a good set of predictors for (say) GCSE performance, we can reduce the required dataset for a pupil-level intervention from 786 pupils to just 284. Similarly for the school-cohort level intervention, we can cut back the sample from 24,600 pupils and 160 schools to 9,200 pupils and 62 schools. The size of the saving depends on the correlation between a ‘pre-test’ and the outcome (this might literally be a pre-test, or it can be a prediction from a set of variables).
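A rough way to see where the saving comes from: adjusting for a pre-test that correlates r with the outcome removes a share r² of the outcome variance, so the required sample shrinks by a factor of about (1 - r²). The sketch below applies that rule to the pupil-level case. The correlation of 0.8 is my assumption, since the post does not state the value used, although it does reproduce the 284 figure.

```python
import math
from statistics import NormalDist

def adjusted_total_sample(effect_size, pretest_corr, alpha=0.05, power=0.80):
    """Total pupils needed when the analysis adjusts for a pre-test.

    Normal approximation for a two-arm, pupil-randomised trial; the
    unadjusted requirement is scaled by (1 - r^2).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    per_arm = 2 * (z_alpha + z_power) ** 2 / effect_size ** 2
    per_arm *= 1 - pretest_corr ** 2  # variance left after adjustment
    return 2 * math.ceil(per_arm)

print(adjusted_total_sample(0.2, pretest_corr=0.0))  # 786, no useful pre-test
print(adjusted_total_sample(0.2, pretest_corr=0.8))  # 284, assumed r = 0.8
```

The school-level reduction works in the same spirit, but depends on how much variance the predictors explain both between and within schools, so it is not reproduced here.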
Second, the NPD is very useful for dealing with attrition. Researchers running RCTs typically face a big problem of participants dropping out of the study, both from the treatment arms and from the control group. Typically this is because the trial becomes too burdensome or inconvenient, rather than because of an objection in principle, since they did sign up in the first instance. This attrition can cause severe statistical problems and can jeopardise the validity of the study.
The NPD is a census and is an administrative dataset, so data on all pupils in all (state) schools are necessarily collected. This obviously includes all national Key Stage test scores, GCSEs and A levels. If the target outcome of the RCT is improving test scores, then these data will be available to the researcher for all schools. Technically this means that an ‘intention to treat’ estimator can always be calculated. (Obviously, if the school or pupil drops out and forbids the use of linked data then this is ruled out, but as noted above, most dropout is simply due to the burden.)
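To illustrate what an ‘intention to treat’ estimator does, here is a toy calculation. The records and scores are entirely made up, and the function names are hypothetical. The point is only that pupils are compared by the arm they were assigned to, whether or not they stayed in the trial, which is possible when outcomes are recorded for everyone.

```python
from statistics import mean

# Hypothetical records: (assigned arm, stayed in the trial, test score).
# Scores for pupils who dropped out are still observed in administrative data.
pupils = [
    ("treatment", True, 62), ("treatment", True, 58),
    ("treatment", False, 51), ("treatment", False, 49),
    ("control", True, 54), ("control", True, 50),
    ("control", False, 52), ("control", True, 48),
]

def itt_estimate(records):
    """Difference in mean scores by assigned arm, ignoring dropout."""
    treated = [score for arm, _, score in records if arm == "treatment"]
    control = [score for arm, _, score in records if arm == "control"]
    return mean(treated) - mean(control)

def completers_only_estimate(records):
    """Same comparison restricted to pupils who stayed in the trial."""
    stayed = [r for r in records if r[1]]
    return itt_estimate(stayed)

print(itt_estimate(pupils))              # 4
print(completers_only_estimate(pupils))  # larger: the dropouts were not typical
```

In this invented example the pupils who left the treatment arm had lower scores than those who stayed, so looking only at completers overstates the effect.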
Finally, the whole system of testing from which the NPD harvests data is also helpful. It embodies routine and expected tests so there is less chance of specific tests prompting specific answers. Although a lot about trials in schools cannot be ‘blind’ in the traditional way, these tests are blind. They are also nationally set and remotely marked, all of which adds to the validity of the study. These do not necessarily cover all the outcomes of interest such as wellbeing or health or very specific knowledge, but they do cover the key goal of raising attainment.
In summary, relative to other fields, education researchers have a major head start in running RCTs because of the strength, depth and coverage of the administrative data available.
Author: Michael Sanders
Arguing about funding obscures important issues of quality research
Richard Thaler, the Chicago professor of economics and incoming president of the American Economic Association, has as one of his many mantras the truism that “we can’t do evidence based policy without evidence”. The government’s recent decision to establish a number of “What Works Centres” to collate, analyse and, in some cases, produce evidence on a number of policy areas seeks to address the very problem of a lack of evidence.
Evidence itself, however, is not in short supply. Newspapers fill their pages, day after day, with the results of studies into some facet of human behaviour, or statistics on the state of the world. So, there need to be two criteria for evidence beyond mere ‘existence’ – goodness, and usability. I should be clear at the outset that when I say “Good”, I mean “Capable of determining a causal relationship between an input and an output”. Sadly, not all evidence which is good is usable, and often tragically, not all evidence that is usable is good.
As Ben Goldacre points out in his recent paper for the Department for Education, many researchers in that field like qualitative work, and use this as the basis for their findings. As an economist, I have a natural scepticism for such research, but I cannot dispute that it is eminently usable. The arguments constructed by such research are easily and well presented. They offer solutions which are simple, and neat. However, as H.L. Mencken said, these arguments are almost always also wrong. This research is usable but very much of it is not good.
On the other side, much research which is good, and detailed, and thorough, presents complicated and nuanced answers which reflect reality but whose methods are impenetrable to anyone who might actually have the power to change policy accordingly.
Randomised Controlled Trials (RCTs) are both. They are usable, with the majority of results presentable in an easily understood way and the methodology being simple enough to explain to a lay person in about five minutes. As the recognised ‘gold standard’ of evaluation, they are also indisputably good.
In a blog post for the LSE impact blog, Neil Harris, a researcher at Bristol’s Centre for Causal Analysis in Translation Epidemiology, argues that education research is a public good and needs to be funded by the state because, unlike in medicine, there is no money to be made by researchers through patent development. He is, of course, absolutely right. The structure of his argument implies, however, that in order to get good evidence, it will need to be paid for – i.e. that RCTs are expensive, while qualitative research is cheap. If the government wants better education research, they should give researchers more money. But, well, we would say that, wouldn’t we?
The argument that RCTs are expensive is a well-worn one, but is not helpful, and often dangerously distracting. Saying that an RCT is expensive is akin to saying “Vehicles are expensive”. If one chooses to put up Ferraris as an example, then of course they are. A scooter, however, is not. Both are better than walking.
A good quality, robust RCT need not be outlandishly expensive, and certainly not any more so than qualitative analysis. Unlike medical trials, the marginal cost of interventions in policy is often not far above that of treatment as usual (the most logical control condition). Teaching phonics in 50 schools and not in 50 others should not require vast resources once allocation has taken place. Although the government does not spend as much money on policy research as it does on medicine, it spends a lot of money gathering data on the outcomes many researchers are interested in. At the end of a child’s GCSEs, finding out how well they did does not require specialist staff to draw their blood and perform expensive tests on them. The school knows the answer.
It is important not to downplay the risks or costs associated with RCTs, but nor should these costs be presented as a reason for conducting, or accepting, substandard research. As researchers, if our work is of low quality, there is only so far the buck can be passed.
If there is a phrase more likely to attract rolled eyes than “Behavioural Economics”, it is probably “Evidence based policy” – one has in the past been derided for not really being economics, and the other for being more apt with the first and third words reversed. Supporters of both have often despaired. There is now, however, a glimmer of hope.
The Cabinet Office’s Behavioural Insights Team, charged by the coalition with bringing behavioural science to policymaking, this week launched a paper with the title “Test, Learn, Adapt: Developing Public Policy with Randomised Controlled Trials”, which is part manifesto for randomised controlled trials (RCTs), and part handbook on how to conduct them. Although not comprehensive, its nine steps offer a simple guide to the basics of designing and running a trial.
Co-authored by Ben Goldacre of the Guardian, and Professor David Torgerson, director of the University of York’s Trial Unit, the paper gives every indication that government is, or at the least is trying to be, committed to the idea of running trials wherever possible, often with the help of the academic community – which will hopefully lead to a better understanding of which policies are beneficial and which are not. Although there are other, more comprehensive and technical guides to trial design, this is an important step for government.
The second step may be even more difficult – making sure that successful interventions are rolled out more widely when proved to be beneficial. For example, an RCT by the University of Cambridge’s experimental criminologists has provided the first robust evidence of the benefits of ‘hotspot’ policing in the UK. Another, by Robert Metcalfe of the University of Oxford, provides clear implications about how to reduce household energy consumption in both the short and the long term. Both of these trials, although excellent, will be of very limited practical use if their findings are not used to inform future policy – this will require continued work both by policymakers, and by those running the trials.
Haynes et al (2012): “Test, Learn, Adapt: Developing Public Policy with Randomised Controlled Trials”, Cabinet Office, available from: http://www.cabinetoffice.gov.uk/sites/default/files/resources/TLA-1906126.pdf
Angrist & Pischke (2008): “Mostly Harmless Econometrics”, Princeton University Press
Gerber & Green (2012): “Field Experiments: Design, Analysis and Interpretation”, W. W. Norton & Company
Presented by Barak Ariel at “Beyond Nudge” conference at the British Academy, June 2012
Presented by Robert Metcalfe at “Beyond Nudge” conference at the British Academy, June 2012