Author: Simon Burgess
Threshold measures in school accountability: asking the right question
We are in the midst of a significant upheaval in the setting and marking of exams, and the reporting of school exam results. One feature of the system has been the centre of a lot of criticism and highlighted for reform: the focus on the percentage of a school’s pupils that achieve at least 5 GCSEs at grades C to A*, including the scores on English and maths. This is typically the most-discussed metric for (secondary) school performance and is the headline figure in the school league tables.
The point is that this measure is based on a threshold, a ‘cliff-edge’. Get a grade C and you boost the school’s performance; missing a C by a lot or by a little counts the same, and just scraping a C counts the same as getting an A*.
This has been described as distorting schools’ behaviour, forcing schools to focus on pupils around this borderline. The argument is seen as obviously right and strong grounds for change. In this post I want to make two counter-arguments, and to suggest we are asking the wrong question.
First a basic point. One central goal of any performance measure is to induce greater or better-targeted effort. This might just mean “working harder” or it might mean a stronger focus on the goals embodied in the measure at the expense of other outcomes. The key for the principal is to design the best scheme to achieve this. A very common scheme is a threshold one – this can be found for example in the Quality and Outcomes Framework for GPs, service organisations with a target number of clients to see, and of course schools trying to help pupils to achieve at least 5 grades of C or better. An organisation working under a threshold scheme faces very different marginal incentives for effort. Considering pupils: the most intense incentives relate to pupils just below the line: this is where the greatest payoff is to schools to devote the most resources.
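The difference in marginal incentives can be made concrete with a small sketch. The Python below is illustrative only: the three pupils, their subjects and the simple ordering of grades are invented for the example, and the headline measure is simplified to "at least five grades at C or above, including English and maths". It computes how far the school's headline figure moves when each pupil in turn gains one grade in one subject.

```python
# Illustrative ordering of GCSE letter grades (assumption: only the order matters here).
GRADES = ["U", "G", "F", "E", "D", "C", "B", "A", "A*"]
RANK = {g: i for i, g in enumerate(GRADES)}


def meets_threshold(results):
    """True if a pupil has at least 5 grades at C or above, including English and maths."""
    good = {subject for subject, grade in results.items() if RANK[grade] >= RANK["C"]}
    return len(good) >= 5 and {"English", "Maths"} <= good


def headline(pupils):
    """Percentage of pupils meeting the threshold."""
    return 100.0 * sum(meets_threshold(p) for p in pupils) / len(pupils)


def raise_one_grade(results, subject):
    """Copy of a pupil's results with one subject improved by a single grade."""
    improved = dict(results)
    improved[subject] = GRADES[min(RANK[results[subject]] + 1, len(GRADES) - 1)]
    return improved


def all_at(grade):
    """Hypothetical pupil with the same grade in five subjects."""
    return {s: grade for s in ["English", "Maths", "Science", "History", "Art"]}


# Hypothetical pupils: far below the line, just below it, and well above it.
pupils = [
    all_at("E"),
    {**all_at("C"), "Art": "D"},
    all_at("A"),
]

base = headline(pupils)
gains = []
for i, pupil in enumerate(pupils):
    changed = pupils[:i] + [raise_one_grade(pupil, "Art")] + pupils[i + 1:]
    gains.append(headline(changed) - base)

print(gains)
```

In this toy school only the borderline pupil moves the measure. The same one-grade improvement for the other two pupils leaves it unchanged, which is the ‘cliff-edge’ at work.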
The first counter-argument starts by noting that the asymmetry in the incentive is not a newly-discovered flaw; it is a design feature, and one that can be very powerful. If there is a level of achievement that is extremely important for everyone to reach, then it makes sense to set up a scheme that offers very strong incentives to do that – that focusses the incentive around that minimum level. This is precisely what a threshold scheme does.
So rather than simply pointing out that threshold designs strongly focus attention (which is what they’re supposed to do), the questions to ask are: is there some level of attainment that has that characteristic of being a minimum level of competence? And if so, what is it? If society feels that 5 grade Cs is a fair approximation to a minimum level that we want everyone to achieve, then it is absolutely right to have a ‘cliff-edge’ there because inducing schools to work very hard to get pupils past that level is exactly what society wants. It may be that we are equally happy to see grades increase for the very brightest children, those in the middle or those at the lower end of the ability distribution. Or not: all the main political parties express a desire to raise attainment at the lower end and narrow gaps.
The argument should be about where to put the threshold, not whether to have one or not. Perhaps we are starting to see a recognition of this in the recent policy announcement that all pupils will have to continue studying until they have passed English and Maths.
The second counter-argument is based on a scepticism of what is likely to happen without the 5A*-C(EM) threshold acting as a focal point.
The core strategic decision facing a headteacher is how best to deploy her main resource: the teachers. Specifically: how best to assign teachers of varying effectiveness to different classes. It has been said that schools will be free to focus equally on all pupils.
Well, maybe. Or perhaps we should think of the pressures on the headteacher, in this instance from teachers themselves. Effective teachers are very valuable to a school and any headteacher will be keen to keep her most effective teachers happy and loyal. It seems likely (I have no evidence on this, and would be keen to hear of any) that top teachers would typically prefer to teach top sets. If so, we might see a drift of the more effective teachers towards the more able classes in a school (and therefore on average, the more affluent pupils). The imperative of the C/D threshold gave headteachers an unanswerable argument to push against this.
So threshold metrics have an important role to play in communicating to schools where society wants them to focus their effort. The current threshold, at 5 C grades, may or may not be at the right level; but discussing what the right level is, is a more useful debate to have.
Tomorrow the new school league tables are published, with the usual blitz of interest in the rise and fall of individual schools. The arguments for and against the publication of these tables are now so familiar as to excite little interest.
But this year there is a significant change in the content of the tables. For the first time, GCSE results for each school will be reported for groups of pupils within the school, groups defined by their Key Stage 2 (KS2) scores. Specifically, for each school the tables will report the percentage of pupils attaining at least 5 A* – C grades (including English and maths) separately for low-attaining pupils, high-attaining pupils and a middle group. This change has potentially far-reaching implications, which we describe below.
This is a change for the better, one that we have proposed and supported elsewhere. Why? We believe that in order to support parents choosing a school, league tables need to be functional, relevant and comprehensible. The last of these is straightforward (though not all league table measures in the past have been comprehensible: Contextualised Value-Added (CVA) being the perfect example). ‘Relevant’ means that a measure has some relevance to the family’s specific child. A simple school average, such as the standard whole-cohort % 5 A* to C, is not very informative about how one specific pupil is likely to get on there. By ‘functional’ we mean a measure that does actually help a family to predict the likely GCSE attainment of their child in different schools. If a measure is not functional it should not be published at all.
The new group-specific component is comprehensible and is more relevant than the whole-cohort %5 A* to C measure. In our analysis of functionality, we show that it is as good as the standard measure, and much better than CVA.
It also addresses in a very straightforward way the critique of the standard league tables that they simply reflect the ability of the intake into schools, and not the effectiveness of the school. By reporting the attainment of specific groups of students of given ability, this measure automatically corrects for prior attainment, and in a very transparent way. This is therefore much more informative to parents about the likely outcome for their own children than a simple average. This of course is what value-added measures are meant to do, but they have never really become popular, and as we show they are not very functional.
However, the details of the new measure now published are problematic in one way. The choice of groups is important. We defined groups by quite narrow ten-percentile bands, the low attaining group lying between the 20th and 30th percentiles in the KS2 distribution, the high attaining group between the 70th and 80th percentiles, and the middle group between the 45th and 55th percentiles. While clearly there is still variation in student ability within each band, it is second order and the main differences between schools in performance for any group will come from variation in schools’ teaching effectiveness.
However, the DfE has chosen much broader bands, and has defined the groups so that they cover the entire pupil population: the low attaining group are students below the expected level (Level 4) in the KS2 tests; the middle attaining are those at the expected level, and the high attaining group comprises students above the expected level.
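The two grouping rules can be written down side by side. This is a sketch only: the function names are invented, a pupil's KS2 position is assumed to be available both as a percentile from 0 to 100 and as a whole-number level, and the treatment of band edges (lower edge included, upper edge excluded) is an assumption.

```python
# Narrow bands as described in the text: (lower percentile, upper percentile).
NARROW_BANDS = {"low": (20, 30), "middle": (45, 55), "high": (70, 80)}
EXPECTED_LEVEL = 4  # expected KS2 level, as stated in the text


def narrow_group(ks2_percentile):
    """Group under the narrow-band rule, or None if the pupil falls between bands."""
    for name, (lower, upper) in NARROW_BANDS.items():
        if lower <= ks2_percentile < upper:  # edge handling is an assumption
            return name
    return None


def broad_group(ks2_level):
    """Group under the broad rule: below, at, or above the expected level."""
    if ks2_level < EXPECTED_LEVEL:
        return "low"
    if ks2_level == EXPECTED_LEVEL:
        return "middle"
    return "high"
```

Under the narrow rule most pupils belong to no group at all, since the three bands together span only thirty percentile points, whereas the broad rule assigns every pupil to a group.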
This has one significant disadvantage, set out in detail by Rebecca Allen here. The middle group contains around 45% of all pupils, and so there is very significant variation in average ability within that group across schools. This in turn means that differences in league table performance between schools will reflect differences in intake as well as effectiveness, even within the group, thus partly undermining the aim of group-specific reports.
The chart below illustrates this for the middle attainment group (see here for more details). Each of the three thousand or so tiny blue dots shows the capped GCSE attainment for a group of mid-attaining pupils (on the DfE’s measure of achieving at the expected level at KS2) against the average KS2 score (i.e. prior attainment) of pupils at the school. The red dots plot the same relationship for our narrow group of middle attainers (the 45th to the 55th percentile). The chart shows very clearly that the performance among our narrow band is essentially unrelated to prior attainment, but the DfE measure for the very broad group does still favour schools with higher prior ability pupils.
We can speculate as to why the DfE chose to have much broader groups. There may be statistical reasons, pragmatic reasons or what can be termed “look and feel” reasons. Using narrow KS2 bands will correctly identify the effectiveness of the school, but will almost always be averaging over a small number of students. So the estimates will tend to be “noisy”, and induce more variation from year to year than averaging over bigger groups. The trade-off here is then between a noisy measure of something very useful against a more stable measure of something less useful. Our original measure was intended to balance these; the DfE has gone all the way to the latter.
A pragmatic reason is that some schools may not have any pupils in a particular narrow percentile band of the KS2 distribution. The narrower the band the more likely this is to be true. This would mean either null entries in the league tables, which might be confusing, or some complex statistical imputation procedure, which might be more confusing. The broad groups that cover the entire pupil population are likely to have very few null entries. Finally, the broad groups feel more ‘inclusive’: they report the performance of all of a school’s students. This is a red herring – the point of the tables is to inform parents in choosing a school, not to generate warm glows.
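The statistical side of this trade-off can be sketched with the textbook standard error of a proportion, sqrt(p(1 - p)/n). The group sizes and the 50% pass rate below are invented for illustration, and the formula assumes pupils' outcomes are independent of one another, which is a simplification.

```python
import math


def standard_error_pp(pass_rate, n_pupils):
    """Approximate standard error, in percentage points, of a group's pass rate."""
    return 100.0 * math.sqrt(pass_rate * (1.0 - pass_rate) / n_pupils)


# Hypothetical group sizes within one school: a narrow band versus a broad band.
narrow_se = standard_error_pp(0.5, 15)
broad_se = standard_error_pp(0.5, 80)
print(round(narrow_se, 1), round(broad_se, 1))
```

With these made-up numbers the narrow band's figure is more than twice as noisy as the broad band's, which is the price paid for comparing like with like.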
The new measures hold out the promise of improvements in two areas: choices by parents and behaviour by schools. Parents will have better information on the likely academic attainment of their child in a range of schools. Second, parents will be able to see more directly whether school choice actually matters a great deal for them: whether there are worthwhile differences in attainment within the ability group of their child.
The key point for schools is that performance measures have consequences for behaviour. If this new measure is widely used, it will give schools more of an incentive to focus across the ability distribution. It is still the %5 A* – C measure that is the focus of attention for each group, but now schools will have to pay attention to improving this metric for high and low ability groups as well as simply the marginal children with the highest chance of getting that crucial fifth C grade.
If one believes that gaming and focussing of resources within schools is a very big deal (and there is little quantitative evidence either way) then the new measures could have a major impact on such behaviour. Even if such resource focussing is second order, performance measures send signals on what is valued. These new league table measures will explicitly draw widespread media and public attention to the performance of low- and high-ability children in every school in England.
Yesterday the Government published its response to the Wolf Review on Vocational Education. The Response sets out a number of proposals, accepting all of the Review’s recommendations. These include the eye-catching scheme to ensure that young people who do not achieve C grade in English and maths at age 16 continue studying them to age 19.
The response also proposes reforms to school performance tables. This is based on a recognition that schools’ behaviour in selecting qualifications for their students is strongly influenced by the incentive structure they face. A crucial component of this structure is the published school performance tables. These tables are important in influencing parental choice of school, and school leadership teams pay them a lot of attention.
From this year, the content of the performance tables will change quite significantly. The long-standing measure of the percentage of students achieving at least 5 A* to C grades will be retained. But in addition, a differential average points score will be published for each school, which provides information on how well the school does for students at the lower and upper ends of the ability distribution, as well as at the average:
“It is vital that performance indicators do not inadvertently cause schools to concentrate on particular groups of pupils at the expense of others. To avoid this we will continue to include performance measures, like average point scores, which capture the full range of outcomes for pupils of all abilities. In addition, from 2011 the performance tables will show for each school the variation in performance of low attaining pupils, high attaining pupils and those performing as expected.” (Wolf Review of Vocational Education, Government Response, p. 6).
This is a step forward. In our analysis we argued for exactly this measure: average GCSE points score, presented at three points in the ability distribution, low ability, average and high ability. Our criteria were functionality of the performance measure, relevance to parents and also comprehensibility. A measure is relevant if it informs parents about the performance of children very similar to their own in ability and social characteristics. It is comprehensible if it is given to them in a metric that they can meaningfully interpret. It is functional if it helps parents to answer the question: “In which feasible choice school will my child achieve the highest exam score?”. Overall, this performance measure came out on top. We also described ways that the information could best be displayed for parents: paper-based and web-based delivery mechanisms.
A second issue is that the “price” or GCSE-equivalent points of the new vocational exams seems set to change. Precise details of this are unclear at the moment. It is worth making the point again that schools will have an eye on the performance table impact of courses they offer to students. If vocational qualifications are to be worth fewer points than they receive at present, there is a danger that schools will not be keen to promote them to students who may be unlikely to score highly on more academic courses. In turn, this may make schools less keen to accept low ability pupils.
Of course, the old league table measure of percentage with 5 A* to C grades is staying. Perhaps there is a performance management version of Gresham’s Law and good performance measures will drive out bad ones. If parents come to rely more on the new measure, the media will give it more prominence and the grip of the “%5A*-C” measure on the public mind will finally begin to weaken.
Today is “school league tables” day. Performance tables are released for schools and colleges in England, reporting a number of different measures of the exam performance of their students. While much attention this year will focus on the reporting of the new “English Baccalaureate”, we ask a more fundamental question: are school league tables in general any use to parents? One of the major aims for school league tables is to support and inform parents in choosing a school for their child: but are they fit for this purpose? The answer is “yes” – we show that using school league tables does help parents to identify the school in which their own specific child will do best in her future exams.
Parents consistently rank academic standards as being one of the most important criteria for choosing a school. The performance tables provide outcome measures that are very widely reported and easy to get hold of. The idea is that parents can scrutinise the results and weigh up the merits of the local schools, considering the academic performance, travel distance, the child’s own wishes and other factors before deciding which schools to write down on their application form.
But this idea has been subject to a number of critiques. There are three main lines of argument. First, it is argued that differences in raw exam performance largely reflect differences in school composition; they do not reflect teaching quality and so are not informative about how one particular child might do at a school. Second, schools might be differentially effective so that even measures of average teaching quality or test score gains may be misleading for students at either end of the ability distribution. Some school practices and resources might matter more for gifted students, others for low ability students. Third, it is argued that the scores reported in performance tables are so variable over time that they cannot be reliably used to predict a student’s future performance. After all, today’s league tables reflect last year’s students’ exams, but a parent wants to know how her child will do in five years’ time.
It is an empirical question how quantitatively important these points are: are league tables helpful or not? The question on academic standards that parents want answered is: “In which feasible choice school will my child achieve the highest exam score?”. We argue that the best content for school performance tables is the statistic that best answers this question.
To answer this question, we use the long run of pupil data now available to researchers. We can follow students through their years at secondary school and see how they did in the exams at the end; that is standard. But we can also use statistical procedures (details) to estimate the counter-factuals of how that student would have done if s/he had gone to a different local school. We can then ask: if families had picked schools according to the league table information available at the time, would that have turned out to have been a good choice in terms of subsequent exam performance for that specific child? Focussing on the simplest measure of the school’s %5A*-C score, the results show that while it certainly does not produce a good choice for everyone, it produces a good choice for twice as many students as it produces a poor choice for. So on average, a family using the schools’ %5A*-C scores from the league tables to help identify a school that would be good academically for their child will do much better than the same family ignoring the league table information.
So are the league tables useful for parents? Definitely. Can they be improved? Certainly. The measures included in the performance tables should be judged according to their functionality, relevance, and comprehensibility. The test of functionality is the analysis just described. A measure is relevant if it informs parents about the performance of children very similar to their own in ability and social characteristics. It is comprehensible if it is given to them in a metric that they can meaningfully interpret. In fact, none of the current leading performance measures scores very well across our three criteria. We have proposed an alternative measure that performs better on these criteria. No measure can be perfect because there are important trade-offs between relevance, functionality and comprehensibility: the more disaggregated the form in which performance tables are provided (increased relevance), the less precision they will have (decreased functionality). The more factors are taken into account in describing school performance for one specific child (increased relevance), the more complex the reported measure will be (decreased comprehensibility). Any choice on the content of league table information has to make decisions on these trade-offs.
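The logic of this test can be sketched in a few lines. Everything in the example is hypothetical: the school names and numbers are invented, the estimated outcomes stand in for the counter-factual estimates described above, and labelling a choice "good" when it beats the average across the pupil's feasible schools is an assumption of the sketch rather than the definition used in the analysis.

```python
def judge_choice(league_scores, estimated_outcomes):
    """Classify the school picked on league-table score as a good or poor choice.

    league_scores: school -> published %5A*-C at the time of choosing.
    estimated_outcomes: school -> this pupil's estimated exam score at that school.
    """
    # The family picks the feasible school with the highest published score.
    picked = max(league_scores, key=league_scores.get)
    # Benchmark (assumption): the pupil's average estimated outcome across all feasible schools.
    benchmark = sum(estimated_outcomes.values()) / len(estimated_outcomes)
    if estimated_outcomes[picked] > benchmark:
        return "good"
    if estimated_outcomes[picked] < benchmark:
        return "poor"
    return "neutral"


# One hypothetical pupil choosing among three local schools.
league_scores = {"School A": 62.0, "School B": 48.0, "School C": 55.0}
estimated_outcomes = {"School A": 44.0, "School B": 40.0, "School C": 39.0}
verdict = judge_choice(league_scores, estimated_outcomes)
print(verdict)
```

Counting the "good" and "poor" verdicts across all pupils would give a ratio of the kind reported here.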
The release this week of the latest round of international comparative education results produced some fascinating findings. Not least of these was the outcome for Wales, characterised by Wales’ Education Minister as alarming and “unacceptable”.
The PISA (Programme for International Student Assessment) results derive from a standardised international assessment of 15-year-olds, run by the OECD. They show that Wales has fallen further behind since the last tests in 2006, and scored worse than before in each of reading, maths and science. Scores in Wales have fallen relative to England and are now “cast adrift from England, Scotland and Northern Ireland”. The Wales Education Minister, Leighton Andrews, described the results as reflecting “systemic failure”.
What might that systemic failure be? One leading candidate is highlighted in our recent research on accountability mechanisms for state schools. We argue that the decision in 2001 by the Welsh Assembly Government (WAG) to stop the publication of school performance tables or “league tables” has resulted in a significant deterioration in GCSE performance in Wales. The effect is sizeable and statistically significant. It amounts to around 2 GCSE grades per pupil per year; that is, achieving a grade D rather than a B in one subject. This is a substantial effect, equivalent to the impact of raising class size from 30 to 38 pupils.
Although our results are based on a study of the GCSE scores school-by-school, Figure 1 gives a very stark impression of the overall effect. Students in England and Wales were performing very similarly up to 2001, but thereafter the fraction gaining 5 good passes has strongly diverged.
We take each secondary school in Wales, and match it up to a very similar school in England. This “matching” is based on pupils’ prior attainment, neighbourhood poverty and school funding among other factors. We then track the progress (or value added) students make in these schools before and after the league tables reform, comparing the Welsh school with its English match. Our analysis explicitly takes account of the differential funding of schools in England and Wales, and the greater poverty rates found in neighbourhoods in Wales.
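The comparison amounts to a difference-in-differences calculation on matched pairs of schools. The sketch below shows only that arithmetic: the figures are invented, the matching step is taken as already done, and the adjustments for funding and neighbourhood poverty are left out.

```python
def diff_in_diff(wales_before, wales_after, england_before, england_after):
    """Change in the Welsh school's outcome net of the change in its English match."""
    return (wales_after - wales_before) - (england_after - england_before)


def average_effect(matched_pairs):
    """Mean difference-in-differences across matched Welsh/English school pairs."""
    effects = [diff_in_diff(*pair) for pair in matched_pairs]
    return sum(effects) / len(effects)


# Hypothetical outcomes: (Wales before, Wales after, England before, England after).
pairs = [(10.0, 12.0, 10.0, 15.0), (20.0, 21.0, 20.0, 25.0)]
effect = average_effect(pairs)
print(effect)
```

A negative value means that the Welsh schools improved by less than their English matches over the period.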
Why should the removal of school league tables lead to a fall in school performance? Part of the effect is through the removal of information to support parental choice of school. The performance tables allow parents to identify and then apply to the higher scoring schools, and to identify and perhaps avoid the low scoring schools. This lack of applications puts pressure on the latter schools to improve. But this is not all of the story. Perhaps as important is the simple public scrutiny of performance, and in particular the public identification of the low scoring schools. This “naming and shaming” means that low scoring schools in England are under great pressure to improve, whereas the same schools in Wales are more able to hide and to coast.
Our work has attracted criticism, including a charge of using an “ideological theory” from teacher unions. A more thoughtful critic has accused us of a “howler” in the analysis: not noting the introduction of the original GCSE-equivalent qualifications. In fact, since these were introduced equivalently in both countries they simply net out of the comparison.
Responding to our research, the Welsh Assembly Government said “wait for the PISA results”. These results are now in, and do not make happy reading. No doubt there are many factors underlying the relative performance of Wales and England, but the diminution of public accountability for the country’s schools is surely one of them.