Adaptive comparative judgement is a technique borrowed from psychophysics that can generate reliable results for educational assessment; as such, it is an alternative to traditional exam script marking. In this approach, judges are presented with pairs of student work and asked to choose which of the two is better. By means of an iterative and adaptive algorithm, a scaled distribution of student work can then be obtained without reference to criteria.

Introduction

Traditional exam script marking began in Cambridge in 1792 when, with undergraduate numbers rising, the proper ranking of students was becoming more important. In that year the new Proctor of Examinations, William Farish, introduced marking, a process in which every examiner gives a numerical score to each response by every student, and the overall total mark puts the students in the final rank order.

Francis Galton (1869) noted that, in an unidentified year about 1863, the Senior Wrangler scored 7,634 out of a maximum of 17,000, while the Second Wrangler scored 4,123. (The 'Wooden Spoon' scored only 237.) Prior to 1792, a team of Cambridge examiners convened at 5pm on the last day of examining, reviewed the 19 papers each student had sat, and published their rank order at midnight. Marking solved the problems of numbers and prevented unfair personal bias, and its introduction was a step towards modern objective testing, the format it is best suited to. But the technology of testing that followed, with its major emphasis on reliability and the automatisation of marking, has been an uncomfortable partner for some areas of educational achievement: assessing writing or speaking, and other kinds of performance, needs something more qualitative and judgemental.

The technique of Adaptive Comparative Judgement is an alternative to marking. It returns to the pre-1792 idea of sorting papers according to their quality, but retains the guarantee of reliability and fairness. It is by far the most reliable way known to score essays or more complex performances. It is much simpler than marking, and has been preferred by almost all examiners who have tried it. The real appeal of Adaptive Comparative Judgement lies in how it can re-professionalise the activity of assessment and how it can re-integrate assessment with learning.

History

Thurstone's law of comparative judgement

The science of comparative judgement began with Louis Leon Thurstone of the University of Chicago. A pioneer of psychophysics, he proposed several ways to construct scales for measuring sensation and other psychological properties. One of these was the law of comparative judgment (Thurstone, 1927a, 1927b), which defined a mathematical way of modeling the chance that one object will 'beat' another in a comparison, given values for the 'quality' of each. This is all that is needed to construct a complete measurement system. A variation on his model (see Pairwise comparison and the BTL model) states that the difference between their quality values is equal to the log of the odds that object-A will beat object-B:

$$\log \mathrm{odds}(A\ \text{beats}\ B \mid v_a, v_b) = v_a - v_b$$
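As a minimal illustration of the log-odds relation, the difference in quality values can be converted into the probability that object A beats object B by inverting the log-odds. The function name `win_probability` and the example values are illustrative assumptions, not taken from the cited sources.

```python
import math


def win_probability(v_a: float, v_b: float) -> float:
    """Probability that A beats B when log-odds(A beats B) = v_a - v_b."""
    return 1.0 / (1.0 + math.exp(-(v_a - v_b)))


# Two objects of equal quality have an even chance of winning.
print(win_probability(0.0, 0.0))            # 0.5
# An advantage of one unit on the scale gives odds of e:1.
print(round(win_probability(1.0, 0.0), 3))  # 0.731
```

Only the difference between the two values matters, so the scale has no fixed origin; adding the same constant to every value leaves all probabilities unchanged.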

Before the availability of modern computers, the mathematics needed to calculate the 'values' of each object's quality meant that the method could only be used with small sets of objects, and its application was limited. For Thurstone, the objects were generally sensations, such as intensity, or attitudes, such as the seriousness of crimes, or statements of opinions. Social researchers continued to use the method, as did market researchers for whom the objects might be different hotel room layouts, or variations on a proposed new biscuit. In the 1970s and 1980s, comparative judgement appeared, almost for the first time in educational assessment, as a theoretical basis or precursor for the new Latent Trait or Item Response Theories (Andrich, 1978). These models are now standard, especially in item banking and adaptive testing systems.
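To give a sense of the calculation involved, the sketch below estimates quality values from a table of pairwise wins under the BTL form of the model, using a simple iterative update of the kind commonly applied to Bradley-Terry-Luce data. It is not the procedure used by Thurstone or by any system described here. The function name `fit_btl`, the data, and the iteration count are assumptions for illustration. The sketch also assumes that every object has won and lost at least once and that the objects are all connected through comparisons.

```python
import math


def fit_btl(wins, n_iter=200):
    """Estimate BTL quality values (log scale, mean zero).

    wins[i][j] is the number of times object i beat object j;
    the diagonal is expected to be zero.
    """
    n = len(wins)
    strength = [1.0] * n
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom)
        # Fix the origin of the scale: make the log values average to zero.
        log_mean = sum(math.log(s) for s in updated) / n
        strength = [s / math.exp(log_mean) for s in updated]
    return [math.log(s) for s in strength]


# Hypothetical data for three scripts A, B, C (rows beat columns).
wins = [
    [0, 3, 4],  # A beat B three times and C four times
    [1, 0, 3],  # B beat A once and C three times
    [0, 1, 0],  # C beat B once
]
values = fit_btl(wins)
print([round(v, 2) for v in values])
```

Each pass loops over every pair of objects, which suggests why the method was limited to small sets when the arithmetic had to be done by hand.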

Re-introduction in education

The first published paper using Comparative Judgement in education was Pollitt & Murray (1994), essentially a research paper concerning the nature of the English proficiency scale assessed in the speaking part of Cambridge's CPE exam. The objects were candidates, represented by 2-minute snippets of video recordings from their test sessions, and the judges were Linguistics post-graduate students with no assessment training. The judges compared pairs of video snippets, simply reporting which they thought the better student, and were then clinically interviewed to elicit the reasons for their decisions.

Pollitt then introduced Comparative Judgement to the UK awarding bodies, as a method for comparing the standards of A Levels from different boards. Comparative judgement replaced their existing method, which required direct judgement of a script against the official standard of a different board. For the first two or three years of this, Pollitt carried out all of the analyses for all the boards, using a program he had written for the purpose. It immediately became the only experimental method used to investigate exam comparability in the UK; the applications for this purpose from 1996 to 2006 are fully described in Bramley (2007).

In 2004, Pollitt presented a paper at the conference of the International Association for Educational Assessment titled ''Let's Stop Marking Exams'', and another at the same conference in 2009 titled ''Abolishing Marksism''. In each paper the aim was to convince the assessment community that there were significant advantages to using Comparative Judgement in place of marking for some types of assessment. In 2010 he presented a paper at the Association for Educational Assessment – Europe, ''How to Assess Writing Reliably and Validly'', which presented evidence of the extraordinarily high reliability that has been achieved with Comparative Judgement in assessing primary school pupils' skill in first-language English writing.

Adaptive comparative judgement

Comparative judgement becomes a viable alternative to marking when it is implemented as an adaptive web-based assessment system. In such a system, the 'scores' (the model parameter for each object) are re-estimated after each 'round' of judgements in which, on average, each object has been judged one more time. In the next round, each script is compared only to another whose current estimated score is similar, which increases the amount of statistical information contained in each judgement. As a result, the estimation procedure is more efficient than random pairing, or any other pre-determined pairing system like those used in classical comparative judgement applications (Pollitt, 2012) [Pollitt, A (2012) ''The method of Adaptive Comparative Judgement''. Assessment in Education: Principles, Policy & Practice. 19: 3, 1-20. DOI:10.1080/0969594X.2012.665354]. As with computer-adaptive testing, this adaptivity maximises the efficiency of the estimation procedure, increasing the separation of the scores and reducing the standard errors. The most obvious claimed advantage is that this produces significantly enhanced reliability, compared to assessment by marking, with no loss of validity. However, whether the adaptivity genuinely increases reliability is not certain (Bramley and Vitello, 2016) [Bramley, T and Vitello, S (2016) ''The effect of adaptivity on the reliability coefficient in adaptive comparative judgement''. Assessment in Education: Principles, Policy & Practice. 26: 1, 43-58. DOI:10.1080/0969594X.2017.1418734].
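The pairing idea can be sketched as follows. Under the logistic form of the model, the statistical information carried by a single judgement is p(1 − p), where p is the probability that one script beats the other, and this is largest when the two scores are equal. The pairing rule shown here (sort by current estimate, then pair neighbours) is a simplified assumption for illustration, not the algorithm of any particular system. The names `information`, `adaptive_pairs` and the example estimates are hypothetical.

```python
import math


def information(v_a: float, v_b: float) -> float:
    """Information from one judgement between scripts with scores v_a, v_b."""
    p = 1.0 / (1.0 + math.exp(-(v_a - v_b)))
    return p * (1.0 - p)


def adaptive_pairs(estimates: dict) -> list:
    """Pair each script with a neighbour in the current rank order.

    With an odd number of scripts, the highest-ranked one sits out the round.
    """
    order = sorted(estimates, key=estimates.get)
    return [(order[i], order[i + 1]) for i in range(0, len(order) - 1, 2)]


# Hypothetical score estimates after an earlier round of judgements.
estimates = {"s1": -1.2, "s2": 0.1, "s3": 1.5, "s4": -0.9, "s5": 0.3, "s6": 1.1}

print(adaptive_pairs(estimates))
# [('s1', 's4'), ('s2', 's5'), ('s6', 's3')]

print(information(0.0, 0.0))            # 0.25, the maximum possible
print(round(information(0.0, 3.0), 3))  # 0.045, a badly matched pair
```

A practical system would presumably also need to avoid repeating the same pairing and to balance the load across judges; those details are omitted from this sketch.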

Current comparative judgement projects

RM Compare

RM Compare is the original adaptive comparative judgement system. The system was originally developed as CompareAssess by the company Digital Assess, formerly TAG Developments. It is designed to run deployments of Adaptive Comparative Judgement at scale and has been used around the world in a wide range of contexts.

Open Source Comparative Judgement Projects

The Digital Platform for the Assessment of Competences (D-PAC) is a consortium of the University of Antwerp, iMinds and Ghent University formed to create an open source Comparative Judgement application. D-PAC, in collaboration with No More Marking Ltd, has released the algorithms that power www.nomoremarking.com under the GNU General Public License Version 3, 29 June 2007.

No More Marking has created an online Comparative Judgement application, along with a repository of useful information.

e-scape

The first application of Comparative Judgement to the direct assessment of students was in a project called e-scape, led by Prof. Richard Kimbell of London University's Goldsmiths College (Kimbell & Pollitt, 2008) [Kimbell, R A and Pollitt, A (2008) ''Coursework assessment in high stakes examinations: authenticity, creativity, reliability''. Third international Rasch measurement conference. Perth: Western Australia: January]. The development work was carried out in collaboration with a number of awarding bodies in a Design & Technology course. Kimbell's team developed a sophisticated and authentic project in which students were required to develop, as far as a prototype, an object such as a children's pill dispenser in two three-hour supervised sessions. The web-based judgement system was designed by Karim Derrick and Declan Lynch from TAG Developments, now a part of Digital Assess, and based on the original MAPS assessment portfolio system, now known as Manage. Goldsmiths, TAG Developments and Pollitt ran three trials, increasing the sample size from 20 to 249 students, and developing both the judging system and the assessment system. There are three pilots, involving Geography and Science as well as the original in Design & Technology.

Primary school writing

In late 2009, TAG Developments and Pollitt trialled a new version of the system for assessing writing. A total of 1000 primary school scripts were evaluated by a team of 54 judges in a simulated national assessment context. The reliability of the resulting scores after each script had been judged 16 times was 0.96, considerably higher than in any other reported study of similar writing assessment. Further development of the system has shown that reliability of 0.93 can be reached after about 9 judgements of each script, when the system is no more expensive than single marking but still much more reliable.
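The workload implied by these figures can be worked through with simple arithmetic. The sketch assumes that "judged 16 times" means each script appeared in 16 paired comparisons, so that every comparison counts towards two scripts, and that the comparisons were shared equally among the judges. Neither assumption is stated in the source, and the function name `comparisons_needed` is illustrative.

```python
def comparisons_needed(n_scripts: int, judgements_per_script: int) -> int:
    """Total paired comparisons; each one counts for both scripts in the pair."""
    return n_scripts * judgements_per_script // 2


n_scripts, n_judges = 1000, 54
for per_script in (16, 9):
    total = comparisons_needed(n_scripts, per_script)
    print(per_script, total, round(total / n_judges, 1))
# 16 8000 148.1
# 9 4500 83.3
```

On these assumptions, stopping at 9 judgements per script rather than 16 would cut each judge's load from about 148 comparisons to about 83.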

Further projects

Several projects are underway at present, in England, Scotland, Ireland, Israel, Singapore and Australia. They range from primary school to university in context, and include both formative and summative assessment, from writing to mathematics. The basic web system is now available on a commercial basis from TAG Assessment (http://www.tagassessment.com), and can be modified to suit specific needs. ACJ has been used by Seery, Canty, Gordon and Lane in the University of Limerick, Ireland to assess undergraduate student work on Initial Teacher Education programmes since 2009. ACJ has also been used by Dr. Bartholomew at Purdue University to assess design portfolios in middle, high-school, and university students. Bartholomew has also used ACJ as a formative assessment teaching and learning tool for open-ended problems.

References

* Pollitt, A (2015) ''On Reliability Bias in ACJ: Valid simulation of Adaptive Comparative Judgement''. Cambridge Exam Research: Cambridge, UK. Available at https://www.researchgate.net/publication/283318012_On_%27Reliability%27_bias_in_ACJ
* APA, AERA and NCME (1999) ''Standards for Educational and Psychological Testing''.
* Galton, F (1869) ''Hereditary genius: an inquiry into its laws and consequences''. London: Macmillan.
* Kimbell, R A, Wheeler, A, Miller, S, and Pollitt, A (2007) ''e-scape portfolio assessment (e-solutions for creative assessment in portfolio environments) phase 2 report''. TERU Goldsmiths, University of London.
* Pollitt, A (2004) ''Let's stop marking exams''. Annual Conference of the International Association for Educational Assessment, Philadelphia, June. Available at http://www.camexam.co.uk publications.
* Pollitt, A (2009) ''Abolishing Marksism, and rescuing validity''. Annual Conference of the International Association for Educational Assessment, Brisbane, September. Available at http://www.camexam.co.uk publications.
* Pollitt, A, & Murray, N (1993) ''What raters really pay attention to''. Language Testing Research Colloquium, Cambridge. Republished in Milanovic, M & Saville, N (Eds), Studies in Language Testing 3: Performance Testing, Cognition and Assessment, Cambridge University Press, Cambridge.

External links

* RM Compare
* No More Marking Ltd.
* E-scape
* Rewarding Risk
* TAG Assessment ACJ
* D-PAC