How to choose a reliable language test
Teaching English with Oxford
Providing education leaders and decision-makers with solutions and insights into English learning and assessment.
If you find yourself in the position of choosing or recommending a language test, how can you be sure it's reliable? Before we can answer that question, first we need to understand what reliability actually is. The everyday sense of the word may be familiar, but what does it mean when talking about assessment?
What is test reliability?
Who do you know that you would say is ‘reliable’? It could be a friend who always follows through on their promises and never fails to show up when they say they’re going to. Or do you know someone who always lets you down? In both cases, the common theme is consistency – you can expect a similar outcome each time in your dealings with them.
In the context of assessment, the term ‘reliability’ refers specifically to consistency of measurement. A more reliable test measures more consistently than a less reliable one.
Why is test reliability important?
A few weeks ago, one of my oldest (and most reliable) friends came over to help me build some shelves, which involved cutting five pieces of wood to the same length. To measure out the wood, we had several options:
A. deploy a laser measuring device to measure the wood to within a tenth of a millimetre
B. unroll a tape measure and mark where to cut with a pencil
C. glance quickly at the area where the wood was to go, then start sawing.
Which method is best? Clearly, option A would result in all pieces being nearly identical in length (i.e. a highly consistent measure), but a laser measuring device would have been expensive, and we wouldn’t need such a high level of accuracy to build the shelves. With option C, the pieces would end up being quite different lengths, which would have caused problems. We wanted our measurement to be sufficiently consistent for the task at hand, so we went for option B.
In a similar way to option C, if a test isn’t sufficiently reliable (i.e. the measurement is not consistent enough), the consequences for stakeholders can be problematic. This leads to the next question:
How reliable should tests be?
Reliability is reported as a number between zero and 1. Quantifying a minimum acceptable level of reliability for a particular test depends on a variety of factors. How important is consistency of measurement in a classroom spelling bee? How about a university entrance test? Clearly, more so for the latter. For high-stakes tests such as this, test developers would seek to make reliability as high as possible. For example, in the Oxford Test of English Advanced pilot study, a minimum threshold of .80 for reliability across all parts of the test was adopted.
Why not aim for reliability of 1?
Measuring language proficiency isn’t the same as measuring length. ‘Language proficiency’ is not something we can directly observe; we infer it from how test takers respond to test tasks. Since no test can contain all possible language tasks, any language test will always be a partial measure of language proficiency, and some variability in test scores is expected. In classical test theory, the score a test taker receives (the ‘observed score’) is a combination of their ‘true score’ and their ‘error score’. The true score reflects purely the test taker’s ability, while the error score includes factors unrelated to their ability, ranging from things the test developer can control (such as how well-written the test is) to things they can’t (such as the test taker’s mood on the day of the test). Because in any given test administration there will always be factors beyond the test developer’s control feeding into the observed score, some level of unreliability will inevitably occur, even though these factors can (and should) be minimized (Bachman and Palmer, 1990).
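To make the idea concrete, here is a tiny numerical sketch (in Python, with entirely invented figures): each simulated observed score is a true score plus an error score, and reliability falls out as the share of observed-score variance that comes from the true scores.

```python
# A toy simulation of the classical test theory decomposition:
# observed score = true score + error score. All numbers are invented.
import numpy as np

rng = np.random.default_rng(42)
n_test_takers = 10_000

true_scores = rng.normal(loc=60, scale=10, size=n_test_takers)   # ability we cannot observe directly
error_scores = rng.normal(loc=0, scale=5, size=n_test_takers)    # mood, luck, item quality, etc.
observed_scores = true_scores + error_scores                     # what the test actually reports

# Reliability is the proportion of observed-score variance attributable to true scores.
reliability = true_scores.var() / observed_scores.var()
print(f"Simulated reliability: {reliability:.2f}")  # below 1 whenever error variance > 0
```

In this sketch the error variance keeps reliability at roughly .80 rather than 1, which is exactly the point: as long as any error variance remains, perfect reliability is out of reach.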
How certain can we be about test scores?
For each test score, we can also say how confident we are that this score is an accurate representation of the test taker’s true ability. Reliability is often reported alongside the Standard Error of Measurement (SEM), which indicates the level of confidence in the observed scores. Lower SEM values indicate higher confidence. Tests that report high reliability (i.e. more consistency) might also report a high SEM (i.e. less confidence), so it’s important to consider SEM alongside reliability (Carr, 2011).
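One common formula links the two: the SEM is the standard deviation of the test scores multiplied by the square root of one minus the reliability. The short sketch below (hypothetical figures, not taken from any Oxford test) shows how it can be used to put a rough confidence band around a single observed score.

```python
# A rough illustration of the Standard Error of Measurement (SEM).
# The standard deviation and reliability figures below are hypothetical.
import math

score_sd = 12.0      # standard deviation of observed scores on the test (hypothetical)
reliability = 0.90   # reported reliability coefficient (hypothetical)

sem = score_sd * math.sqrt(1 - reliability)   # one common formula for SEM
print(f"SEM: {sem:.1f} score points")

# An approximate 95% band around a single observed score of 70:
observed = 70
low, high = observed - 1.96 * sem, observed + 1.96 * sem
print(f"The true score plausibly lies between {low:.1f} and {high:.1f}")
```

With these hypothetical numbers the SEM comes out at around 3.8 score points, so an observed score of 70 really tells us the true score most likely sits somewhere in the low-to-high 60s and 70s, which is why SEM deserves attention alongside the headline reliability figure.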
How is test reliability calculated?
The method test developers choose to calculate reliability depends on the type of test and what is being investigated.
Consistency over time
Let’s say a developer wants to measure consistency of scores over time. They could use test-retest reliability: administer the same set of questions to the same group of test takers on more than one occasion, then compare the two sets of scores. However, the test takers would need to be willing to sit the test more than once, and they may learn or forget things between administrations, or remember the questions the second time round. To avoid some of these issues, a developer may opt for parallel forms reliability, where equivalent versions of the same test are administered. This technique was used to investigate whether test takers would get the same score on different administrations of the Oxford Placement Test, which, as a Computer Adaptive Test (CAT), delivers different combinations of items to test takers each time.
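In both cases, the reliability estimate is typically just the correlation between the two sets of scores. A minimal sketch with invented scores:

```python
# A minimal sketch of test-retest reliability: correlate two administrations
# of the same test for the same group of test takers. Scores are invented.
import numpy as np

scores_first_sitting = np.array([55, 62, 48, 71, 66, 59, 80, 45])
scores_second_sitting = np.array([57, 60, 50, 69, 68, 61, 78, 47])

# The Pearson correlation between the two sets of scores serves as the estimate.
test_retest_reliability = np.corrcoef(scores_first_sitting, scores_second_sitting)[0, 1]
print(f"Test-retest reliability: {test_retest_reliability:.2f}")
```

Parallel forms reliability is calculated in the same way, except that the second column of scores comes from an equivalent version of the test rather than a repeat of the same one.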
Internal consistency
If the test is going to be taken only once by each test taker, internal consistency can be investigated. This might take the form of split-half reliability, where the test is divided in half and reliability is calculated by comparing the two halves. One problem with this is that the reliability reported will differ depending on how you divide the test. Do you make a straight split down the middle? Do you put the odd-numbered questions in half A and the even-numbered questions in half B? To get around this, a statistic such as Cronbach’s alpha (which can be thought of as the average of all possible split-half estimates) can be used. Cronbach’s alpha was used to evaluate the internal consistency of the Oxford Test of English and Oxford Test of English for Schools Reading and Listening modules as part of a recent score comparability report.
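For readers who like to see the mechanics, here is a small sketch on an invented item-response matrix: a split-half estimate (with the Spearman-Brown correction applied so it reflects the full-length test) alongside Cronbach’s alpha computed from item and total-score variances.

```python
# Two internal-consistency estimates on a small, invented item-response matrix
# (rows = test takers, columns = items scored 0/1).
import numpy as np

responses = np.array([
    [1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
])
n_items = responses.shape[1]

# Split-half: compare odd-numbered and even-numbered items, then apply the
# Spearman-Brown correction to estimate reliability for the full-length test.
half_a = responses[:, 0::2].sum(axis=1)
half_b = responses[:, 1::2].sum(axis=1)
r_halves = np.corrcoef(half_a, half_b)[0, 1]
split_half = 2 * r_halves / (1 + r_halves)

# Cronbach's alpha: based on item variances relative to total-score variance.
item_variances = responses.var(axis=0, ddof=1)
total_variance = responses.sum(axis=1).var(ddof=1)
alpha = (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

print(f"Split-half (Spearman-Brown corrected): {split_half:.2f}")
print(f"Cronbach's alpha: {alpha:.2f}")
```

Note how the split-half figure depends entirely on which items end up in which half; Cronbach’s alpha sidesteps that arbitrariness, which is why it is so widely reported.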
Inter-rater reliability
Imagine a Speaking or Writing test where the same test taker responses (or ‘scripts’) are marked by multiple human assessors. If one script were given a score of 10 by one assessor, then we would want a different assessor to give a very similar (or ideally, the same) score. This is desirable because it shouldn’t make a difference which assessor marks the script. To investigate how consistently the assessors mark the same scripts, inter-rater reliability can be calculated. For example, in the Oxford Test of English Advanced pilot study, inter-rater reliability was reported for the Speaking and Writing modules with the Intraclass Correlation Coefficient (ICC). Carefully designed rating scales and watertight processes around assessor training and certification contribute to high levels of inter-rater reliability.
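As an illustration of the kind of calculation involved, the sketch below computes a two-way ICC (Shrout and Fleiss’s ICC(2,1): absolute agreement, single rater) from an invented matrix of ratings. It is a generic example of the statistic, not the exact procedure used in the pilot study.

```python
# Inter-rater reliability via a two-way ICC computed from ANOVA mean squares
# (Shrout & Fleiss ICC(2,1)). The ratings matrix is invented:
# rows = scripts, columns = assessors.
import numpy as np

ratings = np.array([
    [10, 9, 10],
    [ 7, 7,  8],
    [ 5, 6,  5],
    [ 9, 9,  9],
    [ 6, 5,  6],
], dtype=float)
n_scripts, n_raters = ratings.shape

grand_mean = ratings.mean()
ss_total = ((ratings - grand_mean) ** 2).sum()
ss_rows = n_raters * ((ratings.mean(axis=1) - grand_mean) ** 2).sum()   # between scripts
ss_cols = n_scripts * ((ratings.mean(axis=0) - grand_mean) ** 2).sum()  # between assessors
ss_error = ss_total - ss_rows - ss_cols

msr = ss_rows / (n_scripts - 1)                       # mean square for scripts
msc = ss_cols / (n_raters - 1)                        # mean square for assessors
mse = ss_error / ((n_scripts - 1) * (n_raters - 1))   # residual mean square

icc_2_1 = (msr - mse) / (msr + (n_raters - 1) * mse + n_raters * (msc - mse) / n_scripts)
print(f"ICC(2,1): {icc_2_1:.2f}")
```

A value close to 1 would indicate that it makes very little difference which assessor happens to mark a given script.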
Is high reliability enough on its own?
In a previous blog on test validity, we said that the extent to which a test captures useful and meaningful information in order to be able to make justifiable decisions (for example, decisions about university entrance) is referred to as ‘test validity’. Reliability is a necessary part of test validity because it gives us an indication of how far we can rely on the results for making the decisions that we want to make about the test taker (Weir, 2005). If a test weren’t sufficiently reliable, it would be difficult to justify claims about how the results can be used. On the other hand, high reliability by itself isn’t enough to claim test validity; consistency would be irrelevant if the wrong thing were being measured. Imagine that you wanted to measure someone’s ability to run 100 metres, but you gave them 50 multiple-choice questions on general-knowledge topics. You might obtain high reliability figures, but the test wouldn’t tell you anything about the ability that you wanted to measure. This may be an absurd example, but it highlights the point that the items on a test need to be fit for purpose.
What should I look for when choosing a test?
In summary, along with evidence that the test scores are valid for the intended use (which may come in the form of test specifications or research reports), reviewing reliability figures in light of the ‘type’ of reliability reported, together with the Standard Error of Measurement, could help you determine whether a test will be appropriate for you or your students.
Oliver Bigland holds an MA in Applied Linguistics from the University of Birmingham. He has taught English in Japan, Spain and the UK. His interests include automating assessment-related processes and generating insights from test-taker response data.