What Is ‘Test Validity’?

‘Test validity’ is crucial to understanding why tests exist. Any educational test has to serve a purpose – it must describe something about the test takers to be useful for making decisions. Tests by necessity are short – perhaps one or two hours. This time must be well organised to capture useful information. For example, in an English language test, the test has to capture examples of language performance so that claims can be made about language proficiency. We cannot capture everything about a test taker’s language ability, but we must capture enough information to make justifiable decisions (about university entry, for example). The extent to which the test has captured useful and meaningful information is what we call ‘test validity’.

For language teachers, understanding test validity is essential for selecting and using tests appropriately and interpreting their results effectively. There are different kinds of tests, which exist for different purposes. For example, a placement test needs to provide information for teachers to distinguish between different levels of proficiency. The test takers can be allocated to different classes which best match their proficiency. A diagnostic test has to provide information about specific language points so that teachers can then decide if remedial sessions are necessary. In contrast, a proficiency test, such as the Oxford Test of English, exists to allow stakeholders, such as university administrators, to make decisions about admissions based on the language proficiency of individual test takers.

Test validity is therefore connected to test use. If a test is used for a specific purpose, then we require evidence that the test is suitable for that purpose. This evidence can take different forms. The different forms of evidence are often presented as different kinds of validity.

Before we look at these in detail, take a moment and write down some different kinds of test validity you can think of (for example, ‘face validity’). A good way to do this is to list some different kinds of tests, such as those mentioned above (placement, diagnostic, proficiency) and note down what kinds of information you would want to know about each test before you decide to use them. This will help to identify different kinds of validity. You can compare your list to the types of validity set out below as you read on.

Types of Test Validity

Content validity refers to the extent to which test tasks represent the kinds of activities test takers will be required to do in the real world (Bachman, 1990). For example, the Oxford Test of English Advanced contains a Summary task, in which test takers read two texts and write a summary of both in a single text. This is designed to mimic the demands of university courses, where students read multiple book chapters or articles and then use what they have learned as part of a larger argument.

Construct validity refers to the extent to which a test measures the psychological ‘construct’ it is intended to measure. This is the classic definition of ‘test validity’ offered by Samuel Messick in 1989. This influential definition is still widely used today. What is a ‘construct’? Put simply, it is the thing we are trying to test. ‘Reading’, ‘Speaking’, ‘Writing’ or ‘Listening’ are common constructs in language testing. Take the construct of ‘reading’ for example. How do we know that a reading test is actually measuring reading? How is reading defined and operationalised in the test? Is it testing the kinds of reading we’re interested in?

We might define a construct of reading to include ‘scanning’ or ‘skimming’, which are kinds of expeditious reading: the ability to read quickly to extract specific information. A test would then need to limit the time allowed for such tasks, or it would not be measuring the expeditious construct.

Thankfully, with online, computer-adaptive tests, this is straightforward, as time limits can be set for each task. At the same time, we need to exclude any confounding factors which would ‘pollute’ the measurement. For example, test questions are subject to ‘test wiseness’, where test takers exploit poorly written questions to help them guess the answer. To illustrate, imagine a three-option multiple-choice reading question in which the correct option contains a word which is also used in the reading text, but that word does not appear in the incorrect options. The test taker could easily match the word to the text to get the correct answer without actually understanding the question. This means the question becomes too easy and is not testing what we want. This is referred to as ‘construct-irrelevant’ information.
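
To make the word-matching problem concrete, here is a minimal sketch of how an item reviewer might flag this kind of lexical overlap automatically. The item format, the tiny stopword list and the example passage are all invented for illustration; real item-review processes are considerably more sophisticated.

```python
import re

# Minimal sketch of a review check for potential 'test wiseness': flag
# multiple-choice items whose correct option shares a content word with the
# passage that none of the incorrect options share. The item structure and
# stopword list are hypothetical.

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "they"}

def content_words(text):
    """Return the set of lower-cased content words in a text."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def word_matching_cues(passage, options, answer_index):
    """Words found in both the passage and the key, but in no distractor."""
    passage_words = content_words(passage)
    key_words = content_words(options[answer_index])
    distractor_words = set()
    for i, option in enumerate(options):
        if i != answer_index:
            distractor_words |= content_words(option)
    return (key_words & passage_words) - distractor_words

# Invented example: 'migrate' appears in the passage and only in the key.
passage = "Many birds migrate south before winter to find food."
options = [
    "They migrate to warmer regions.",   # key
    "They hibernate in caves.",
    "They store food underground.",
]
print(word_matching_cues(passage, options, answer_index=0))  # {'migrate'}
```

A non-empty result simply signals that the item deserves a closer look; it does not prove the item is faulty.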

Criterion-related validity (or predictive validity) refers to the relationship between test scores and some external criterion, such as another test or real-world language use (Weir, 2005). For example, comparing scores obtained in language tests to scores obtained in university courses is one way to examine whether the test accurately certifies learners to take part in academic study. This kind of evidence is difficult to come by as it is time-consuming to investigate, requiring the tracking of students from their language test results through to their university experience. Nonetheless, such information is valuable in determining whether a test is suitable for its purpose.
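
As a rough illustration of how such a relationship might be quantified, the sketch below correlates invented language test scores with later course grades using a Pearson correlation. Real predictive-validity studies involve far larger samples and more careful designs.

```python
# Minimal sketch of a criterion-related (predictive) validity check:
# correlate language test scores with a later external criterion, here
# end-of-first-year course grades. All numbers are invented for illustration.

from scipy.stats import pearsonr

test_scores   = [62, 71, 55, 80, 68, 74, 59, 85, 66, 77]  # language test results
course_grades = [58, 69, 52, 78, 63, 70, 61, 82, 60, 75]  # grades a year later

r, p_value = pearsonr(test_scores, course_grades)
print(f"Predictive validity coefficient r = {r:.2f} (p = {p_value:.3f})")

# A strong positive correlation is one piece of evidence that the test supports
# admissions decisions; a weak one would call its use for that purpose into question.
```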

Though not a technical measure of validity, face validity refers to the extent to which a test appears effective in terms of its stated aims to those taking it or those administering it. For example, a test for university entry is likely to contain an Essay task. Test takers, administrators and teachers all expect this. Imagine you see an English language test for university entry that did not contain a written Essay. Would you be happy with it? Would you feel confident about the results? A test without such a component would almost certainly be questioned by administrators, regardless of how rigorous the evidence presented for the other parts of the test might be.

Implications for Stakeholders

So, what does this mean for university admissions officers, employers, or English language teachers? Understanding these aspects of test validity has important implications. Firstly, knowledge of test validity can help teachers develop curricula and pedagogy to ensure that instruction is focused on developing language skills rather than test-taking skills (think about the example of reading discussed above). Secondly, it aids admissions officers and employers in identifying tests with tasks which measure language skills and constructs used in real-world situations, required for academic or professional success (think about the Summary task mentioned above). Finally, awareness of test validity supports the accurate interpretation of test scores: knowing what scores mean leads to more equitable decisions based on them.

Some Contemporary Issues in Test Validity

Recent developments in learning, teaching and testing have had a significant effect on language testing and test validity, due to a combination of technological developments and concerns around test fairness. This final section outlines some of these developments and their respective impacts on testing.

First, a noted shift towards online and adaptive language testing has raised questions about the comparability of online and paper-based tests. Issues such as test security, the impact of technology on test performance, and the digital divide among test-takers have huge implications for test validity (Ockey & Kunnan, 2020). The use of automated scoring systems for language tests has also grown, particularly for speaking and writing, offering the potential for more objective and efficient scoring (Xi, 2010). However, concerns remain about the ability of these systems to score complex spoken and written language performances accurately. Meanwhile, artificial intelligence is rapidly becoming a major focus of language testing research, in the form of remote proctoring and adaptive testing (as used in the Oxford Placement Test and Oxford Test of English suite). Traditional models of validity may be inadequate to address these changes or to account for the kinds of evidence required to demonstrate validity for specific purposes. The field is actively exploring how the concept of test validity can be informed by the machine-learning literature.
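
For readers unfamiliar with how adaptive testing works in principle, the toy sketch below selects each new item to match the current ability estimate and nudges that estimate after each response. The item bank, the step-size update and the answer pattern are all hypothetical; operational adaptive tests rely on item response theory rather than this simple rule.

```python
# Toy sketch of computer-adaptive item selection: after each response, pick the
# unused item whose difficulty is closest to the current ability estimate, then
# nudge the estimate up or down. Item bank, step size and answer pattern are
# invented; operational tests use item response theory, not this simple rule.

def next_item(items, ability, administered):
    """Choose the unadministered item whose difficulty best matches ability."""
    candidates = [item for item in items if item["id"] not in administered]
    return min(candidates, key=lambda item: abs(item["difficulty"] - ability))

def update_ability(ability, correct, step=0.5):
    """Raise the estimate after a correct answer, lower it after an incorrect one."""
    return ability + step if correct else ability - step

item_bank = [{"id": k, "difficulty": d} for k, d in enumerate([-2.0, -1.0, 0.0, 1.0, 2.0])]
ability, administered = 0.0, set()

for correct in [True, True, False]:  # hypothetical sequence of responses
    item = next_item(item_bank, ability, administered)
    administered.add(item["id"])
    ability = update_ability(ability, correct)
    print(f"item {item['id']} (difficulty {item['difficulty']:+.1f}) -> ability {ability:+.1f}")
```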

Second, the increasing diversity of language learners has raised questions about fairness in language tests. Tests must be valid for diverse populations of test takers, avoiding cultural, linguistic, or socioeconomic biases that could disadvantage some test-takers (Kunnan, 2004). In many educational contexts, learners speak a variety of languages. There is concern that the linguistic diversity of these contexts is underserved by existing tasks or test content. What types of language should be included in high-stakes tests, and how well are students served by English language tests in differing contexts?

Conclusion

Despite these challenges, test validity remains a cornerstone of effective language testing, keeping test providers focused on whether their tests accurately measure what they are intended to measure. From university admissions officers to employers to English language teachers, understanding the various aspects of test validity is essential for selecting, administering, and interpreting language tests to the benefit of language learners. ‘Test validity’ continues to evolve, particularly with the advent of new technologies and in response to the changing demographics of language learners and how they use language.


References

  • Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford University Press.
  • Kunnan, A. J. (2004). Test fairness. In M. Milanovic & C. Weir (Eds.), European language testing in a global context: Proceedings of the ALTE Barcelona conference July 2001 (pp. 27-48). Cambridge University Press.
  • Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). American Council on Education/Macmillan.
  • Ockey, G. J., & Kunnan, A. J. (2020). The impact of the transition to online language testing on test validity: A review. Language Assessment Quarterly, 17(4), 412-427.
  • Weir, C. J. (2005). Language testing and validation: An evidence-based approach. Palgrave Macmillan.
  • Xi, X. (2010). Automated scoring and feedback systems: Where are we and where are we heading? Language Testing, 27(3), 291-300.


Dr. Nathaniel Owen is Senior Assessment Research and Analysis Manager at Oxford University Press. He holds a PhD in language testing from the University of Leicester specialising in L2 reading processes. In addition to reading processes, his research interests include the interface of language testing and technology, big data analytics, the use of language tests in English-medium instruction contexts, research methods and widening participation in higher education. He has presented work at multiple national and international conferences including Language Testing Forum (LTF), Language Testing Research Colloquium (LTRC), International Association of Teachers of English as a Foreign Language (IATEFL) and Association of Language Testers in Europe (ALTE).
