Language Testing: Part Two
Geoff Jordan
PhD Supervisor at University of Wales Trinity Saint David. Challenging Coursebook-driven ELT
Norm-referenced and Criterion-referenced Tests
First, we must distinguish between normative and criterion-referenced testing. The normative approach is the dominant paradigm in educational testing generally. It compares individuals to each other, the test score indicating the position of an individual in relation to others, comparing them all to a “norm” – what is considered a “normal” score. Most of the high-stakes language tests currently used in the world are norm-referenced tests, and if the purpose of testing is to distribute scarce resources like university places fairly, it seems logical to use a test that separates out the test takers in such a way that the highest scorers deserve places at the most prestigious institutions. We’ve already seen in Part One that unintended consequences need to be considered, and we’ll see more of this below.
The second paradigm is that of criterion-referenced testing. Criterion-referenced tests help with decisions about whether an individual test taker has achieved a pre-specified criterion, or standard, that is required for a particular decision context. Fulcher (2010) gives the example of the International Civil Aviation Organization’s requirement that air traffic controllers achieve a criterion level of English before they may practise. The purpose of this test is not to select the best speakers of English to be air traffic controllers, but to establish a criterion by which an individual can be classified as ‘operationally proficient’.
How Norm-referenced Tests Work
Norm-referenced tests are constructed to provide information about the relative status of members of a group; they allow a test-taker’s score to be compared with the score distribution (i.e., the mean and standard deviation) of a norm group. Before I disappear in a fog of stats terms, here are the basics:
Bell Curve: The bell curve is a graph that shows the percentage of test-takers who score low to high on a test. When all scores are plotted, the graph forms a bell shape: most scores fall close to the middle, with relatively few falling far above or below the average.
Standard Deviation: The bell curve is measured in units called standard deviations. The standard deviation describes how spread out the values in a set of data are. It tells us how far a student’s standard score is from the average, or mean: the closer a score is to the mean, the fewer standard deviations it lies from it.
Mean: The mean sits at the middle of the bell curve, at the 50th percentile. Many standardized score scales set the mean at 100.
Types of Score
Raw Scores: Raw scores describe the number of correct answers on a test or the number of tasks performed correctly. For example, if a student answered 50 out of 100 questions correctly, they would receive a raw score of 50. Raw scores are converted into standard scores, percentile ranks, and grade-equivalent scores for reporting.
Standard Score: Standard scores are raw scores converted onto a common numerical scale with a fixed mean and standard deviation, so that scores can be compared across different grades or age groups. These scores reflect a student’s rank compared to others: they indicate how far above or below the mean the individual’s score falls. For example, if the test’s mean is 100 and the standard deviation is 15, a score of 115 lies one standard deviation above the mean.
Percentiles: Percentiles are probably the most commonly used test score in language testing. A percentile indicates the rank of the student compared to others of the same age or grade. For example, a percentile score of 75 indicates that 75% of the students who took the same standardized test received the same score or a lower one.
We know that most scores are fairly close to the mean, the score that splits the distribution of scores into two. In fact, around 68 per cent of all scores fall within one standard deviation of the mean, with approximately 34 per cent just above it and 34 per cent just below it. As we move away from the mean, the scores in the distribution become more extreme, and so less common. It is very rare for test takers to get all items correct, just as it is very rare for them to get all items incorrect. But there are a few in every large group who do exceptionally well, or exceptionally poorly. The curve of normal distribution tells us what the probability is that a test taker could have got the score they have, given the place of the score in a particular distribution. And this is why we can say that a score is ‘exceptional’, or ‘in the top 10 per cent’, or ‘just a little better than average’.
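To make these figures concrete, here is a minimal Python sketch (my own illustration, not taken from the testing literature) that checks the 68 per cent figure against the standard normal distribution and converts a standard score of 115 (on a scale with mean 100 and SD 15) into an approximate percentile:

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal curve: mean 0, standard deviation 1

# Proportion of scores expected within one standard deviation of the mean
within_one_sd = std_normal.cdf(1) - std_normal.cdf(-1)
print(round(within_one_sd * 100, 1))   # 68.3 -> the "around 68 per cent" above

# A standard score of 115 on a scale with mean 100 and SD 15 is one SD above the mean,
# so roughly 84 per cent of test takers are expected to score below it.
scale = NormalDist(mu=100, sigma=15)
print(round(scale.cdf(115) * 100, 1))  # 84.1 -> approximately the 84th percentile
```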
OK? Let’s resume.
A test score is arrived at by giving the test takers a number of items or tasks to do. The responses to these items are scored, usually as correct (1) or incorrect (0). The number of correct responses for each individual is then added up to arrive at a total raw score. Fulcher uses the example of a test with 29 items, where each item is scored as correct or incorrect, and 25 language learners take the test. (I should say that these are data Fulcher uses from a very bad test, and he makes some great use of the data which I ignore. One more reason to read the whole book.) Here are the results from the lowest to the highest.
1 1 2 3 5 6 6 7 8 10 10 11 11 11 13 13 14 15 15 16 17 18 25 27 28
The scores are presented visually in a histogram (Fulcher, 2010, p. 38), and we see that the most frequent score is 11. We are also interested in the score that falls in the middle of the distribution, the median score, which is also 11. The most useful description of the mid-point for norm-referenced tests is the mean, calculated by adding all the scores together and dividing the total by the number of test takers. The mean score is 11.72. From this, we can calculate each test taker’s deviation score by subtracting the mean from their score, and from these deviation scores we calculate the standard deviation, which works out at approximately 7.5.
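As a quick check on these figures, here is a minimal Python sketch (mine, not Fulcher’s) that recomputes the descriptive statistics from the raw scores above. Note that dividing the summed squared deviations by N gives a standard deviation of roughly 7.3, while dividing by N – 1 (the sample formula) gives the 7.5 reported here:

```python
import statistics

# Raw scores for the 25 test takers (from the list above; Fulcher, 2010, p. 38)
scores = [1, 1, 2, 3, 5, 6, 6, 7, 8, 10, 10, 11, 11, 11, 13,
          13, 14, 15, 15, 16, 17, 18, 25, 27, 28]

n = len(scores)
mean = sum(scores) / n                 # 11.72
median = statistics.median(scores)     # 11 (middle score of the distribution)
mode = statistics.mode(scores)         # 11 (most frequent score)

# Deviation scores: each test taker's score minus the mean
squared_devs = [(x - mean) ** 2 for x in scores]

sd_population = (sum(squared_devs) / n) ** 0.5       # ~7.33 (divide by N)
sd_sample = (sum(squared_devs) / (n - 1)) ** 0.5     # ~7.49 (divide by N - 1)

print(mean, median, mode, round(sd_population, 2), round(sd_sample, 2))
```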
If we place the figures back on to a curve of normal distribution, as Fulcher (2010, p. 40) does, the mean (zero) sits in the centre, and for each standard deviation (marked on the diagram as –3sd to +3sd) we add or subtract 7.5. So, for example, the score we would expect at one standard deviation above the mean is 11.72 + 7.5 = 19.2 (rounded to one decimal place). The most important observation is that if a learner scores 19, we know that approximately 16 per cent of the test takers are expected to score higher, and approximately 84 per cent to score lower. We know this because of the probability of scores occurring under the normal curve. The meaning of the score is therefore its place on the scale, measured in standard deviations.
Actually, the scores most frequently used are z scores which, without going into how they’re calculated, express a raw score as a number of standard deviations above or below the mean. But since z scores are numbers such as –0.6, 1.3, or –1.16, scores on high-stakes tests are reported quite differently. For example, in the Gaokao exam, discussed in Part One, the reported score ranges from 100 to 900: the mean is 500, the standard deviation is 100, and the range of 100–900 therefore covers four standard deviations either side of the mean. The z scores (an abstraction from raw scores in terms of the normal curve of distribution) are transformed using a simple formula to create the scale on which scores are reported. In the Gaokao, test takers can find their position in the population by logging on to the test website, which tells them their score and how well they did in relation to all other test takers. They know that their score’s meaning is its place on a curve of normal distribution, as expressed through the standardised test score, and that this will determine the university or college they will attend.
Fulcher (2010) takes two test scores from two hypothetical students. Hui scores 717 on the standardised scale. Looking at the table Fulcher supplies in an appendix, we discover that the entry for this cell is 9850, i.e., 98.5 per cent. This means that Hui is in the top 100 – 98.5 = 1.5 per cent of the test-taking population. This is wonderful news. Zhi, on the other hand, has a score of 450. As the table in the appendix does not go below 500, we can look up the entry for 550 instead and, because the normal curve is symmetrical, treat the number in the cell as the proportion scoring above Zhi rather than below. This tells us that 69.15 per cent of test takers are expected to score higher than Zhi. Alas, Zhi will probably have to be content with a second- or third-choice university.
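Nothing here depends on the printed table: the same figures fall out of the normal distribution directly. A minimal Python sketch (my illustration), assuming the reported scale is a straightforward linear transformation of z scores with mean 500 and standard deviation 100:

```python
from statistics import NormalDist

# Assumed reporting scale: mean 500, standard deviation 100 (as described above)
gaokao = NormalDist(mu=500, sigma=100)

def scaled_score(z: float) -> float:
    """Transform a z score onto the reported 100-900 scale."""
    return 500 + 100 * z

# Hui, 717: proportion of test takers expected to score lower
print(round(gaokao.cdf(717), 4))       # 0.985  -> top ~1.5 per cent
# Zhi, 450: proportion of test takers expected to score higher
print(round(1 - gaokao.cdf(450), 4))   # 0.6915 -> 69.15 per cent score higher

print(scaled_score(2.17))              # 717.0 (a z score of 2.17)
```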
Judging the test
So much for the technical aspects of determining the meaning of scores in normative testing (in fact, it’s a hopelessly brief summary – see Fulcher, 2010, Chapter 2, for a proper account), but the question remains: how good is the test? I’ll give an even more hopelessly brief answer to this question by mentioning validity and reliability. By validity is meant the degree to which a test or examination measures what it purports to measure. However, Messick’s (1989) work expanded our understanding of validity in such a way that it is now seen as a single concept with a number of different aspects.
First, consequential validity extends the possible responsibility of the test developer to all uses of the test. It raises the question of the extent to which the score is relevant and useful to any decisions that might be made on the basis of scores, and whether the use of the test to make those decisions has positive consequences for test takers. The question of relevance and usefulness relates to whether it can be shown that the inferences we draw from a test score about the knowledge, skills and abilities of a test taker are justified. This is the substantive aspect of validity that replaces the traditional definition above.
Next is the structural aspect. If we claim that a test provides information on a number of different skills or abilities, it should be structured and scored according to the skills and abilities of interest.
Thirdly, the content of the test should be reasonably representative of the content of a course of study, or of a particular domain (such as ‘aviation English’ or ‘travel Spanish’) in which we are interested. We often wish the test score to be meaningful beyond the immediate questions or tasks on a particular test, as we cannot put all content, situations and tasks on any test; it would simply be too long. So the fourth aspect is generalisability of score meaning beyond the test itself, or whether it is predictive of ability in contexts beyond those modelled in the test.
Finally, there is the external aspect, or the relationship of the scores on the test to other measures of the same, or different, skills and abilities. We would hope that tests of a particular skill would provide similar results. Convergence gives us more confidence in the test outcomes.
Measures of validity try to ensure that tests provide a strong link between inferences and decisions, and that test use has a positive impact on people and institutions. Whatever the test is used for, we need a convincing argument that it is useful for its purpose.
As for reliability, Fulcher (2010) cites Lado’s 1961 definition: “Does a test yield the same scores one day and the next if there has been no instruction intervening? That is, does the test yield dependable scores in the sense that they will not fluctuate very much so that we may know that the score obtained by a student is pretty close to the score he would obtain if we gave the test again? If it does, the test is reliable”. Fulcher spends some time explaining different measures of reliability, but I won’t discuss them here. Let’s go now to modern language testing, for which I paraphrase bits of Jordan & Long (2022), Chapter 11.
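Lado’s definition is essentially test–retest reliability: give the same test twice, with no intervening instruction, and check how closely the two sets of scores agree. A minimal sketch with made-up scores (the data and the choice of the Pearson correlation as the estimate are my assumptions, not Fulcher’s or Lado’s):

```python
from statistics import correlation  # Python 3.10+

# Hypothetical raw scores for ten learners on two sittings of the same test
first_sitting  = [12, 15, 9, 22, 18, 7, 25, 14, 11, 19]
second_sitting = [13, 14, 10, 21, 19, 8, 24, 15, 10, 20]

# Test-retest reliability estimated as the Pearson correlation between sittings;
# a value close to 1.0 means scores barely fluctuate from one administration to the next.
r = correlation(first_sitting, second_sitting)
print(round(r, 3))
```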
Modern Language Testing
The first scientific attempts to make language testing reliable and interpretable employed discrete-point testing. The same discrete-point test was administered to large samples of examinees, and researchers, including John Oller (1979), showed how to produce a .97 (almost perfect) Pearson coefficient of internal reliability, provided the numbers in the sample are large enough and the range of language ability across test takers is varied enough. Discrete-point tests focused on isolated grammar points and tested receptive language ability only. Their advantages included many data points, both human (test takers) and linguistic (discrete grammar targets). They were easy to construct (fill-in-the-blank or multiple choice), easy to score (e.g., 1 point per item), and able to report reliability as, for example, a correlation coefficient.
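Internal consistency for dichotomously scored items is commonly estimated with a statistic such as KR-20. The sketch below is purely illustrative (the item-response matrix is invented, and KR-20 is not necessarily the coefficient Oller reported), but it shows why such tests were so easy to score and to report on:

```python
# KR-20 internal-consistency estimate for dichotomously scored (1/0) items.
# Hypothetical item-response matrix: rows = test takers, columns = items.
responses = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 1, 1, 0, 1, 1, 1, 0],
]

k = len(responses[0])                            # number of items
totals = [sum(row) for row in responses]         # raw score per test taker
n = len(totals)
mean_total = sum(totals) / n
var_total = sum((t - mean_total) ** 2 for t in totals) / n  # variance of raw scores

# p = proportion answering each item correctly; q = 1 - p
pq_sum = sum(
    (sum(row[j] for row in responses) / n) * (1 - sum(row[j] for row in responses) / n)
    for j in range(k)
)

kr20 = (k / (k - 1)) * (1 - pq_sum / var_total)
print(round(kr20, 2))   # ~0.81 for this toy data set
```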
But they weren’t valid. What did it mean to get some number of points on a grammar test? What could the learners do in the L2 if they had a particular total number of points on a test? How was a higher score better than a lower score when it came to performance? Was a student who scored 90 points really better than another student who scored 70? There was no way to tell. This is because discrete-point tests measure metalinguistic knowledge, which is knowledge about language, and that kind of (declarative) knowledge is not readily transformed into the ability to use language for real-world purposes.
Skills-based language testing
For a while, testers tried to measure language skills separately, beginning with the receptive skills, listening and reading. As with discrete-point testing, the productive skills, speaking and writing, were often ignored because they were difficult to test with large numbers of students, for example, international students throughout the world. However, language is typically not used one skill at a time. For example, making a telephone call, handling a service encounter, and participating in a graduate seminar all involve both listening and speaking, and sometimes reading as well.
Another problem is that generic labels like “listening,” “speaking,” “reading” and “writing” are just that, generic and unspecified. What does it mean to say that someone is good at reading? What kind of reading? Reading what kind of material? At what speed? The same can be said of the other skills. Is someone who can “listen to the radio” able to listen to the news and/or to a sports commentary equally well? If someone gets a high score in “speaking,” what does that mean? Can the person speak fluently, or as politely or informally as appropriate? And what of writing ability? If one can write a good test essay, does that transfer to future writing demands such as emails, lecture notes, or formal memos? Once again, there was no way to tell, and there was no way to predict performance in real-world discourse domains (academic or occupational).
Proficiency testing
Picking up speed in the 1970s, with major backing from profit-making commercial testing companies, such as Cambridge Assessment, the Educational Testing Service (ETS), and the International English Language Testing System (IELTS); government entities, such as the British Council, the Council of Europe, and the United States Government Interagency Language Roundtable (ILR); and academic organizations, such as the American Council on the Teaching of Foreign Languages, there was a spate of data-free proficiency test development, where “proficiency” became an epiphenomenon, capable of being divided into levels on a proficiency rating scale. To determine these levels, groups of people gathered together to write descriptions (proficiency level descriptors), usually relying on the intuitions of experienced teachers to tell them which descriptors belonged at which level, and produced a scale of proficiency consisting of, say, three, four, or six levels on their particular scale.
For example, the ACTFL (American Council on the Teaching of Foreign Languages, 1985) proficiency scale has the following levels: Novice, Intermediate, Advanced, Superior and Distinguished, with the first three levels further sub-divided into Low, Mid and High. A sample ACTFL rating could be “Advanced Mid.” Meanwhile, the ILR proficiency scale runs from No Proficiency through Memorized, Elementary, Limited Working, General Professional and Advanced Professional Proficiency to Functionally Native Proficiency, with further “plus” ratings in between. For example, a rating at Level 2, “Limited Working Proficiency”, can be ILR 2 or ILR 2+. Finally, the Council of Europe’s Common European Framework of Reference (CEFR) for Languages comprises three broad levels, each subdivided into two sub-levels: A1 and A2 (Basic User), B1 and B2 (Independent User), and C1 and C2 (Proficient User).
Levels of proficiency scales are no more informative than scores on a discrete-point test, just less reliable because they are impressionistic. Proficiency levels appear to be more informative than discrete-point tests due to the skill-level descriptions and to the “can-do” statements that have been added to the levels over the decades (once again not empirically based); however, even a cursory examination shows that these are based on impressionistic judgements. As Long, Gor & Jackson (2012, p. 103) pointed out, “the characterizations are sometimes so vague and general as to require considerable imagination on the reader’s part.” Take, for example, the description of CEFR level B1: “Can understand the main points of clear standard input on familiar matters regularly encountered in work, school, leisure, etc. Can deal with most situations likely to arise whilst travelling in an area where the language is spoken. Can produce simple connected text on topics which are familiar or of personal interest. Can describe experiences and events, dreams, hopes & ambitions and briefly give reasons and explanations for opinions and plans” (Council of Europe, n.d.). Long et al. stress that these descriptions can obviously mean very different things to different people and in practice.
Only zero and near-native proficiency levels are truly measurable. We know this from the results of countless empirical SLA studies that have tried to identify the advanced learner, which has required the ability to distinguish near-native speakers from true native speakers. The results of these studies consistently show that such distinctions are possible, provided the measures are sufficiently sensitive (Hyltenstam, 2016). Any other distinctions along proficiency scales are largely impressionistic, which the language assessment field needs to get away from if it is to be taken seriously as evidence-based.
Beyond the proficiency scale descriptors, there are numerous problems in the tests that elicit the language samples on which scores and ratings are based. For example, proficiency tests typically employ speaking prompts and reading texts which purport to have been “levelled,” i.e., judged to be aimed at the level concerned. This is nonsense. Apart from highly specialized material, all prompts and all texts can be responded to or read at some level; the amount of information conveyed or understood will simply vary as a function of language ability. Moreover, language sample elicitation is affected by knowledge of topics and culture. And, in elicitation systems such as ACTFL’s Oral Proficiency Interview, results are at least partially dependent on the test interlocutor’s skill in adjusting to the test taker’s perceived level.
Finally, it practically goes without saying that since they are not empirically based, there are no established correspondences among the major proficiency scales in use. And, as with discrete-point and skills-based testing, there is little evidence that proficiency ratings are predictive of success in any language use domain. Even if a test taker can succeed in the testing context, there is no way to tell whether this means the person will succeed outside that context, for example in using language for professional purposes.
The Common European Framework of Reference (CEFR) Proficiency Scale
Current English language testing is informed by the CEFR proficiency scale, which places learners of English as an L2 somewhere on a line from ‘can do hardly anything’ to ‘can do just about all of it’. To repeat what was said above, the CEFR scale descriptors are based entirely on intuitive teacher judgments rather than on samples of performance. Since the scales have no empirical basis, nor any basis in theory or in SLA research, they’re “Frankenstein scales”, as Fulcher calls them, unrelated to any specific communicative context and unable to provide a measure of any particular communicative language ability. To quote Fulcher (2010) again:
…. we cannot make the assumption that abilities do develop in the way implied by the hierarchical structure of the scales. The scaling methodology assumes that all descriptors define a statistically unidimensional scale, but it has long been known that the assumed linearity of such scales does not equate to how learners actually acquire language or communicative abilities (Fulcher 1996b, Hulstijn 2007, Meisel 1980). Statistical and psychological unidimensionality are not equivalent, as we have long been aware (Henning 1992). The pedagogic notion of “climbing the CEFR ladder” is therefore naïve in the extreme (Westhoff 2007: 678). Finally, post-hoc attempts to produce benchmark samples showing typical performance at levels inevitably fall prey to the same critique as similar ACTFL studies in the 1980s, that the system states purely analytic truths: “things are true by definition only” (Lantolf and Frawley 1985: 339), and these definitions are both circular and reductive (Fulcher 2008: 170-171). The reification of the CEFR is therefore not theoretically justified.
Current English language testing uses the CEFR scale in three types of test: first, placement tests, which assign students to a CEFR level, from A1 to C2, where an appropriate course of English, guided by an appropriate coursebook, awaits them; second, progress tests, which are used to decide if students are ready or not for their next course of English; and third, high-stakes-decision proficiency tests (a multi-billion-dollar commercial activity in its own right), which are used purportedly to determine students' current proficiency level.
The key place of testing in the ELT industry should already be clear (exam preparation materials are a lucrative part of publishing companies' business, and most courses of English provided by schools and institutes at all three educational levels start and finish with a test), but perhaps the best illustration of how language testing informs current ELT practice is the Pearson Global Scale of English (GSE), which allows for much more finely grained measurement than that attempted in the CEFR. In the Pearson scale, there are 2,000 can-do descriptors called “Learning Objectives”; over 450 “Grammar Objectives”; 39,000 “Vocabulary items”; and 80,000 “Collocations”, all tagged to nine different levels of proficiency (Pearson, 2019). Pearson’s GSE comprises four distinct parts, which together create what they proudly describe as “an overall English learning ecosystem” (Pearson, 2019, p. 2). The parts are:
- The scale itself – a granular, precise scale of proficiency aligned to the CEFR.
- GSE Learning Objectives – over 1,800 “can-do” statements that provide context for teachers and learners across reading, writing, speaking and listening.
- Course Materials – digital and printed materials, most importantly, series of General English coursebooks.
- Assessments – Placement, Progress and Pearson Test of English Academic tests.
Pearson say that while their GSE “reinforces” the CEFR as a tool for standards-based assessment, it goes much further, providing the definitive, all-inclusive package for learning English, including placement, progress and proficiency tests, syllabi and materials for each of the nine levels, and a complete range of teacher training and development materials. In this way the language learning process is finally and definitively reified: the abstract concepts of “granular descriptors” are converted into real entities, and it is assumed that learners move unidimensionally along a line from 10 to 90, making steady, linear progress along a list of can-do statements laid out in an easy-to-difficult sequence, leading inexorably, triumphantly, to the ability to use the L2 successfully for whatever communicative purpose you care to mention. It is the marketing division’s dream, and it shows just how far the commodification of ELT has already come. Finally, I’ll look at one example of a high-stakes test: the IELTS.
The IELTS Test
IELTS started life in 1980 as the ELTS (English Language Testing Service), designed and administered jointly by the British Council and the University of Cambridge Local Examinations Syndicate. In 1989, the Australian branch of IDP (International Development Program) joined the British Council and the Cambridge Assessment group, and together they launched the IELTS (Davies, 2007). The British Council (Future Learn, 2020) describes the IELTS as "the world’s most popular English language test for higher education and global migration". More than 3 million people took the IELTS exam in 2016; the test is currently administered at approximately 1,100 venues in 140 countries at a rate of up to four times a month and is recognized by over 10,000 organizations (test-users) globally (W.S. Pearson, 2019). There are two versions of IELTS: Academic and General Training. The Listening and Speaking parts are the same for both tests, but the subject matter of the Reading and Writing sections differs. The total test time is 2 hours and 45 minutes. The test scores are converted to a "band" score from 1 to 9, which, of course, maps onto the CEFR levels: Band 2 and under is equivalent to CEFR A1, and Band 9 is equivalent to CEFR C2.
When we look at the weaknesses of the test, we may begin with the evidence of a bias towards the linguistic norms of inner-circle Englishes, particularly those of the United Kingdom, the United States, and Australia, which confers an unfair advantage on candidates from linguistic backgrounds closely associated with inner-circle norms – Commonwealth countries with British English, for example, or Mexico with American English. This bias is particularly apparent in the accents heard in the listening test. Second, the writing test features notable idiosyncrasies. W.S. Pearson (2019) cites Moore and Morton's (2005) paper, which analyzes the criteria used in the IELTS Academic Writing Test and demonstrates that they promote a peculiar IELTS writing genre, closer to the spontaneous ‘public letter-to-the-editor’ genre than to the genre found in academic journals.
Next, the speaking test has received much criticism. Some of the criticism is aimed at poor content. For example, Roshan (2013) highlights cultural bias, citing Khan (2006), who reports on his experiences as an IELTS examiner in Bangladesh and, on the basis of data collected from 18 local examiners, claims that the test manifests cultural biases inherent in the topics, vocabulary, terminology and question patterns of the speaking test. Khan gives the example of the difficulty Bangladeshi IELTS candidates had in responding to cues about "holidays” and “souvenirs”. Given that, at least in 2006, tourism within Bangladesh was extremely limited, due to a general lack of financial resources, these words did not exist in the candidates’ “linguistic and cultural repertoire”.
The criteria for rating candidates’ speaking ability have also come under fire. Roshan cites the Read and Nation (2006) study, where examiner inconsistency in rating lexical resources was particularly noticeable, even though lexical resource is a distinct and important component in the IELTS speaking test rating scales. There are also the effects of financial factors. To save on costs, the IELTS speaking test relies on a single examiner, despite general agreement among experts that at least two independent ratings of each individual speaking test sample are required in order to minimize inconsistency within the individual ratings (see, for example, Bachman, 2010). In a further attempt to increase efficiency, the IELTS interview has been cut from four to three parts and now has a time limit of 11 to 14 minutes. We have already discussed the inherent weaknesses in any test that uses proficiency scale descriptors to place samples of candidates' oral production on a band from 1 to 9. If we add to those weaknesses the fact that the oral samples comprise very short responses to three cues, that the cues are sometimes culturally biased, that the rating criteria are not evenly applied, and that the sample is rated by a sole examiner, we surely have good grounds to question the reliability, validity and fairness of the test scores.
Moving to concerns about the administration and management of the test, there is first the issue of discrimination based on economic inequality. W.S. Pearson notes that the test "exacts a notable economic burden on its test-takers, particularly those who do not achieve their required band scores first time around" (W.S. Pearson, 2019, p. 281). The test fees are high and vary significantly, from the equivalent of approximately US$150 in Egypt to double that in China, a difference explained more by Chinese students’ desire to study abroad than by any international differences in administration or management costs. Further costs to test-takers include possible transport and accommodation costs, preparation materials, and exam preparation classes or courses. Such are the expenses involved in taking the IELTS tests that they evidently discriminate against those with lower economic means and make it impossible for some people to take the test multiple times in order to achieve the required score.
W.S. Pearson (2019) also points out that the owners of IELTS produce and promote commercial IELTS preparation content, which takes the form of printed and online materials and teacher-led courses. These make further financial demands on the test-takers, and while some free online preparation materials are made available on the IELTS website, full access to the materials costs approximately US$52 and is free only for candidates who take the test or a preparation course with the British Council. Likewise, details of the criteria used to assess the IELTS writing test are only freely available to British Council candidates; all other candidates are charged approximately US$55 for this important information. Finally, it should be noted that it is common, for those who can afford it, to take the IELTS multiple times in an attempt to improve their scores, and that the score obtained in an IELTS test is only valid for two years.
We come now to the uses to which the IELTS tests are put. We have seen that IELTS Academic is used by tertiary education institutions and universities all over the world to regulate the acceptance of overseas students. Its suitability for this purpose is severely undermined by the essential flaws in its design, as discussed above. Even if these flaws were addressed, it is extremely unlikely that the test could ever be fit for purpose. Those who take the IELTS Academic are not a homogenous group: the English needs of a nursing assistant have little in common with those of a post-doctoral student of organic chemistry, for example. Pilcher and Richards (2017) conducted interviews and focus groups with lecturers in the subject areas of Design, Nursing, Engineering, Business, Computing and Psychology, and researched the English required in each subject. They concluded that determining English preparedness should be undertaken within the subject context, and that "it is necessary to challenge the power invested in IELTS". What makes the situation even worse is that although those who run IELTS periodically point out to academic institutions that they must carefully consider factors such as age, educational background, and first language when interpreting a candidate's scores, it seems that the test scores are, in fact, used without taking any notice of such factors (Coleman et al. 2003; Hyatt 2013, cited in W.S. Pearson, 2019).
The simplicity and efficiency with which such test scores can be processed strengthens the perception that IELTS scores are ‘an easy short cut … concerning admissions to English-medium HE [higher education] institutions’ (Hall 2009: 327). Rather than attempt to carefully interpret the scores with the help of information provided by the IELTS partners, users of the IELTS tend towards the unquestioned acceptance of the predictive power of its scores: if an overseas student does not achieve the required score, their application for admission to the university is normally turned down. Even more questionable is the use of the test by employers to assess prospective employees’ ability to function in the workplace, despite the fact that, in most cases, none of the test tasks closely corresponds with what an employee is expected to do in the job. Worst of all, band scores in the test are used by some national governments as benchmarks for migration: it is, we suggest, quite simply immoral to use a score on an IELTS test to deny a person's application for immigration.
In conclusion, those who seek to study at universities abroad, to work for a number of large multinational companies, or to migrate are forced to engage with IELTS (or a comparable test such as TOEFL) on the terms set by the test owners, conferring on the owners considerable global power and influence; and they suffer dire consequences if they fail to achieve the required mark in tests which, in a great many cases, are not fit for purpose.
In Part Three, I’ll look at criterion-based performance tests. Meanwhile, if you want to see how one of ELT’s most celebrated gurus - “The Maestro” as he’s often called – explains why teachers should love testing, follow this link:
References
American Council on the Teaching of Foreign Languages. (1985). ACTFL proficiency guidelines (Rev. ed.). ACTFL Materials Center.
Fulcher, G. (2010). Practical language testing. Hodder Education.
Hyltenstam, K. (2016). Advanced proficiency and exceptional ability in second languages. De Gruyter Mouton.
Jordan, G., & Long, M. (2022). ELT: Now and how it could be. Cambridge Scholars.