Key Principles for Creating High-Quality Assessments

Student assessments are a key part of the education system. The data and insights they generate provide valuable feedback on the performance of students, teachers, schools, and the education system in general. Additionally, because stakeholders in the system place great emphasis on a good performance in examinations, improving assessments can have positive upstream effects across the system. ‘Teaching to the test’ can promote learning if the test is a good one.

Over the last two decades, Ei has created thousands of assessments at the school level. We have also studied the assessments used in our schools as well as the school-leaving exams. Based on our collective experience over this period, we share certain key principles that, if adhered to, can improve the quality of assessments (and consequently of student learning).

These principles are:

1. Testing key concepts and core knowledge, not peripheral facts

2. Using questions that are unfamiliar in the way they are framed or in their context

3. Having questions covering the entire range of difficulty in a paper

4. Ensuring that difficult questions are based on ‘good’ sources of difficulty

5. Using authentic data in questions

6. Avoiding narrowly defining a large number of competencies and then only having questions testing individual competencies

7. Using different question types to test different aspects of student competencies

8. Providing test creators access to past student performance data, to use while designing questions

9. Designing answer rubrics to capture errors and misconceptions

10. Publishing post-examination analysis booklets

11. In large-scale exams, reporting results using scaled scores and percentiles

Assessments are and shall continue to remain a ‘north star’ that guides the actors in the education system. They will therefore continue to have an outsized effect on the priorities of stakeholders. Adhering to these principles, we believe, will help create well-designed assessments that do not remain just a ‘necessary evil’ but positively influence the education system, and subsequently our workforce and society.


Principle 1. Testing key concepts and core knowledge, and not peripheral facts:

Examinations should primarily test a student’s understanding of key concepts. It is also okay if they test certain facts, as long as those facts are core to the subject. For example, the concepts of compounds and mixtures and the difference between them represent a fundamental understanding of matter. The fact that the Earth's axis is tilted from the perpendicular to its plane of revolution is a core fact which is okay to test. On the other hand, the amount by which the Earth's polar circumference is less than its equatorial circumference is an unimportant one and should not be tested. Yet many of our exams test peripheral or trivial facts like these.

Trivial facts should not be tested, not just because they can easily be looked up on any mobile phone, but also because they may displace core understanding. One way to test whether a certain question is a valid one to ask is to check whether a high percentage (say 70%) of practising experts in the subject would answer it correctly. Every physicist will know the difference between compounds and mixtures and all about the tilt of the Earth's axis, but most geographers would probably NOT know that the polar circumference is 72 km less than the equatorial circumference. Questions that test reasoning and higher-order thinking are also more relevant to real-world tasks and challenges today, and hence more important.

The figure below contrasts a question testing mechanical learning with one testing real learning with understanding.

Figure 1: Asking the definition of a peninsula tests for mechanical learning while the alternative shown expects students to understand the characteristics of peninsulas even if they cannot give the textbook definition.

To summarise, not everything that can be asked should be asked. Rather, assessments should focus on concepts, and on the facts that would serve as a foundation for real-life application or future learning.


Principle 2. Using questions that are unfamiliar in the way they are framed or in their context:

Most examinations in our country, both at the school and Board level, tend to have questions that are typical or fit a standard form. Questions rarely use unusual or unfamiliar contexts or forms. So students develop the techniques and confidence to answer those standard questions (often by reproducing the solutions in the textbooks). They learn that when they encounter a question in an exam, they should 'pattern-match' it to the closest question they have seen in the textbook or in class and apply the same procedure. Unfortunately, this actually works and yields the expected result for most questions, so the 'learning' is reinforced. Students gain no exposure to problems that are presented differently or need to be tackled differently. The process of first trying to understand a problem, then thinking about it, and then attempting to solve it step by step is largely unknown to them.

Thus, this is nothing more than a form of rote learning, where procedures or patterns, if not facts, are memorised. When faced with unfamiliar problems, whether in modern tests like PISA, competitive tests, or even unexpected real-life situations, students feel flabbergasted or unprepared, or conclude that the question is 'out of syllabus'. Most students lack the confidence to even attempt such questions.

Whether we want to check that students have really learned concepts or to prepare them for future exams, tests should contain questions that test a prescribed set of concepts in an unfamiliar way.

What do we mean by 'unfamiliar' questions? Questions can be unfamiliar in different ways:

  • they may be framed using real-life contexts (e.g. sports, technology, art, music, market transactions) which are not used for that concept in the textbook
  • they may be framed in the context of contemporary developments (e.g. COVID-19, cryptocurrency, an important current event) which too would be 'new' for them
  • they may integrate concepts taught in different subjects (e.g., show a graph recorded by a seismograph during an earthquake and ask a simple interpretation question in a mathematics test)
  • they may simply test for conceptual understanding, misconceptions or higher-order cognitive skills in any form that has not been discussed in the textbook

Figure 2 shows a question testing a concept related to evaporation in an unfamiliar but real-life context.

Figure 2: A question testing a concept related to evaporation in an unfamiliar but real-life context

Being able to apply conceptual understanding in unfamiliar contexts is a critical life-skill. Asking such questions in exams would automatically ensure their use in classroom teaching and help develop such skills.

It is important to note two points about unfamiliar questions. Firstly, they are not necessarily difficult questions; once students have understood the problem, it may actually be easy to solve. Secondly, not all the questions in a test need to be unfamiliar – up to 30–40% of the questions can be familiar to students and thus answerable even by weaker students.

Finally, creating such questions may seem challenging, and it does require effort. Some tips are discussed in Box 1.

Box 1: Tips for creating unfamiliar questions


Principle 3. Having questions covering the entire range of difficulty in a paper:

A key purpose of most examinations is to discriminate between students of different levels of ability. To do so, they must contain questions that, taken together, cover the entire range of student ability.

Since there will be test-takers with low, medium, and high ability levels, the examination must be able to discriminate properly between them. To do this, there must be a good mix of easy, medium, and difficult questions. Students with a poorer knowledge of the subject matter will be able to solve only the easiest questions, whereas those with a stronger grasp will answer more difficult ones, with only the highest-ability students able to answer the most difficult questions.

How does one know the difficulty level of questions while setting them? This is not easy: question setters may estimate difficulty, but these estimates are often inaccurate. The only way to obtain this information is to pilot items and record the performance data. (If performance data from past items is available, the difficulty of similar items can sometimes be judged reasonably accurately. Also, if a group of experts regularly sets questions and then analyses the actual performance, they develop a good sense of how students perform on different types of items – though regular pilots are always necessary.)
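As a simple illustration of what a pilot yields, the Python sketch below computes the classical difficulty of each piloted item as the proportion of pilot students who answered it correctly. This is a minimal, hypothetical example – the response matrix and item count are invented, and it is not Ei's actual analysis pipeline.

```python
import numpy as np

# Hypothetical pilot data: rows = pilot students, columns = items.
# 1 = answered correctly, 0 = answered incorrectly. In practice this
# would come from the scored responses of an actual pilot sample.
responses = np.array([
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 1, 1],
    [1, 1, 1, 1, 0],
])

# Classical item difficulty ("p-value"): the proportion of pilot students
# who answered each item correctly. Higher values mean easier items.
difficulty = responses.mean(axis=0)

for i, p in enumerate(difficulty, start=1):
    print(f"Item {i}: answered correctly by {p:.0%} of pilot students")
```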

Currently, many public examinations tend to have few questions at a higher level of difficulty (and in some cases none). For example, an analysis of a past board paper shows that all the multiple-choice questions in the paper had difficulty parameters in a very narrow range, with most of them discriminating only among students of medium ability. The presence of too many easy and too few difficult questions skews the distribution of results and also leads to marks inflation (which pushes up college cut-offs and increases pressure on students, as a single mark makes a huge difference).

Figure 4 shows the difficulty distribution of questions in a recent ASSET paper. The difficulty parameter represents the average performance of the item. Thus there are items answered correctly by over 90% of students while others were answered correctly by only 12% of students. Furthermore, there are items at almost every level of intermediate difficulty.

Figure 4: The distribution of items across difficulty levels in a class 9 mathematics ASSET paper

If questions at all difficulty levels are properly represented in a paper, the student results will also form a normal curve (which is appropriate, since student abilities themselves are approximately normally distributed). This is a necessary (though not sufficient) condition for a good assessment.
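As a rough illustration of this check, the sketch below examines whether a set of total scores is approximately bell-shaped by looking at its skewness and a simple text histogram. The scores here are simulated stand-ins, not data from any actual board or ASSET paper; in practice a strong pile-up of scores near the maximum would point to too many easy questions.

```python
import numpy as np
from scipy import stats

# Simulated stand-in for the total scores of a cohort; in practice these
# would be the actual marks from an administered paper.
rng = np.random.default_rng(0)
scores = rng.normal(loc=50, scale=12, size=2000).clip(0, 100)

# Skewness near 0 and a roughly bell-shaped histogram suggest the paper
# discriminated across the whole ability range; strong negative skew
# (scores piled up near the maximum) would suggest too many easy items.
print("mean score:", round(scores.mean(), 1))
print("skewness:  ", round(float(stats.skew(scores)), 2))

counts, edges = np.histogram(scores, bins=10, range=(0, 100))
for count, left in zip(counts, edges[:-1]):
    print(f"{int(left):3d}-{int(left) + 10:<3d} {'#' * int(count // 20)}")
```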


Principle 4. Ensuring that difficult questions are based on ‘good’ sources of difficulty:

Based on their content, examination questions can be difficult for ‘good’ or ‘bad’ reasons. For example, students may find certain questions difficult because they test multiple skills simultaneously. Such questions encourage students to engage in higher-order thinking and to integrate aspects they have learnt, and can be said to be built on good sources of difficulty.

On the other hand, questions that are based on ‘bad’ sources of difficulty may test irrelevant facts or may require students to engage in tedious calculations.

While ‘good’ sources of difficulty can encourage meaningful learning, ‘bad’ sources of difficulty may cause students to lose interest in the subject (Box 2 lists some common good and bad sources of difficulty).

Further, a good test should also have the different good sources of difficulty well represented, with no single source over-represented, so that at an overall level the test can discriminate well across students of all ability levels.

Principle 3 highlighted the importance of assessments containing questions across a range of difficulties. Even within this range, though, questions should be based only on good sources of difficulty.

Box 2: Common good and bad sources of difficulty

Principle 5. Using authentic data in questions:

Questions in exams should, as far as possible, contain authentic data and examples from the real world, even in situations where the use of fictitious examples or data would otherwise suffice. The use of real-life contexts and data in examinations can make questions more engaging, and help students understand the practical importance of their education. Therefore, in addition to testing concepts, these questions become teaching tools in themselves. Their use in examinations will also encourage teachers to structure classroom instruction accordingly. A sample item using authentic data is shown in Figure 5.

Figure 5: A sample assessment item using authentic data

For example, if scores from sporting competitions are used in a question testing the concept of averages, they should be data from actual sporting events. Similarly, when students studying geography are questioned about plate tectonics, they should be given examples of real tectonic plates and their movements if possible. In language examinations, comprehension passages can be from real texts across domains such as history, science, or economics.

Of course, in some cases, the complexity of information may need to be moderated or simplified to be suitable for the targeted class level.


Principle 6. Avoiding narrowly defining a large number of competencies and then mapping questions to individual competencies:

There seems to be a widespread but mistaken notion that good education and assessment require a large number of competencies to be listed for each subject and individual questions in assessments to be mapped to individual competencies. Further, some seem to believe that merely doing this will lead to good assessments and, by extension, good education. 'Competency Based Education', a laudable goal, is sometimes understood in this narrow sense. In our experience, listing competencies and then mapping questions to competencies are both largely mechanical steps, and may actually increase rather than reduce the rote component of an assessment.

The belief that students need to acquire key competencies is valid. However, the idea that this can be achieved in a mechanistic manner – first listing competencies and then creating questions or content that maps to those competencies – is flawed. Only the quality of content and assessments can lead to good teaching or learning, not merely a mapping.

Particularly in examinations, overly specific mapping leads to the use of narrowly structured examination questions that test only particular competencies, and that too in isolation. In fact, good questions that test multiple competencies are usually excluded in such a process because they breach artificially defined boundaries for competencies, making the paper more mechanical.

This problem is present even in ‘advanced’ education systems. In the USA, for example, the Common Core was introduced to establish set standards and competencies for student education, to improve learning outcomes. Though there was a lot more to the Common Core, in many cases, it was treated by teachers merely as a list of standards to be rigidly focussed on through lessons or questions.[1]

While examination boards must establish a set of necessary skills, concepts, and learning outcomes to guide the education system, they should not be overly prescriptive in how questions test them or aim to break them into very fine sub-categories.

Principle 7. Using different question types to test different aspects of student competencies:

We often hear debates and arguments about how certain types of questions (say objective or subjective, or multiple-choice questions) are inferior or superior to other types of questions. However, the reality is that each question type has its strengths, weaknesses, and suitability based on the subject and the goal of the assessment. It may be said that a comprehensive assessment will have a mix of various types of questions, each used for its own strengths, as described in Box 3.

Box 3: Strengths and suitability of different question types
Figure 6: Technology-enhanced items from mathematics and language. Students interact with such questions which not only record details of these interactions but may also adapt based on student responses.


Principle 8. Providing test creators access to past student performance data, to use while designing questions:

Especially for large-scale or summative examinations, test creators should be given access to data on past assessments. This provides insights of two types – first, about which items worked and what issues, if any, there were with items, and second, about the kinds of student responses and errors.

Knowing which items functioned well and which did not helps create better items for future assessments. Item data may indicate difficulty, discrimination, the extent of guessing, the ability levels of the students who answered the item correctly, and the wrong responses students gave. Though not all of this data may be available for every question, each piece of information provides valuable insights. As mentioned earlier, past item data also helps question makers estimate the difficulty of similar new items and thus create questions of varying difficulty in the paper.
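To make this concrete, here is a minimal sketch of the kind of classical item statistics such performance data makes possible: a difficulty value and a discrimination value for each item, the latter computed as the correlation between the item score and the score on the rest of the test. The response data is invented for illustration; real analyses would typically also use IRT models and distractor-level breakdowns.

```python
import numpy as np

# Hypothetical scored responses from a past paper:
# rows = students, columns = items (1 = correct, 0 = incorrect).
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 1, 0, 0],
])

total = responses.sum(axis=1)

for i in range(responses.shape[1]):
    item = responses[:, i]
    # Difficulty: proportion of students answering the item correctly.
    p = item.mean()
    # Discrimination: correlation between the item score and the score on
    # the *rest* of the test (so the item is not correlated with itself).
    # Higher values mean the item separates strong and weak students well.
    rest = total - item
    r = np.corrcoef(item, rest)[0, 1]
    print(f"Item {i + 1}: difficulty = {p:.2f}, discrimination = {r:.2f}")
```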

(Knowing areas of student error is useful not just for assessment creators but for teachers as well. Principle 10 below discusses the benefits to teachers and future students when this data is shared with them in the form of post-examination analysis booklets.)

Having misconception data in the format shown in Box 4 helps in developing good assessment items testing misconceptions and in creating plausible distractors.

Box 4: A sample format for misconception data

Principle 9. Designing answer rubrics to capture errors and misconceptions:

For subjective questions in large-scale examinations that will be corrected by multiple evaluators, rubrics with clear marking guidelines should be prepared. This helps bring uniformity to the assessment of students by different evaluators.

This should ideally be done in a two-step process. First, provisional rubrics are created along with the question paper, based on discussions between question makers and select evaluators. These rubrics assign marks to different answer types. Next, once the test is completed and student answer sheets are available, a sample of them is selected and corrected by a team of experienced evaluators. Final rubrics are then made by accounting for answer types that were not covered in the provisional rubrics but were found to occur in the actual answers. The correction by senior evaluators also helps establish, by consensus, a ‘standard’ grade for each answer type, which is incorporated into the final rubric shared with all evaluators.

Well-designed rubrics serve an additional purpose, and the final rubric should be designed keeping this in mind – they can capture patterns in students’ responses to subjective questions. For this, evaluators should assign codes to each answer based on its content and the misconceptions it contains. For example, A1, A2, and A3 can be codes used to classify different forms of completely correct answers; B1, B2, and B3 can classify partially correct answers; and C1, C2, and C3 can classify completely incorrect answers. This can facilitate an aggregated analysis of subjective questions and prepare a data pool of common misconceptions for exam creators to incorporate.
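As a simple illustration of how such codes enable aggregated analysis, the hypothetical sketch below tallies evaluator codes for one question and reports the share of answers falling under each answer type and under the incorrect (C*) codes. The codes and counts are invented for the example.

```python
from collections import Counter

# Hypothetical evaluator codes for one subjective question, using the
# scheme described above: A* = completely correct, B* = partially correct,
# C* = completely incorrect, with sub-codes marking distinct answer
# patterns or misconceptions.
codes = ["A1", "B2", "C1", "A1", "B1", "C1", "A2", "B2", "C1", "B2",
         "A1", "C2", "B1", "C1", "A1", "B2"]

counts = Counter(codes)
n = len(codes)

print("Distribution of answer types:")
for code, count in sorted(counts.items()):
    print(f"  {code}: {count:2d} ({count / n:.0%})")

# Share of answers showing any incorrect pattern (all C* codes); these
# frequencies can feed the misconception pool described in Principle 8
# and the post-examination booklets of Principle 10.
incorrect = sum(c for code, c in counts.items() if code.startswith("C"))
print(f"Answers coded as incorrect: {incorrect / n:.0%}")
```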

These rubrics should be clear and objective, with grading criteria that ensure standardisation. However, they should be used for grading only by subject matter experts who can discern subtle differences in student responses. A rubric for a PISA released item is shown in Figure 7.

Figure 7: Rubric of a PISA released science item
Source: PISA Released Items – Science


Principle 10. Publishing post-examination analysis booklets:

After each round of examinations, an aggregated analysis of all questions should be prepared and distributed among teachers, students, and parents, documenting trends in common misconceptions, sample answers, etc. The version shared with teachers should include additional detailed analyses, containing methods to address commonly found misconceptions and errors. Both quantitative and qualitative data should be synthesised for these purposes (Box 5 provides an example of how a question representing an important misconception may be presented, along with suggestions for teachers).

This will also create transparency in the process of examinations; all stakeholders will have a clear sense of expectations from exams and can prepare or support accordingly.

These analyses should be made public in a timely manner (within 3–4 months of an examination cycle) so that their findings can be acted upon. This is crucial for any meaningful improvement of examinations, and by extension, the education system.

Box 5: An example of a question representing an important misconception, with suggestions for teachers

Principle 11. In large-scale exams, reporting results using scaled scores and percentiles:

Examinations should use scaled rather than raw scores to report results. In simple language, scaled scores represent a student’s performance on a consistent, standardised scale, taking into account the differences in difficulty between questions. Further, these difficulties are calculated based on actual student performance. Scaled results therefore reflect student performance much more accurately than raw scores. Internationally, it is common practice to use scaled scores for most large exams. Even in India, most competitive exams for college admissions use scaled scores when reporting results.

Once scaled scores are tabulated, each student’s result should be declared as a percentile rather than a percentage, meaning that it is expressed relative to other students’ performance. This helps distinguish between student performances at a very fine level, for example, between students who have the same raw score.
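To illustrate the reporting step only (actual examination programmes use proper scaling or equating models to produce the scaled scores in the first place), the sketch below converts a set of hypothetical scaled scores into percentiles using SciPy. All numbers are invented.

```python
import numpy as np
from scipy import stats

# Hypothetical scaled scores for a cohort. In a real programme these would
# come from a proper scaling/equating model; here they are simply invented
# to illustrate percentile reporting.
rng = np.random.default_rng(1)
scaled_scores = rng.normal(loc=500, scale=100, size=10_000)

def percentile_of(score: float, cohort: np.ndarray) -> float:
    """Percentage of the cohort scoring at or below the given scaled score."""
    return stats.percentileofscore(cohort, score, kind="weak")

for s in (400, 500, 650, 720):
    print(f"Scaled score {s} -> {percentile_of(s, scaled_scores):.1f} percentile")
```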

Additionally, for public examinations like Board Exams, this makes establishing ‘cut-off’ scores easier. This is also why competitive assessments like the Joint Entrance Exam (JEE) for engineering college admissions in India use percentiles when declaring their results.

A big advantage of using scaled scores and percentiles is that they facilitate comparability across different examinations and different years. For example, students scoring, say, in the 93rd percentile in 2019 and in 2016 can reliably be considered to be of similar ability, as can students scoring similar percentiles in admission tests conducted by different states. This can make processes like college admissions much fairer, without having to worry about whether a particular Board is 'strict' or 'lenient'.


Conclusion:

Given that assessments are and shall continue to remain a ‘north star’ that guides the actors in the education system, they will always have an outsized effect on the teaching-learning process. Because current assessments in India tend to prioritise rote learning, they negatively impact the quality of education and are therefore perceived negatively by the public at large. They are considered a ‘necessary evil’ that serves certain functional purposes (sorting students based on ability), but little else.

However, if designed well, assessments possess significant potential to effect positive change throughout the education system. They can provide key feedback on student performance that enables focussed learning remediation. They also help discriminate between students of different abilities and aptitudes and can help them make informed choices about careers. Well-designed assessments (particularly large-scale assessments) are also a key barometer of our education system that can shine a light on areas requiring improvement. They will always have a multiplier effect on the education system and, by extension, our workforce and society. Using well-designed assessments can ensure that this effect is positive.




[1] Loveless, T. (2021, June 3). Why Common Core failed. Brookings. Retrieved January 24, 2022, from https://www.brookings.edu/blog/brown-center-chalkboard/2021/03/18/why-common-core-failed/
