Health of your Item Bank
Maintaining your Item Bank Health
An Item Bank allows you to write a question incrementally, over a period of time. One benefit of writing questions in stages is that you can plan the quality control effort around them. In the initial stages, quality control consists primarily of self-review and peer review. Beyond that, certain quantitative measures give a very good idea of the quality of each question and thus of the health of your Item Bank.
Pre-testing the questions
All you need to do is pre-test the questions that have passed the initial QC on a small but representative sample of students and collect standard data. This data can be analysed quantitatively to arrive at metrics that are easy to interpret. The commonly used quantitative methods, the data they require, the method of calculation, and how to interpret and benchmark the results are described below.
Core quantitative methods
1. Mean
2. Facility Index
3. Discrimination Index
The pre-testing exercise does not only indicate quality issues; it also pinpoints specific mistakes (such as a wrong option marked as the correct answer), so you can correct them and improve the quality of the questions before they are used in an exam.
Mean
The mean here is the arithmetic mean (AM), generally referred to as the average; in this case, the average marks. It is obtained by dividing the total marks scored by all candidates by the number of candidates who appeared for the test.
Calculating Mean
- If 100 students are given a test and their combined score is 2300, then the mean mark = 2300 / 100 = 23.
Values of mean
- For multiple-choice papers, the mean mark should be in the region of 60% of the total marks. E.g. if a test has 40 multiple-choice questions (1 mark each), a mean of about 24 marks would be expected.
- Note that if each multiple-choice question has 4 options, the chance of guessing correctly is 1 in 4, so the expected score from pure guessing on this 40-question paper is 10 marks.
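The arithmetic above can be sketched in a few lines of Python. The function names and the default 60% target are illustrative assumptions; only the numbers come from the examples above.

```python
def mean_mark(total_marks, num_candidates):
    """Arithmetic mean: total marks scored divided by the number of candidates."""
    return total_marks / num_candidates


def expected_mean(num_questions, marks_per_question=1, target_fraction=0.6):
    """Mean mark to aim for, taken as a fraction of the total marks (60% rule of thumb)."""
    return num_questions * marks_per_question * target_fraction


def guessing_score(num_questions, num_options, marks_per_question=1):
    """Expected score from pure guessing: a 1-in-num_options chance per question."""
    return num_questions * marks_per_question / num_options


print(mean_mark(2300, 100))    # 23.0
print(expected_mean(40))       # about 24 marks
print(guessing_score(40, 4))   # 10.0
```

A mean close to the guessing score would suggest that candidates are mostly guessing, which is why the two values are worth comparing.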
Facility Index (p)
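The facility index measures how easy a question is. It is given the symbol ‘p’. The p-value tells us the proportion of candidates who get the answer correct. For a multiple-choice question:
$$p = \frac{\text{number of candidates who answered correctly}}{\text{total number of candidates}}$$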
Calculating Facility index (Question No. 1)
Values of Facility index
- The facility value can vary from 1.0, where all the candidates get the answer right, to 0.0, where no one gets the right answer (or the answer key itself is wrong!).
- The mean facility of the paper should be between 0.5 and 0.6.
- Each distracter should have a facility of 0.05 or more, i.e. it should be chosen by at least 5% of the candidates.
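As a rough illustration, the facility of every option of a question can be computed from the raw responses. The response data, the option labels and the helper names below are hypothetical; the 0.05 threshold for distracters is the one quoted above.

```python
from collections import Counter


def option_facilities(responses, options=("A", "B", "C", "D")):
    """Proportion of candidates choosing each option of one question."""
    counts = Counter(responses)
    total = len(responses)
    return {opt: counts[opt] / total for opt in options}


def weak_distracters(facilities, key, threshold=0.05):
    """Distracters chosen by fewer than `threshold` of the candidates."""
    return [opt for opt, f in facilities.items() if opt != key and f < threshold]


# Hypothetical responses of 20 candidates to one question whose key is "B".
responses = ["A"] * 3 + ["B"] * 12 + ["C"] * 5
facilities = option_facilities(responses)

print(facilities["B"])                    # facility index p = 12 / 20 = 0.6
print(weak_distracters(facilities, "B"))  # ['D'], because nobody chose D
```

The facility of the keyed option is the p-value of the question; the facilities of the other options show whether each distracter is doing any work.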
Optimal Level of Facility Index
The optimal level for an acceptable p-value depends on the number of options per item. A formula that can be used to compute the optimal level is:
$$p_{\text{optimal}} = g + \frac{1 - g}{2}$$
where g = the chance level (the probability of guessing the correct answer).
For an MCQ with 4 options, g = 0.25; therefore, the optimal level of p for the test is 0.25 + 0.75 / 2 = 0.625 (62.5%). Questions with more options are more difficult to answer, so as you increase the number of options you would like to bring down the optimal level, which is what this equation does.
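A small sketch of this calculation, assuming the optimal level is the midpoint between the chance level g and 1 (an assumption that reproduces the 0.625 quoted for four options). The function name is illustrative.

```python
def optimal_facility(num_options):
    """Optimal p-value, assumed to lie halfway between the chance level and 1."""
    g = 1 / num_options          # chance level: probability of a correct guess
    return g + (1 - g) / 2


for n in (2, 3, 4, 5):
    print(n, round(optimal_facility(n), 3))   # e.g. 4 options give 0.625
```

The loop shows the optimal level falling as the number of options grows.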
Discrimination Index (d)
This is the correlation of (responses to individual items) with (overall test score). The higher the correlation, the more the item results are consistent with the test as a whole. In other words, this measures whether the candidates who chose a particular option were generally the abler ones.
The logic is that students who have done well (scored high marks) on the test as a whole should do well on any individual item compared with students who have not done so well on the test as a whole, and vice versa. E.g. it is a very logical occurrence if, on a given question, the top 20% of the students got it correct 80% of the time whereas the bottom 20% got it correct only 35% of the time. So, you expect a strong positive correlation here.
The higher the value of d, the more effective the item is. When d is 1.00, all test takers in the upper group and no test takers in the lower group answered the item correctly. Conversely, if none of the upper group but all of the lower group answered an item correctly, the d value would be -1.00. Both of these circumstances are rare, and you will probably never see a value of 1.00. The range of values for the item discrimination index is -1.00 to 1.00.
A discrimination value of > 0.3 is good, the range of 0.1 to 0.3 is fair, but anything below 0.1 is poor and definitely worth improving. A negative correlation (even a small one) is contrary to the logic and the items with such scores must be revisited.
The 20% bracket is a common one, but brackets of 25% or 27% are not uncommon (e.g. top 25% vs bottom 25%).
Calculating Discrimination index (Question no. 1)
Discrimination for option ‘A’
13 of the top 133 candidates and 24 of the bottom 133 chose option ‘A’. So, the discrimination for option ‘A’ is: d = (13 - 24) / 133 = -0.083
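The same calculation as a minimal Python sketch. The function name is illustrative; the counts are the ones from the example for option ‘A’.

```python
def discrimination_index(upper_count, lower_count, group_size):
    """d = (choices in the upper group - choices in the lower group) / group size."""
    return (upper_count - lower_count) / group_size


# Option 'A': chosen by 13 of the top 133 and 24 of the bottom 133 candidates.
print(round(discrimination_index(13, 24, 133), 3))   # -0.083

# The same idea in proportions: top group 80% correct, bottom group 35% correct.
print(round(0.80 - 0.35, 2))                         # 0.45
```

A negative value for a distracter such as ‘A’ is what you would hope to see: weaker candidates chose it more often than stronger ones.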
Discrimination for each of the options
Values of Discrimination index
- d varies from +1 to -1.
- The discrimination value of the correct answer should be greater than +0.25.
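When only raw results are available, the upper and lower groups have to be formed first. The sketch below assumes candidates are ranked by total test score and that the same fraction is taken from each end. The data, the function names and the handling of ties (original order is kept) are assumptions; the rating bands are the ones given earlier (above 0.3 good, 0.1 to 0.3 fair, below 0.1 poor, negative to be revisited).

```python
def upper_lower_discrimination(total_scores, chose_option, fraction=0.2):
    """Discrimination of one option using top and bottom groups of equal size.

    total_scores: total test score of each candidate
    chose_option: True/False per candidate, in the same order
    """
    n = len(total_scores)
    group_size = max(1, round(n * fraction))
    # Rank candidate indices from the highest total score to the lowest.
    ranked = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    upper, lower = ranked[:group_size], ranked[-group_size:]
    upper_count = sum(chose_option[i] for i in upper)
    lower_count = sum(chose_option[i] for i in lower)
    return (upper_count - lower_count) / group_size


def rate_discrimination(d):
    """Label a discrimination value using the bands described in the text."""
    if d < 0:
        return "negative: revisit"
    if d < 0.1:
        return "poor"
    if d <= 0.3:
        return "fair"
    return "good"


# Hypothetical data: 10 candidates, whether each answered the item correctly.
scores = [9, 8, 8, 7, 6, 5, 4, 3, 2, 1]
correct = [True, True, True, True, False, True, False, False, True, False]

d = upper_lower_discrimination(scores, correct, fraction=0.2)
print(d, rate_discrimination(d))   # 0.5 good
```

Changing `fraction` to 0.25 or 0.27 switches to the other commonly used brackets.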
Relationship between facility index and discrimination index
When the facility index is at either extreme (p = 1 or p = 0), the question cannot discriminate between students of different ability. If a question is answered correctly by everyone, that question alone tells you nothing about who is the better or worse student. Similarly, if a question is answered correctly by no one, it again gives you no way to distinguish between the students.
In general, as the facility index increases from 0, the ability of the question to discriminate increases. It is at its maximum when the p-value is between 0.5 and 0.7. Beyond a p-value of 0.7, it starts to decline again until it becomes nil when the p-value reaches 1.
Lower bound for item difficulty
From the above explanation, we know that it is not a good idea to have too many very difficult questions in the paper (with a p-value below a given bound), since these items will not help us discriminate among the test takers. You can find the lower bound for the p-value using the formula below.
$$p_{\text{lower}} = \frac{1}{k} + 1.645\sqrt{\frac{\frac{1}{k}\left(1 - \frac{1}{k}\right)}{n}}$$
where k = number of MCQ questions; n = number of students
For example, where k = 10 and n = 100, the lower bound = 0.15 (approx.)
*Lower bound for the p-value for an exam like CAT for IIMs (200 MCQs, about 1 million students) = 0.005. So, now you know why at least some of the questions in that test are so difficult (only around 5,000 of the roughly 1 million students are expected to solve them!)
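A minimal sketch of the lower-bound calculation, assuming the bound has the form 1/k + z * sqrt((1/k)(1 - 1/k) / n) with z = 1.645. This form is an assumption, chosen because it reproduces both values quoted above (about 0.15 and about 0.005). The function name is illustrative.

```python
import math


def lower_bound_p(k, n, z=1.645):
    """Assumed lower bound for the p-value: 1/k plus z standard errors."""
    base = 1 / k
    return base + z * math.sqrt(base * (1 - base) / n)


print(round(lower_bound_p(10, 100), 2))          # 0.15
print(round(lower_bound_p(200, 1_000_000), 3))   # 0.005
```

With a very large n the second term almost vanishes, so the bound is close to 1/k.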
A quick glance at the Item Bank coverage and health
A cross tabulation of facility and discrimination values (see table below) is a very good way to know the coverage as well as the health of the Item Bank. The numbers at the intersections (n1, n2 etc.) represent the number of questions in the item bank meeting both criteria. The row in yellow highlight should have no questions, or only a very small number, at any time. The row in blue should have a large number of questions. There should be enough questions representing each of the difficulty levels.
Please note that the division among Low, Medium and High levels of difficulty and Poor, Fair and Good levels of discrimination is somewhat subjective; the underlying variables are continuous.
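Such a cross tabulation can be sketched as below. The layout (discrimination bands as rows, difficulty bands as columns), the difficulty cut-offs of 0.3 and 0.7, and the item data are all assumptions for illustration; the discrimination bands follow the values given earlier (below 0.1 poor, 0.1 to 0.3 fair, above 0.3 good).

```python
DIFFICULTY = ("Low", "Medium", "High")
DISCRIMINATION = ("Poor", "Fair", "Good")


def difficulty_band(p, easy_above=0.7, hard_below=0.3):
    """Assumed cut-offs; a high p-value means an easy (low difficulty) item."""
    if p > easy_above:
        return "Low"
    if p < hard_below:
        return "High"
    return "Medium"


def discrimination_band(d):
    """Bands from the text: below 0.1 poor, 0.1 to 0.3 fair, above 0.3 good."""
    if d < 0.1:
        return "Poor"
    if d <= 0.3:
        return "Fair"
    return "Good"


def cross_tabulate(items):
    """Count items per (discrimination band, difficulty band) cell."""
    table = {row: {col: 0 for col in DIFFICULTY} for row in DISCRIMINATION}
    for p, d in items:
        table[discrimination_band(d)][difficulty_band(p)] += 1
    return table


# Hypothetical (p, d) pairs for six items in the bank.
items = [(0.85, 0.35), (0.55, 0.42), (0.60, 0.20),
         (0.20, 0.05), (0.25, 0.31), (0.65, -0.10)]
table = cross_tabulate(items)

print(f"{'':8}" + "".join(f"{col:>8}" for col in DIFFICULTY))
for row in DISCRIMINATION:
    print(f"{row:8}" + "".join(f"{table[row][col]:>8}" for col in DIFFICULTY))
```

Because the bands are subjective, the cut-offs are kept as parameters or in one place so they can be adjusted.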
Over the lifecycle of a question, the quality control effort is greatly assisted if you understand how to use these quantitative methods, and your Item Bank can be kept in top health, always ready for the next exam.
Annexure: Live data from Item Analysis and interpretation:
- Based on the discrimination values, the correct key for Item No. 5 should be D (which has a positive value), whereas it is marked as C (which has a negative value). This must be revisited: either the key is mistyped or the question is of bad quality.
- Similarly, for Item No. 4, since the discrimination value of B > D, B seems to be the correct key. Here again, the key must be checked for a typing mistake; if that is not the case, the question must be sent back for review and improvement.
- Based on the p-values, only 1 (Item No. 3) of these 10 questions is of medium difficulty; the rest are all of high difficulty. There is not a single low difficulty question.
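The key check described in the first two points can be sketched as a simple rule: flag an item when some other option discriminates better than the marked key. The function name and the per-option values below are hypothetical and are not the live data from the analysis.

```python
def suggest_key(option_d, marked_key):
    """Return an option that discriminates better than the marked key, else None."""
    best = max(option_d, key=option_d.get)
    if best != marked_key and option_d[best] > option_d[marked_key]:
        return best
    return None


# Hypothetical discrimination values per option; the key is marked as "C".
suspect_item = {"A": -0.05, "B": -0.02, "C": -0.12, "D": 0.21}
print(suggest_key(suspect_item, "C"))   # D

# Hypothetical healthy item: the marked key "B" has the highest value.
healthy_item = {"A": -0.10, "B": 0.35, "C": -0.15, "D": -0.08}
print(suggest_key(healthy_item, "B"))   # None
```

A flagged item still needs a human check, since the cause may be a mistyped key or a genuinely poor question.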
I have been accumulating quite a bit of domain knowledge and now I think it is time to share. This is one article in the series; more to come...
Neeraj
Product Management & Marketing