In the previous article, we began our discussion of the valuable information available through conducting an item analysis, and we focused on the two most readily available pieces of information. First, of course, is the Difficulty Index. Just as the name implies, this index is an indicator of the difficulty of the item. It is expressed as a percentage and reflects the proportion of candidates who got the item right out of the total number of candidates who responded to it. That is, if nine out of ten respondents answered an item correctly, the index would be .9, or 90%. From this illustration, we can also see that the Index actually has an inverse relationship with the difficulty of the item: the higher the index, or percentage, the easier the item.
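To make that arithmetic concrete, here is a minimal sketch in Python, assuming item responses have already been scored 0 for wrong and 1 for right; the function name is my own illustration, not part of any particular item analysis package:

```python
# A minimal sketch of the Difficulty Index (proportion correct),
# assuming responses are already scored 0 (wrong) / 1 (right).
def difficulty_index(item_scores):
    """Proportion of candidates who answered the item correctly."""
    return sum(item_scores) / len(item_scores)

# Example from the article: nine of ten respondents answer correctly.
item_scores = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
print(difficulty_index(item_scores))  # 0.9, i.e. 90%
```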
The second index we discussed was the Item Discrimination Index. Essentially, this index reflects how the candidates who performed best on the test responded to a specific item compared to how the candidates who performed worst on the test responded to that same item. The top 27% of test performers and the bottom 27% of test performers are used to calculate the Discrimination Index, and it is expressed as the difference between the proportion of the top group that answered the item correctly and the proportion of the bottom group that answered it correctly.
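For readers who want the upper-lower calculation spelled out, here is a hedged sketch in Python. The 27% cutoff and the difference-of-proportions formula follow the common upper-lower method described above; the data and names are invented for illustration:

```python
# A sketch of the upper-lower (27%) Discrimination Index.
# Each candidate is a (total_test_score, item_score) pair, item_score coded 0/1.
def discrimination_index(candidates):
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    n_group = max(1, round(len(ranked) * 0.27))      # size of the 27% groups
    top, bottom = ranked[:n_group], ranked[-n_group:]
    p_top = sum(item for _, item in top) / n_group
    p_bottom = sum(item for _, item in bottom) / n_group
    return p_top - p_bottom                          # ranges from -1.0 to +1.0

# Example: 10 candidates, so the top 3 and bottom 3 are compared.
data = [(95, 1), (90, 1), (88, 1), (85, 1), (80, 0),
        (76, 1), (70, 0), (65, 0), (60, 1), (55, 0)]
print(discrimination_index(data))  # 1.0 - 0.33... = about 0.67
```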
These are both valuable measures for evaluating item performance because of the information they provide and because they are, by design, gross measures that are straightforward to interpret. In evaluating an exam through an item analysis, any items with Difficulty levels indicating that almost everyone is getting them right or wrong need to be flagged for scrutiny. Likewise, any items with Discrimination levels indicating that the top 27% did not perform significantly better than the bottom 27% need to be flagged for scrutiny. In particular, any items with a negative index, indicating that poorer performers answered the item correctly more often than top performers, need to be fixed or removed.
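As an illustration only, a screening routine along these lines might look like the sketch below; the specific cutoffs are common rules of thumb, not values prescribed by this article, and should be adjusted to the testing program:

```python
# Illustrative screening rules; the cutoffs (.20, .90, .20) are assumptions.
def flag_item(difficulty, discrimination,
              min_p=0.20, max_p=0.90, min_d=0.20):
    flags = []
    if difficulty >= max_p:
        flags.append("very easy: nearly everyone answered correctly")
    if difficulty <= min_p:
        flags.append("very hard: nearly everyone answered incorrectly")
    if discrimination < 0:
        flags.append("negative discrimination: fix or remove (possible miskey)")
    elif discrimination < min_d:
        flags.append("weak discrimination: top group barely beat bottom group")
    return flags

print(flag_item(difficulty=0.95, discrimination=-0.10))
```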
That is truly the value of an item analysis for test developers: it provides information on how items performed and clues as to how poorly performing items may be improved. For example, an item with a negative discrimination index may actually be miskeyed. This is where the additional information provided by computer programs is particularly valuable; their printouts detail the responses of all test takers who actually answered the item, along with the number who did not respond. Beyond using the “gross measures” that serve to identify ineffective or bad items, this molecular view of each item provides information that can assist in developing theories as to what was actually going on with a test item when test takers viewed it. We will discuss that more in the final article on item analysis. At this point, we will continue our review of other statistics provided by item analysis and their value in evaluating item and exam performance.
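To show what that molecular view looks like, here is a small, invented example of tallying the responses of the top and bottom groups for one item; a real item analysis program would print a similar breakdown for every item and option:

```python
# Invented data illustrating a per-option response tally for one item.
from collections import Counter

top_group    = ["B", "B", "B", "A", "B", "C", "B"]   # responses, top 27%
bottom_group = ["A", "C", "D", "A", "B", "A", ""]    # "" = omitted the item
keyed_answer = "A"

print("Top:   ", Counter(top_group))
print("Bottom:", Counter(bottom_group))
# If the top group clusters on an option other than the key ("B" here),
# the item may be miskeyed, which is exactly the pattern a negative
# discrimination index hints at.
```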
The Difficulty Index of each individual item bears on the Mean of the test, which is the arithmetic average score, and tells the test developer how that item is contributing to that Mean. Again, the beauty of the Difficulty Index is its simplicity: it is easy to use and easy to interpret. Items with high indexes, indicating that almost everyone passed the item, will serve to raise the Mean of the test. A test dominated by such items is easy and does not provide the discrimination in performance among test takers necessary to be truly useful. On the other hand, if the Difficulty Index is very low, the item is difficult and will tend to depress the Mean of the test. A test heavily populated with difficult items is just as problematic as a test that is too easy. It is possible that the test sample represents a group of weak candidates, but it is more likely that the items were written at a level beyond what is necessary for successful job performance. Discerning the difference is part of the art of test writing and will be covered in more detail in the next article.
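One way to see the connection is that, when items are scored 0/1 and the test score is simply the number of items answered correctly, the test Mean equals the sum of the item Difficulty Indexes, so easy items pull the Mean up and hard items pull it down. The short demonstration below uses invented data to illustrate that relationship:

```python
# Rows = candidates, columns = items; invented 0/1 data.
scores = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
]
n = len(scores)
difficulties = [sum(col) / n for col in zip(*scores)]   # p per item
mean_score = sum(sum(row) for row in scores) / n        # mean number correct

print(difficulties)        # [1.0, 0.5, 0.25, 0.75]
print(sum(difficulties))   # 2.5
print(mean_score)          # 2.5, identical to the sum of the p values
```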
Two common statistics associated with the Discrimination Index are the Biserial Correlation and the Point Biserial Index. Textbooks indicate that these measures are time consuming and difficult to compute by hand and that they are also somewhat difficult to interpret. The Biserial Correlation is intended to provide a measure of an item's validity, and the Point Biserial Index shows how performance on an item correlates with performance on the test as a whole. In the past, the computation and interpretation of these indexes made them less than ideal for the average practitioner developing tests. Now, commercially available item analysis programs routinely include calculations for them. Still, their formulas and the interpretation of their results are much more difficult and less intuitively apparent than those of the Difficulty and Discrimination Indexes.
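For practitioners who want to experiment, the Point Biserial Index can be approximated as the ordinary correlation between a 0/1 item score and the total test score, as in the sketch below. Many programs refine this by removing the item from the total before correlating; that refinement is omitted here, and the data are invented:

```python
# Point biserial sketch: Pearson correlation between a 0/1 item score
# and the total test score. Requires Python 3.10+ for statistics.correlation.
from statistics import correlation

item_scores  = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]            # 0/1 per candidate
total_scores = [95, 90, 88, 85, 80, 76, 70, 65, 60, 55]  # total test scores

r_pb = correlation(item_scores, total_scores)
print(round(r_pb, 2))   # positive values mean the item tracks overall performance
```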
If test practitioners focus on creating items based on a thorough job analysis and on making sure those items accurately measure the knowledge and abilities identified in that job analysis, they can have a fair amount of confidence in the validity of their items. That confidence can be strengthened by having items reviewed by subject matter experts (SMEs) to determine whether their content, difficulty level, and reading level are appropriate for inclusion in the written instrument. Once the exam has been developed and the appropriate reviews have been conducted, the Difficulty Index and the Discrimination Index can serve as very effective measures for evaluating performance, as well as an effective guide for improving test performance, which will be the focus of the next article.
This is part two of a three-part series on item analysis in public safety departments. Part 3 will be available on Feb. 29, 2012. In case you missed it, check out Robert Burd’s previous series, Succession Planning in Public Safety.