Any organization using written exams as part of their selection processes that doesn’t take the time to review the information provided by an item analysis is overlooking a treasure trove of information. Without performing an item analysis on a written exam and acting upon the information gained from such an analysis, a jurisdiction truly does not know how that exam is performing.
Two fundamental concepts underlie test utility: validity and reliability. Simply put, validity refers to whether a test measures what it is intended to measure, and reliability refers to how consistently it does so. Item performance directly impacts both concepts. Test items are the basic elements for gathering information about test participants. For a test to have the validity and reliability needed to make it worth using, its items must perform optimally, gathering the best information possible as accurately as possible. Test developers who do not use the information available from an item analysis have no idea how their items are performing. If items perform poorly, all other statistical analyses become less meaningful and the test is not suited to its intended purpose. Hiring decisions based on such a test are questionable, since they become another layer of what is ultimately a house of cards without an adequate foundation. In that case, the written exam has not played its role in optimizing accurate selections and supporting the mission of the organization.
An item analysis gives test users and developers a sort of molecular view of their exams and provides critical information that can be used to maximize the effectiveness of their tests. A word of caution, however: test developers should be careful not to assign too much importance to an item analysis derived from a small number of test takers. Typically, a sample of one hundred is considered the minimum for establishing reliable measures. It is also worth noting that IPMA-HR sets a good example in its use of item analysis: it analyzes test items during the development of new tests to ensure that only the best-performing items are included. Further, it continues to evaluate item performance using data compiled from customer administrations of each test, which helps ensure that the best-performing items are used and that the effectiveness of its tests is maintained at a high level.
Despite the vast amount of information on item analysis and the number of different ways of analyzing the data available from studying individual items, there are two basic pieces of information that specialists in the field agree upon and even non-statisticians can understand and apply. Specifically, these are item difficulty and the discrimination index.
Item difficulty is as straightforward as it sounds. It tells how difficult the item was for those who took the test. It is expressed as a proportion and reflects the percentage of candidates who answered the item correctly. That is, if 30 candidates respond to a test item and 20 of them select the correct answer, the Item Difficulty Index is approximately .67. The Difficulty Index is expressed as p, and the formula is:

p = (number of candidates answering the item correctly) ÷ (number of candidates responding to the item) = 20 ÷ 30 ≈ .67
It is important to recognize that p represents the percentage of correct responses, so it is inversely related to the actual difficulty of the item. In other words, the higher p is, the easier the item. Therefore, a p of .90 indicates an easy item that almost everyone got right, and a p of .10 represents a difficult item that almost everyone got wrong. Neither of these items would be particularly desirable on a test, since neither provides the discrimination among candidates' performance that a written exam should deliver.
Item Difficulty levels of .5 yield the maximum discrimination among candidates' performance, since they indicate that half of the candidates got the item correct and half got it incorrect. Ideally, item difficulty should range between .4 and .6, with items falling outside the .3 to .7 range being marked for review and possible revision.
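The difficulty calculation and the review guideline above can be sketched in a few lines of Python. This is a minimal illustration; the function names and the .3/.7 flagging thresholds follow the guideline in this article, not any particular testing package.

```python
# Item difficulty: the proportion of candidates who answered an item correctly.

def item_difficulty(responses, keyed_answer):
    """Return p, the proportion of responses matching the keyed (correct) answer."""
    if not responses:
        raise ValueError("no responses to score")
    correct = sum(1 for r in responses if r == keyed_answer)
    return correct / len(responses)

def flag_for_review(p, low=0.3, high=0.7):
    """Flag items outside the .3 to .7 range for review, per the guideline above."""
    return p < low or p > high

# Worked example from the article: 20 of 30 candidates choose the keyed answer.
answers = ["B"] * 20 + ["A"] * 6 + ["C"] * 4
p = item_difficulty(answers, "B")
print(round(p, 2))          # 0.67
print(flag_for_review(p))   # False -- within the acceptable range
```

In practice this would be run once per item, with the flagged items set aside for editing or replacement.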
The Item Discrimination Index measures how well individual items contribute to the measurement of the knowledge or ability the test is intended to measure. Since this measure is affected by the homogeneity of test items, each segment of the test should be considered separately. That is, a portion of a test intended to measure ability to perform mathematical computations should not have a Discrimination Index computed along with the portion of a test intended to measure knowledge of grammar.
The Item Discrimination Index is intended to reflect the relationship between performance on an item and performance on the total test. Based on the way this index is computed, it can also be thought of as telling the test developer how those who did well on the test performed on the item compared to those who did poorly on the test. This index is computed by dividing the individuals who took the test into three groups: the bottom 27%, the top 27%, and the middle 46%. Since calculations are not done on the middle group, it is often said that the test takers are divided into two groups, top and bottom; but since the computations use only the top and bottom 27%, a middle group is automatically created.
Fortunately, numerous computer programs are available for identifying the top and bottom 27% and determining the Discrimination Index, such as the University of Maryland's Scantron scoring software (available from IPMA-HR by using our scoring service to score your tests). Since this information is so readily available, it is important to understand it and be able to apply it to improving the performance of written exams. Essentially, the calculations these programs perform involve selecting the top and bottom 27% of test takers and then comparing the number of correct responses in each group.
Based on a test sample of 100, there would be 27 candidates in both the top and bottom groups. If 20 candidates in the top group answered the item correctly and 9 in the bottom group answered it correctly, we would subtract 9 from 20 and divide the result by the 27 in each group. The result would be a Discrimination Index for that item of approximately .40. That is:

D = (20 − 9) ÷ 27 = 11 ÷ 27 ≈ .40
Eleven divided by twenty-seven equals approximately .40, which in this case indicates that the item was answered correctly by about 40 percentage points more of the top group than of the bottom group. The Item Discrimination Index ranges from -1.00 to +1.00. That is, the index can range from everyone in the bottom group getting the item right and everyone in the top group getting it wrong, to everyone in the top group getting the item right and everyone in the bottom group getting it wrong. Expressed this way, it can be seen how the index takes candidates' overall test performance into account by comparing how the best performers on the test did on the item with how the worst performers did.
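The upper/lower 27% procedure described above can be sketched in Python as follows. This is an illustrative implementation under stated assumptions: the function name is invented, each 27% group is sized by rounding, and ties in total score are broken by candidate order.

```python
# Discrimination index via the upper/lower 27% method.

def discrimination_index(total_scores, item_correct):
    """
    total_scores: each candidate's total test score.
    item_correct: parallel list of booleans -- did that candidate get this item right?
    Returns D = (correct in top 27% - correct in bottom 27%) / group size.
    """
    n = len(total_scores)
    k = max(1, round(n * 0.27))  # size of each 27% group
    # Rank candidates from lowest to highest total score.
    ranked = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = ranked[:k], ranked[-k:]
    upper_correct = sum(item_correct[i] for i in upper)
    lower_correct = sum(item_correct[i] for i in lower)
    return (upper_correct - lower_correct) / k

# Small synthetic example: 10 candidates, so each group holds 3.
scores       = [10, 9, 9, 8, 8, 3, 3, 2, 2, 1]
item_correct = [True, True, True, True, False, False, True, True, False, False]
print(round(discrimination_index(scores, item_correct), 2))  # 0.67

# Replicating the worked example in the text: (20 - 9) / 27, roughly .4.
print(round((20 - 9) / 27, 1))  # 0.4
```

With real data, the same function would simply be called once per item against the full roster of total scores.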
Items with negative Discrimination Indices need to be eliminated if the problem with the item cannot be diagnosed and remedied. Item Discrimination Indices above .30 are generally considered acceptable, with items below that needing careful scrutiny for edits that could improve their performance. In the two articles that follow, we will look at the additional information available from a typical item analysis and discuss how it can be used to improve test performance.
This is part one of a three-part series on item analysis in public safety departments. Part 2 will be available to read on the Assessment Services Review on Feb. 22, 2012. Part 3 will be available on Feb. 29, 2012. In case you missed it, check out Robert Burd’s previous series, Succession Planning in Public Safety.