In the two previous articles, we looked at the statistical and technical aspects of item analysis. Individual test developers will view the statistical computations and their value differently, depending on their knowledge of statistics and their understanding of its application. However, a test developer or test user with a rudimentary understanding of item analysis can still make accurate decisions regarding the effectiveness of test items and, therefore, of written exams. As we emphasized previously, IPMA-HR conducts item analyses on potential test items in its test development process and maintains item analysis data from successive administrations of all exams. These practices ensure that only items that perform well continue to be used, and they reflect a standard that all test developers should employ. Also note that our discussion focuses on typical four-response multiple-choice items and true-false items.
Effective use of item analysis information for item and test revision is where science meets art. This process of “cleaning” up test items and tests requires three things: the basic information from an item analysis, effective analysis of the applicant response data, and application of what is known about developing good test items. There is an extensive amount of scholarly information available on item writing as well as response theory, and the effective practitioner should take the time to review some of this information prior to writing or “repairing” test items. It should also be noted that even though this article focuses on actual test developers, the information can also be extremely valuable for those who purchase or lease tests, since it can assist them in evaluating the quality of tests they are considering.
As indicated in the previous articles, the first review of item analysis data should look at Item Difficulty and Item Discrimination. Item difficulty ideally should range from .4 to .6 with items below .3 and above .7 being flagged for review. Items with a negative Discrimination Index and items with an Index below .3 should also be flagged for review. Once this review has been completed, the test developer has the opportunity to apply analytical abilities in evaluating the item response patterns. Computer programs that conduct item analyses and provide the actual number of test takers that selected each possible response are particularly helpful in this process.
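The first-pass review described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `flag_item` and the sample item values are hypothetical, and the treatment of values exactly at .3 or .7 (not flagged here) is a choice the test developer would need to make.

```python
def flag_item(difficulty, discrimination):
    """Return the reasons an item should be reviewed (empty list if none).

    Thresholds follow the article: difficulty below .3 or above .7,
    and a Discrimination Index that is negative or below .3.
    """
    reasons = []
    if difficulty < 0.3:
        reasons.append("too difficult")
    elif difficulty > 0.7:
        reasons.append("too easy")
    if discrimination < 0:
        reasons.append("negative discrimination")
    elif discrimination < 0.3:
        reasons.append("low discrimination")
    return reasons


# Hypothetical item statistics: (Item Difficulty, Discrimination Index)
items = {
    "Q1": (0.55, 0.42),
    "Q2": (0.85, 0.18),
    "Q3": (0.25, -0.10),
}

for name, (difficulty, discrimination) in items.items():
    reasons = flag_item(difficulty, discrimination)
    print(name, ", ".join(reasons) if reasons else "no flags")
```

Items that come back with one or more reasons are the ones whose response patterns deserve the closer look discussed in the paragraphs that follow.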
In reviewing items with Difficulty Indexes of .7 and above, which indicate that the item is too easy because 70% or more of test takers answered it correctly, the test developer should be alerted that either the correct response is too transparent or obvious, or the distractors (incorrect responses) do not represent viable answers. In looking at the distractors, it is important to see whether one was selected more frequently than the others and, if so, to use the wording of this distractor as a template for the other distractors. In that regard, all potential answers should be worded similarly, represent the same reading level, include common terms and be of similar length. Essentially, the key to writing items at an appropriate difficulty level is to ensure that the item accurately reflects knowledge needed to perform the job and is written at the same level as the required knowledge, with all distractors appearing as plausible as the keyed answer. Reading the stem and following it with each possible answer will give insight into whether each distractor is plausible. This is particularly valuable for double-checking grammar and subject-verb agreement, since any distractor that obviously doesn’t agree with the stem is a throwaway.
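When the item analysis program reports how many test takers chose each response, the distractor review can be sketched as follows. The function name `distractor_report`, the sample counts, and the 5% cutoff for a "weak" distractor are all assumptions made for illustration; the article does not prescribe a specific cutoff.

```python
def distractor_report(counts, key, min_share=0.05):
    """Summarize how the distractors of one item performed.

    counts    -- dict mapping each option to the number of test takers who chose it
    key       -- the keyed (correct) option
    min_share -- assumed cutoff: distractors chosen by a smaller share are "weak"
    """
    total = sum(counts.values())
    distractors = {opt: n for opt, n in counts.items() if opt != key}
    # The most frequently chosen distractor is a candidate wording template.
    template = max(distractors, key=distractors.get)
    # Distractors almost nobody chose are probably not viable answers.
    weak = sorted(opt for opt, n in distractors.items() if n / total < min_share)
    return {
        "difficulty": counts[key] / total,
        "template": template,
        "weak": weak,
    }


# Hypothetical response counts for an easy item keyed "B"
counts = {"A": 6, "B": 160, "C": 30, "D": 4}
report = distractor_report(counts, key="B")
print(report)
```

In this made-up example, option C attracts most of the incorrect responses, so its wording would be the starting point for rewriting options A and D.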
Experience has shown that item difficulty can be increased by revising items to incorporate five options instead of four, with the fifth being “all of the above” or “none of the above.” Similarly, items can be revised to incorporate multiple answers, with the keyed response indicating, for example, that “both A and B are correct” or some similar variation. Conversely, items written in this format that proved to be too difficult can be revised to eliminate the options of “none of the above” or “all of the above.” Note too that in order for these options to be truly effective, there must be some items where the keyed response is “all of the above” or “none of the above.” Writing and keying items in this fashion can increase the difficulty of a test, so it is important to continue to evaluate their performance through ongoing item analysis procedures in the manner practiced by IPMA-HR. As indicated previously, it is often necessary to continue to gather item analysis information, since the stability of the results, and thus their reliability, increases as the number of test takers increases. Most test developers endorse a sample of one hundred or more as a good number for establishing the reliability of item analysis results and recommend caution in applying item analysis with fewer test takers than this.
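One way to see why results stabilize as the number of test takers grows is to look at the standard error of a proportion. The sketch below is a general statistical approximation rather than something taken from this series, and it assumes test takers respond independently; the function name `difficulty_standard_error` is hypothetical.

```python
import math


def difficulty_standard_error(p, n):
    """Approximate standard error of an Item Difficulty value.

    p -- observed proportion of test takers answering correctly
    n -- number of test takers
    """
    return math.sqrt(p * (1 - p) / n)


# An item of middling difficulty (.5) at three sample sizes
for n in (25, 100, 400):
    print(n, f"{difficulty_standard_error(0.5, n):.3f}")
# prints:
# 25 0.100
# 100 0.050
# 400 0.025
```

With 25 test takers, an observed difficulty of .5 could easily drift across the .4 to .6 range on the next administration; with 100 or more, the estimate is considerably steadier.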
Also note that there may be times when some “easy” items are intentionally included in a written exam. Some test developers use such items to help build confidence in test takers. If this is part of the test developer’s goal, easier items should be grouped together at the beginning of a test or test section. In addition, items that would be considered easy may be appropriate where a test is used to determine mastery rather than ranking. For example, if a position within your agency requires knowledge of basic math to perform effectively and a threshold level can be established, then everyone who meets that threshold would pass the test. Candidates would not be ranked, since greater knowledge of math does not predict greater job performance, but the test would still be used even with relatively “easy” items because basic math is critical to the performance of the job.
Review of items with a Difficulty Index below .3, which indicates that an item is rather difficult because 30% or fewer candidates answered it correctly, is akin to the review of items that are too easy. Again, the stem and possible answers must be reviewed to ensure that the keyed answer is correct, and also to check the level of the knowledge being measured and the congruency of the stem with the possible answers. Response patterns can provide insight here as well, particularly if there are distractors that garner an overwhelming percentage of the responses. These distractors need to be reviewed to develop theories as to what makes them so attractive and what can be done to reduce their level of plausibility.
As with items that would normally be considered too easy, there may be times when items considered difficult play a role in developing a test. Tests that are used for ranking need to provide a spread of candidates, and difficult items can contribute to this spread by raising the ceiling of the test. Difficult items can therefore be valuable as long as they test knowledge where more is better and they are not written at a level beyond what is required for the job.
In reviewing items with a negative Discrimination Index, that is, items that the test takers with the lowest test scores answered correctly more often than those with the highest test scores, one of the first things that should be checked is the keyed answer. Miskeyed items are frequently the cause of a negative Discrimination Index.
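A miskey check can be sketched by comparing how the highest and lowest scorers responded to each option. The sketch below assumes an upper-group versus lower-group comparison using the top and bottom 27% of total scores, which is a common convention but may differ from the Discrimination Index computation used in the earlier articles. The function names and the sample data are hypothetical.

```python
def discrimination_by_option(responses, total_scores, fraction=0.27):
    """For each option, the share of the high-scoring group that chose it
    minus the share of the low-scoring group that chose it.

    responses    -- the option each test taker chose on this item
    total_scores -- each test taker's total test score, in the same order
    fraction     -- assumed size of the high and low groups (27% each)
    """
    n = len(responses)
    group_size = max(1, round(n * fraction))
    # Rank test takers from lowest to highest total score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    low = [responses[i] for i in order[:group_size]]
    high = [responses[i] for i in order[-group_size:]]
    return {
        opt: (high.count(opt) - low.count(opt)) / group_size
        for opt in sorted(set(responses))
    }


def suspect_miskey(responses, total_scores, key):
    """Return the option that looks like the real key, or None.

    An item is suspect when the keyed option discriminates negatively
    while some other option discriminates positively.
    """
    by_option = discrimination_by_option(responses, total_scores)
    best = max(by_option, key=by_option.get)
    if by_option.get(key, 0.0) < 0 and best != key and by_option[best] > 0:
        return best
    return None


# Hypothetical data: ten test takers, item keyed "B"
total_scores = [95, 90, 88, 75, 70, 65, 60, 45, 40, 35]
responses = ["C", "C", "C", "B", "C", "B", "A", "B", "B", "B"]

print(discrimination_by_option(responses, total_scores))
print(suspect_miskey(responses, total_scores, key="B"))
```

In this made-up data the top scorers all chose C while the bottom scorers all chose the keyed B, so the check points to C as the likely correct answer. A result like this is only a prompt to reread the item and its source material, not proof of a miskey.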
Typically, a low Discrimination Index results from items that encourage good test performers to overanalyze a question and therefore see one or more of the distractors as being as plausible as the keyed answer. This can also occur when items are relatively easy and lead good performers to assume that there must be more to the item than is seen at first glance. Again, correcting these items involves reviewing answer patterns and developing hypotheses regarding what led to the observed results. Once the hypotheses are developed, the guidelines for writing good test items can be applied to overcome the inadequacies of the items.
Essentially, improving test quality involves trial administrations of the test, analysis of the items, development of hypotheses about what went wrong with an item, and application of the correct item development principles to overcome the issues with the item. Being a good test developer involves writing items that apply what we know about good items and then rewriting items based on information obtained from an item analysis. As stated previously, this is as much art as it is science, and the ability to write good tests only comes through practice and research.
Mr. Burd’s next series will begin in March, covering topics such as successive hurdles, test weighting and certification rules.